Github user frreiss commented on the issue: https://github.com/apache/spark/pull/13513

Ah, now I fully understand @zsxwing's earlier comment about the semantics of `Source.getBatch()`. Those semantics have a design flaw; see the email thread I started at http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-tt18551.html. Basically, it is impossible to implement a Source to the written API spec without keeping unbounded state. I have an open PR to fix this problem at https://github.com/apache/spark/pull/14553. In the short run, I think @jerryshao's changes here are OK with respect to `Source.getBatch`. The approach in this PR will work as long as the internal structure of the `StreamExecution` class does not change and as long as Spark never has to recover from an outage longer than the compaction interval. The recent changes to `FileStreamSource` under SPARK-17165 (https://github.com/apache/spark/pull/14728) have the same problem, and those changes are already committed.
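To make the unbounded-state issue concrete, here is a minimal toy sketch (not Spark's actual `Source` implementation; `ToySource` and its buffer are hypothetical names). Because the written contract allows `getBatch(start, end)` to be called with any previously returned offset range, including after a restart, a conforming source can never safely discard old data without a separate commit signal:

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of the Source contract's problem: getBatch may be asked to
// replay any past offset range, so `records` can only ever grow.
class ToySource {
  private val records = ArrayBuffer[String]() // unbounded: nothing may be trimmed

  def addRecord(r: String): Unit = records += r

  // Highest offset available, analogous to Source.getOffset.
  def getOffset: Long = records.size

  // Per the written spec, any (start, end] range must be servable at any
  // time, so the source cannot know which prefix is safe to drop.
  def getBatch(start: Long, end: Long): Seq[String] =
    records.slice(start.toInt, end.toInt).toSeq
}
```

A commit-style callback (as proposed in the linked PR) would tell the source the low-water mark below which `records` could be truncated, bounding the state.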