[ https://issues.apache.org/jira/browse/SPARK-30866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun resolved SPARK-30866.
-----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 27620
[https://github.com/apache/spark/pull/27620]

> FileStreamSource: Cache fetched list of files beyond maxFilesPerTrigger as
> unread files
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-30866
>                 URL: https://issues.apache.org/jira/browse/SPARK-30866
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>             Fix For: 3.1.0
>
>
> FileStreamSource fetches the list of available files for every batch, which
> is a heavy operation. (For example, listing the leaf files of 95 paths
> containing 674,811 files took around 5 seconds, and that was on a local
> filesystem, not even an HDFS path.)
>
> If "maxFilesPerTrigger" is not set, Spark consumes all of the fetched files
> in one batch, so once that batch completes it has no choice but to fetch
> again for the next micro-batch.
>
> If "latestFirst" is true (regardless of "maxFilesPerTrigger"), the set of
> files to process must be recomputed for every batch, so Spark again has to
> fetch per micro-batch.
>
> In the remaining case (maxFilesPerTrigger is set and latestFirst is false),
> the files to process are "continuous": Spark can cache the fetched list of
> files and consume from it across micro-batches until the list is exhausted.
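For illustration, here is a minimal sketch of the caching idea described above, assuming a hypothetical CachedFileList class and fetchAllFiles helper; these names are illustrative and are not the actual FileStreamSource internals:

{code:scala}
// Hypothetical sketch of the caching strategy; CachedFileList and
// fetchAllFiles are illustrative names, not Spark internals.
class CachedFileList(fetchAllFiles: () => Seq[String], maxFilesPerTrigger: Int) {
  // Unread remainder of the last expensive listing; refilled only when empty.
  private var unreadFiles: Seq[String] = Seq.empty

  def nextBatch(): Seq[String] = {
    if (unreadFiles.isEmpty) {
      // The heavy listing runs once and is amortized over several micro-batches.
      unreadFiles = fetchAllFiles()
    }
    // Serve up to maxFilesPerTrigger files and keep the rest as unread.
    val (batch, rest) = unreadFiles.splitAt(maxFilesPerTrigger)
    unreadFiles = rest
    batch
  }
}
{code}

On the user side, the optimization applies only to streams configured with both conditions, for example (assuming an existing SparkSession named spark and a placeholder input path):

{code:scala}
val stream = spark.readStream
  .format("parquet")
  .option("maxFilesPerTrigger", "100") // cap files per micro-batch
  .option("latestFirst", "false")      // the default; must be false for caching
  .load("/path/to/source")
{code}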