Jungtaek Lim created SPARK-30866:
------------------------------------

             Summary: FileStreamSource: Cache fetched list of files beyond 
maxFilesPerTrigger as unread files
                 Key: SPARK-30866
                 URL: https://issues.apache.org/jira/browse/SPARK-30866
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 3.0.0
            Reporter: Jungtaek Lim


FileStreamSource fetches the list of available files on every micro-batch, which 
is a costly operation.

(E.g., listing leaf files for 95 paths containing 674,811 files took around 5 
seconds, and that was on a local filesystem, not even an HDFS path.)

If "maxFilesPerTrigger" is not set, Spark consumes all the fetched files in a 
single batch, so it obviously has to fetch a fresh list for the next micro-batch.

If "latestFirst" is true (regardless of "maxFilesPerTrigger"), the set of files 
to process must be refreshed per batch, so Spark also has to fetch per 
micro-batch in this case.

Outside of those cases (in short, when "maxFilesPerTrigger" is set and 
"latestFirst" is false), the files to process are "continuous": we can cache the 
fetched list of files and consume from it across batches until the list is 
exhausted.
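The proposed caching behavior can be sketched roughly as follows. This is a minimal, self-contained illustration, not Spark's actual FileStreamSource internals; the names (CachedFileSource, FileEntry, the listFiles callback) are all hypothetical:

```scala
// Sketch: cache the fetched file list and serve maxFilesPerTrigger files per
// batch, re-running the expensive listing only when the cache is exhausted.
object UnreadFilesCacheSketch {
  final case class FileEntry(path: String, timestamp: Long)

  // listFiles stands in for the expensive "list all available files" call.
  class CachedFileSource(listFiles: () => Seq[FileEntry], maxFilesPerTrigger: Int) {
    private var unread: List[FileEntry] = Nil
    var listCalls: Int = 0 // for illustration: counts expensive listings

    def nextBatch(): Seq[FileEntry] = {
      if (unread.isEmpty) { // only re-list when the cached list is exhausted
        listCalls += 1
        unread = listFiles().toList
      }
      val (batch, rest) = unread.splitAt(maxFilesPerTrigger)
      unread = rest
      batch
    }
  }
}
```

With 674,811 files and a small maxFilesPerTrigger, this turns hundreds of listings into one listing per exhaustion cycle.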



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
