Jungtaek Lim created SPARK-30900:
------------------------------------

             Summary: FileStreamSource: Avoid reading compact metadata log 
twice if the query stops from compact batch and restarts
                 Key: SPARK-30900
                 URL: https://issues.apache.org/jira/browse/SPARK-30900
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 3.0.0
            Reporter: Jungtaek Lim


When a query restarts, it may start from a compaction batch that already has a 
source metadata file to read. One such case is when the previous run succeeded 
in reading from the inputs but did not finalize the batch for some reason.
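
As a rough illustration of when a restart lands on a compaction batch, the 
sketch below assumes the rule "(batchId + 1) % compactInterval == 0" marks a 
compaction batch, and that the restarted query re-runs the first batch that was 
not finalized. The function names and the interval of 10 are illustrative 
assumptions, not Spark APIs.

```python
def is_compaction_batch(batch_id: int, compact_interval: int) -> bool:
    # Assumed rule: every compact_interval-th batch, counting from 0, compacts.
    return (batch_id + 1) % compact_interval == 0


def restarts_on_compaction_batch(last_finalized_batch: int,
                                 compact_interval: int) -> bool:
    # Hypothetical helper: after a restart, the query re-runs the batch that
    # follows the last finalized one; check whether that batch compacts.
    return is_compaction_batch(last_finalized_batch + 1, compact_interval)


if __name__ == "__main__":
    # With an interval of 10, batch 9 compacts; a query that stopped after
    # finalizing batch 8 would therefore restart on a compaction batch.
    print(restarts_on_compaction_batch(8, 10))
```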

In this case FileStreamSource reads the compact metadata file twice: once to 
retrieve all files and build the seen file map, and once more to retrieve the 
entries in the batch. If the query has processed a huge number of inputs so far, 
the compact metadata file becomes considerably large, so reading it a second 
time adds unnecessary latency to processing the startup batch.
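
The difference between the two read patterns can be sketched with a small, 
self-contained model. This is not Spark code: FileEntry, CountingMetadataLog and 
both restart functions are hypothetical stand-ins, and the read counter exists 
only to make the extra read visible. It shows one possible shape of the 
improvement, which is to parse the compact file once and derive both results 
from the parsed entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    # Hypothetical model of one metadata log entry.
    path: str
    timestamp: int
    batch_id: int


@dataclass
class CountingMetadataLog:
    # Hypothetical log whose compact file holds every entry seen so far.
    compact_entries: list
    reads: int = 0

    def read_compact_file(self):
        # Stands in for deserializing a potentially very large compact file.
        self.reads += 1
        return list(self.compact_entries)


def restart_reading_twice(log, batch_id):
    # Read 1: all entries, to rebuild the seen file map.
    seen = {e.path: e.timestamp for e in log.read_compact_file()}
    # Read 2: the same file again, only to pick out this batch's entries.
    batch = [e for e in log.read_compact_file() if e.batch_id == batch_id]
    return seen, batch


def restart_reading_once(log, batch_id):
    # Single read: keep the parsed entries and derive both results from them.
    entries = log.read_compact_file()
    seen = {e.path: e.timestamp for e in entries}
    batch = [e for e in entries if e.batch_id == batch_id]
    return seen, batch


if __name__ == "__main__":
    entries = [FileEntry(f"/in/part-{i}", 1000 + i, i // 2) for i in range(20)]
    log_twice = CountingMetadataLog(entries)
    log_once = CountingMetadataLog(entries)
    same = restart_reading_twice(log_twice, 9) == restart_reading_once(log_once, 9)
    print(same, log_twice.reads, log_once.reads)
```

Both functions return the same seen file map and batch entries; only the number 
of reads of the compact file differs, which is the cost that grows with the size 
of that file.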

This issue tracks the effort to address this case.



