Jungtaek Lim created SPARK-30900:
------------------------------------

             Summary: FileStreamSource: Avoid reading compact metadata log twice if the query stops from compact batch and restarts
                 Key: SPARK-30900
                 URL: https://issues.apache.org/jira/browse/SPARK-30900
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 3.0.0
            Reporter: Jungtaek Lim

When a query restarts, it may start from a compaction batch for which a source metadata file already exists and has to be read. One such case is when the previous run succeeded in reading from its inputs but did not finalize the batch for some reason. In this case FileStreamSource reads the compact metadata file twice: once to retrieve all files and build the seen-files map, and once more to retrieve the entries in the batch. If the query has processed a huge number of inputs so far, the compact metadata file becomes considerably large, so reading it a second time adds unnecessary latency to processing the startup batch.

This issue tracks the effort to address this case.
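
To illustrate the double read described in this issue, here is a minimal Python sketch. It is not Spark code and not the actual fix: `FileEntry`, `CountingLog`, `restart_naive` and `restart_single_read` are hypothetical names, and the model assumes that a compact metadata file holds the entries of all earlier batches plus the entries of the compact batch itself. The idea shown is only that both results can be derived from a single pass over the compact file.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FileEntry:
    # Hypothetical model of one entry in the source metadata log.
    path: str
    timestamp: int
    batch_id: int


class CountingLog:
    """Hypothetical stand-in for the metadata log that counts file reads."""

    def __init__(self, batches: Dict[int, List[FileEntry]]):
        self._batches = batches
        self.read_count = 0

    def read(self, batch_id: int) -> List[FileEntry]:
        # Each call models one full read (and parse) of the metadata file.
        self.read_count += 1
        return list(self._batches[batch_id])


Result = Tuple[Dict[str, int], List[FileEntry]]


def restart_naive(log: CountingLog, compact_batch_id: int) -> Result:
    # First read: every entry, to build the seen-files map.
    seen = {e.path: e.timestamp for e in log.read(compact_batch_id)}
    # Second read of the same file: only the entries of this batch.
    batch_files = [e for e in log.read(compact_batch_id)
                   if e.batch_id == compact_batch_id]
    return seen, batch_files


def restart_single_read(log: CountingLog, compact_batch_id: int) -> Result:
    # Read the compact file once and derive both results from it.
    entries = log.read(compact_batch_id)
    seen = {e.path: e.timestamp for e in entries}
    batch_files = [e for e in entries if e.batch_id == compact_batch_id]
    return seen, batch_files
```

Under these assumptions both functions return the same seen-files map and the same batch entries, while the second reads the compact file only once. The saving grows with the size of the compact file, which is the startup latency this issue is about.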