HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-558976957

> IMHO, the core problem is the compact metadata log grows bigger and bigger, and it is a time-consuming work to compact the metadata log, because it will read old compact log file and then write to new compact log file.

I agree that the problem is that the compact metadata log almost always just grows, though the time spent building the metadata log is only one of several major issues. The other major issue is that the cost of reading the metadata log won't decrease unless we optimize the file format or drop entries, as this patch proposes.

One thing we have to consider is that when the `compact` phase happens, Spark is able to drop some of the existing entries. That is the feature this patch leverages. It requires a full read and rewrite of the entries on every compact phase, which is why we can't simply append two compact files together.

It looks like `CompactibleFileStreamLog` was introduced to avoid the "small files problem". It seems possible to tweak it a bit so that it maintains a "ranged delta" instead, which might be closer to what you proposed. That would no longer be a "snapshot", but in most cases entries are not removed, so it also makes sense to me. I expect the logic to be more complicated than the current one, but that might be acceptable given how badly the issue has been affecting end users.
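To make the trade-off concrete, here is a toy Python sketch of the two compaction styles discussed above. It is not Spark's code and does not use `CompactibleFileStreamLog`; `SinkEntry`, `compact_snapshot`, `compact_ranged_delta`, and the `retention_ms` parameter are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SinkEntry:
    """Hypothetical stand-in for one output-file entry in the metadata log."""
    path: str
    commit_time_ms: int


def compact_snapshot(previous_compact: List[SinkEntry],
                     new_batches: List[List[SinkEntry]],
                     now_ms: int,
                     retention_ms: int) -> List[SinkEntry]:
    """Snapshot-style compaction: re-read every old entry plus the new
    batches, drop entries older than the retention window, and rewrite
    all survivors. Work grows with the total number of entries."""
    all_entries = list(previous_compact)
    for batch in new_batches:
        all_entries.extend(batch)
    return [e for e in all_entries
            if now_ms - e.commit_time_ms <= retention_ms]


def compact_ranged_delta(new_batches: List[List[SinkEntry]]) -> List[SinkEntry]:
    """Ranged-delta-style compaction: merge only the batches written since
    the last compact file. Older compact files are never re-read, so old
    entries cannot be dropped here."""
    merged: List[SinkEntry] = []
    for batch in new_batches:
        merged.extend(batch)
    return merged


previous = [SinkEntry("part-0", 0), SinkEntry("part-1", 1000)]
batches = [[SinkEntry("part-2", 5000)], [SinkEntry("part-3", 6000)]]

snapshot = compact_snapshot(previous, batches, now_ms=6000, retention_ms=5500)
delta = compact_ranged_delta(batches)
```

In this sketch the snapshot variant can expire `part-0` only because it re-reads the previous compact file, while the ranged-delta variant touches just the new batches and leaves old entries where they are. A real ranged-delta design would also need a separate way to apply retention to older ranges.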