[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

GitBox Wed, 27 Nov 2019 00:09:58 -0800

HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-558976957
 
 
   > IMHO, the core problem is the compact metadata log grows bigger and 
bigger, and it is a time-consuming work to compact the metadata log, because it 
will read old compact log file and then write to new compact log file.
   
   I agree with you that the problem is that the compact metadata log just 
keeps growing most of the time, though the time spent building the metadata 
log is only one of multiple major issues. The other major issue is that the 
cost of reading the metadata log won't decrease unless we optimize the file 
format or get rid of entities, as this patch proposes.
   
   One thing we have to consider is that when the `compact` phase happens, 
Spark is able to get rid of some existing entities; that's the feature this 
patch leverages. That requires a full read and rewrite of the entities in each 
compact phase, and that's why we can't simply combine two compact files.
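
To illustrate the point above, here is a minimal toy model of a snapshot-style compaction with an optional retention filter. It is only a sketch, not Spark's actual code: `Entry`, `compact`, and `retention_ms` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for one output-file record in the metadata log."""
    path: str
    modified_ms: int


def compact(previous_snapshot, deltas, now_ms, retention_ms=None):
    """Build a new snapshot from the previous snapshot plus newer batches.

    Every entry is read and rewritten, which is what makes it possible
    to drop some of them along the way.
    """
    merged = list(previous_snapshot)
    for batch in deltas:
        merged.extend(batch)
    if retention_ms is None:
        # Without retention the snapshot can only grow.
        return merged
    # With retention, entries older than the window are dropped.
    return [e for e in merged if now_ms - e.modified_ms <= retention_ms]


old_snapshot = [Entry("part-0000", 1_000), Entry("part-0001", 5_000)]
new_batches = [[Entry("part-0002", 9_000)], [Entry("part-0003", 9_500)]]

kept_all = compact(old_snapshot, new_batches, now_ms=10_000)
kept_recent = compact(old_snapshot, new_batches, now_ms=10_000, retention_ms=6_000)
```

In this toy model `kept_all` holds all four entries, while `kept_recent` drops `part-0000`. The filter can only be applied because the old snapshot is read in full.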
   
   It looks like `CompactibleFileStreamLog` was introduced to avoid the "small 
files problem". It seems possible to tweak it a bit so that it maintains a 
"ranged delta" instead, which might be closer to what you proposed. That would 
no longer be a "snapshot", but in most cases the entities are not removed, so 
it also makes sense to me. I expect the logic to be more complicated than the 
current one, but that might be acceptable given how badly the issue has been 
affecting end users.
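
A rough sketch of what a "ranged delta" layout could look like, purely as an assumption for discussion: each merged file covers only a range of batches and never re-reads older ranges. The names `write_ranged_delta` and `read_all` are hypothetical and do not exist in Spark.

```python
def write_ranged_delta(batches, start, end):
    """Merge batches start..end (inclusive) into one range file.

    Unlike a snapshot, this never reads files from earlier ranges.
    """
    entries = [e for batch_id in range(start, end + 1) for e in batches[batch_id]]
    return {"range": (start, end), "entries": entries}


def read_all(range_files, pending_batches):
    """List every entry: all range files in order, then uncompacted batches."""
    entries = []
    for f in sorted(range_files, key=lambda f: f["range"][0]):
        entries.extend(f["entries"])
    for batch in pending_batches:
        entries.extend(batch)
    return entries


# Toy data: batch id -> entries written by that batch.
batches = {0: ["a"], 1: ["b"], 2: ["c"], 3: ["d"], 4: ["e"]}
ranges = [write_ranged_delta(batches, 0, 1), write_ranged_delta(batches, 2, 3)]
listing = read_all(ranges, [batches[4]])
```

The trade-off in this sketch is that writing a range file is cheap, but a reader has to visit every range file, and removing old entities would need extra logic on top.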


