[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-566543290 @HeartSaVioR Would you mind if I ask you to elaborate on your answer? IMHO it's not clear which one (or both?) you are OK with. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-564409487 @uncleGen Hi, do you plan to go ahead with your idea? I have been thinking about this issue, and your idea seems to be a realistic solution that doesn't introduce too many changes. We may also want to find a solution that covers most cases, but for now it would be great to have even just your idea. If you're not planning to work on it, would you mind if I pick it up?
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-562557622 @tdas @zsxwing @jose-torres @gaborgsomogyi Kind reminder.
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-558976957

> IMHO, the core problem is the compact metadata log grows bigger and bigger, and it is a time-consuming work to compact the metadata log, because it will read old compact log file and then write to new compact log file.

I agree that the problem is that the compact metadata log almost always just grows, though the time spent building the metadata log is only one of several major issues. The other major issue is that the cost of reading the metadata log won't decrease unless we optimize the file format or remove entries, as this patch proposes.

One thing we have to consider is that when the `compact` phase happens, Spark is able to drop some existing entries; that's the feature this patch leverages. It requires a full read and rewrite of the entries on each compact phase, and that's why we can't simply add up two compact files.

It looks like `CompactibleFileStreamLog` was introduced to avoid the "small files problem", and it seems possible to tweak it to maintain a "ranged delta" instead, which might be closer to what you proposed. That would no longer be a "snapshot", but in most cases entries are not removed, so it also makes sense to me. I expect the logic to be more complicated than the current one, but that might be acceptable given how badly the issue has been affecting end users.
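The difference between a "snapshot" compaction and a "ranged delta" compaction discussed in the comment above can be illustrated with a small model. This is only a sketch under stated assumptions: it is plain Python, not Spark code, and the names `Entry`, `snapshot_compact`, `ranged_delta_compact`, and `entries_written` are hypothetical and do not correspond to anything in `CompactibleFileStreamLog`. It assumes every `interval` batches are folded into one compact file.

```python
from typing import Callable, List, Tuple

Entry = str  # hypothetical: one output-file path recorded in the sink metadata


def snapshot_compact(prev_snapshot: List[Entry],
                     deltas: List[List[Entry]],
                     keep: Callable[[Entry], bool] = lambda e: True) -> List[Entry]:
    """Full rewrite: re-read the previous snapshot, append the new deltas,
    and drop any entry rejected by `keep` (this is where removal can happen)."""
    merged = list(prev_snapshot)
    for delta in deltas:
        merged.extend(delta)
    return [e for e in merged if keep(e)]


def ranged_delta_compact(deltas: List[List[Entry]]) -> List[Entry]:
    """Merge only the deltas of the current range; older compact files are
    left untouched, so old entries cannot be dropped by rewriting them."""
    return [e for delta in deltas for e in delta]


def entries_written(batches: List[List[Entry]], interval: int) -> Tuple[int, int]:
    """Total number of entries written into compact files by each strategy."""
    snapshot: List[Entry] = []
    snapshot_cost = 0
    ranged_cost = 0
    for start in range(0, len(batches), interval):
        deltas = batches[start:start + interval]
        snapshot = snapshot_compact(snapshot, deltas)
        snapshot_cost += len(snapshot)          # grows with the whole history
        ranged_cost += len(ranged_delta_compact(deltas))  # bounded per range
    return snapshot_cost, ranged_cost


batches = [[f"part-{b}-{i}" for i in range(2)] for b in range(6)]
print(entries_written(batches, interval=3))
```

With six batches of two entries each and an interval of 3, the snapshot strategy writes 6 and then 12 entries (18 in total), while the ranged-delta strategy writes 6 entries per range (12 in total). The trade-off in this model is that only the snapshot rewrite offers a natural point at which entries can be filtered out via `keep`.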
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-558433816

Maybe we can differentiate two major cases:

1) The downstream query that reads the output directory is also Spark (and leverages the metadata).

In this case, technically we can never delete any entries in the metadata if we want the downstream query to produce the same result across multiple runs (unless inputs are added in real time). We know that's only the ideal: if the streaming query runs for a long time and writes a gigantic number/size of files, we would want to get rid of some of them to gain speed and save storage, fully understanding that we are throwing away some inputs, which will affect the query result.

Assume we decide to get rid of some output files. How do we do it safely? The only safe way is to remove them from the metadata first and then delete the actual files. (The downstream query relies on the metadata to get the list of files, so if we don't remove them from the metadata first, the downstream query will try to read files that no longer exist and fail, depending on the option.) That means the running streaming query has to handle the deletion, as we don't have any official offline tool to modify the metadata, and it is difficult to tell the streaming query which files to delete. That's why I simply picked "retention", which is a generally accepted approach (Kafka also applies a retention policy by default).

2) We never let Spark read the output directory; other frameworks read it instead.

In this case we don't need to build the metadata, though this means end users will need to deal with an "at-least-once" guarantee. Given that the file sink doesn't overwrite files, it may also leave corrupted records in partial output.

If that's acceptable, we may be able to add an option to "disable" the metadata, though there were some comments expressing concern about doing so: https://github.com/apache/spark/pull/24128#issuecomment-474109068

So I guess there are not many options here and I picked the viable one, but I'd really appreciate more ideas!
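The ordering constraint from case 1 above (remove entries from the metadata first, then delete the files) can be sketched as follows. This is a minimal illustration, not the implementation in this PR: `SinkEntry`, `apply_retention`, `compact_with_retention`, and the two callbacks are hypothetical names, and the retention check against a modification timestamp is an assumption about how such a policy could be expressed.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class SinkEntry:
    """Hypothetical metadata record for one output file."""
    path: str
    modified_ms: int


def apply_retention(entries: List[SinkEntry], now_ms: int,
                    retention_ms: int) -> Tuple[List[SinkEntry], List[SinkEntry]]:
    """Split entries into (kept, expired) by age relative to `now_ms`."""
    kept = [e for e in entries if now_ms - e.modified_ms <= retention_ms]
    expired = [e for e in entries if now_ms - e.modified_ms > retention_ms]
    return kept, expired


def compact_with_retention(entries: List[SinkEntry], now_ms: int, retention_ms: int,
                           write_metadata: Callable[[List[SinkEntry]], None],
                           delete_file: Callable[[str], None]) -> List[SinkEntry]:
    """Apply retention during a compact phase, in the safe order."""
    kept, expired = apply_retention(entries, now_ms, retention_ms)
    # Step 1: publish metadata without the expired entries, so a downstream
    # reader that lists files via the metadata no longer sees them.
    write_metadata(kept)
    # Step 2: only now delete the data files themselves.
    for entry in expired:
        delete_file(entry.path)
    return kept
```

If the process fails between the two steps in this sketch, the expired files are merely left behind as unreferenced data; the reverse order could leave the metadata pointing at files that no longer exist, which is the failure mode the comment warns about.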
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-557398360 SPARK-29995 was just filed; it describes the same issue as SPARK-24295.
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-534717837 retest this, please
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-534461165 retest this, please
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-529248401 Ping.
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-523201466 @tdas @zsxwing @jose-torres @gaborgsomogyi Kind reminder.
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-488107523 Ping again, now that Spark+AI Summit 2019 in SF has ended.
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-480482791 Kind reminder.
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-476025148 Could I kindly ask for a review of the new approach? It would not be intrusive unless end users configure the retention badly.
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-474225477 retest this, please
[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files URL: https://github.com/apache/spark/pull/24128#issuecomment-474153829 Rebased to the approach: applying retention.