[GitHub] [hudi] satishkotha edited a comment on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table
satishkotha edited a comment on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663683323

> Is there a possibility that commits get archived before clean job is resulting in a noop. I will continue to monitor.

Clean and archival are somewhat independent today, so this 'noop' should not happen.

> Also can you confirm If I can run a clean job in a separate spark job concurrently while streaming write is happening, guess it should be fine as compaction runs have that ability

Why are you considering a separate spark job for clean? Are you seeing clean take a lot of time? You can consider running clean concurrently with write by setting 'hoodie.clean.async' to true. (This runs clean in the same job, but concurrently with write.) I don't know of anyone using a separate spark job to run clean. Theoretically, I think it is possible, but you may have to do some testing because it isn't used like this afaik.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
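The async-clean setting mentioned above is passed as a Hudi write option. A minimal sketch of what that might look like from PySpark — only 'hoodie.clean.async' comes from the comment; the table name, operation, and the commented-out write call are illustrative assumptions:

```python
# Hudi write options enabling asynchronous cleaning, as suggested in the
# comment above. Only 'hoodie.clean.async' is from the comment; the other
# keys and values are illustrative placeholders.
hudi_options = {
    "hoodie.table.name": "my_table",                 # illustrative table name
    "hoodie.datasource.write.operation": "upsert",   # illustrative write operation
    "hoodie.clean.async": "true",                    # run clean concurrently with the write
}

# Illustrative usage with a Spark DataFrame `df` and a target `path`:
# df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```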
[GitHub] [hudi] satishkotha edited a comment on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table
satishkotha edited a comment on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663298411

Hi @luffyd

By default, upsert on MOR tables creates 'deltacommits'. [Compaction](https://cwiki.apache.org/confluence/display/HUDI/Design+And+Architecture#DesignAndArchitecture-Compaction) needs to run to convert deltacommits into commits. Clean works only after compaction runs and commits are created. Clean also does not remove file groups that have a pending compaction.

Can you set up inline compaction [using instructions here](https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoIruncompactionforaMORdataset) for testing and see if that helps? If that doesn't work, can you share a screenshot of the files in the .hoodie folder in 'getHudiPath'?
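The inline compaction suggested above is also configured through write options. A minimal sketch, assuming the standard Hudi config keys for MOR tables — the trigger threshold value here is illustrative, not a recommendation from the comment:

```python
# Write options enabling inline compaction on a MOR table, so deltacommits
# get compacted into commits that clean can then act on. Key names follow
# Hudi's compaction configuration; the threshold value is illustrative.
hudi_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",                  # run compaction inline after writes
    "hoodie.compact.inline.max.delta.commits": "5",   # illustrative: compact every 5 deltacommits
}
```

With these options, compaction runs as part of the write job itself, which should produce the commits that clean needs to make progress.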