hudi-bot opened a new issue, #14684: URL: https://github.com/apache/hudi/issues/14684
Since DeltaStreamer makes heavy use of file listing, a source that contains a lot of tiny files can quickly become a bottleneck. We need a way to delete or archive files once DeltaStreamer has processed them.

It seems the most reliable point to clean up the source is after DeltaSync commits the checkpoint successfully. We could add a new public method to Source, e.g. `postCommit()`, and invoke it after each successful commit.

Reference: [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources]

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-1348
- Type: Improvement

---

## Comments

21/Oct/20 12:59;vho;[~vinoth] what do you think about this? I'm not sure if there's already a solution ;;;

---

24/Oct/20 02:12;vinoth;[~vho] thanks for bringing this up. It seems valuable to support such an option. I am not aware of any other work along these lines. [~bhasudha] is looking into parallelism for input source listing;;;

---

08/Aug/21 20:17;githubbot;hudi-bot edited a comment on pull request #2210: URL: https://github.com/apache/hudi/pull/2210#issuecomment-862028641

## CI report:

* b845e34d11e4e44e2b41e2089349baddc3a10b80 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=210)

<details>
<summary>Bot commands</summary>

The @flinkbot bot supports the following commands:

- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
</details>

-- This is an automated message from the Apache Git Service.
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] ;;;

---

11/Aug/21 23:27;githubbot;hudi-bot edited a comment on pull request #2210: URL: https://github.com/apache/hudi/pull/2210#issuecomment-862028641

## CI report:

* a174c4ed2b4c13a032a38afdb0a21b58a7b6cf25 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1668)

<details>
<summary>Bot commands</summary>

@hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
</details>

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] ;;;
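
---

For illustration, the `postCommit()` idea from the issue description could look roughly like the sketch below. It is not Hudi code: `CommitAwareSource`, `ArchivingFileSource`, and `syncOnce` are hypothetical names, the local filesystem (`java.nio.file`) stands in for the DFS, and the boolean `commitSucceeds` stands in for the outcome of a real DeltaSync commit. The only point it shows is the ordering: source files are archived strictly after the commit succeeds, and left in place otherwise.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class PostCommitSketch {

    /** Hypothetical hook interface; Hudi's real Source class is not modelled here. */
    interface CommitAwareSource {
        List<Path> fetchNext() throws IOException;

        /** Called only after the batch returned by fetchNext() was committed. */
        void postCommit() throws IOException;
    }

    /** Lists files in inputDir and moves them to archiveDir once committed. */
    static final class ArchivingFileSource implements CommitAwareSource {
        private final Path inputDir;
        private final Path archiveDir;
        private List<Path> lastBatch = new ArrayList<>();

        ArchivingFileSource(Path inputDir, Path archiveDir) {
            this.inputDir = inputDir;
            this.archiveDir = archiveDir;
        }

        @Override
        public List<Path> fetchNext() throws IOException {
            // Remember exactly which files were handed out, so that
            // postCommit() never touches files that arrived later.
            try (Stream<Path> files = Files.list(inputDir)) {
                lastBatch = files.filter(Files::isRegularFile)
                                 .sorted()
                                 .collect(Collectors.toList());
            }
            return new ArrayList<>(lastBatch);
        }

        @Override
        public void postCommit() throws IOException {
            Files.createDirectories(archiveDir);
            for (Path file : lastBatch) {
                Files.move(file, archiveDir.resolve(file.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
            lastBatch = new ArrayList<>();
        }
    }

    /** One sync round; returns the number of source files cleaned up. */
    static int syncOnce(CommitAwareSource source, boolean commitSucceeds)
            throws IOException {
        List<Path> batch = source.fetchNext();
        if (!commitSucceeds) {
            return 0; // failed commit: source files stay put for the retry
        }
        source.postCommit();
        return batch.size();
    }
}
```

Archiving rather than deleting keeps the processed files recoverable; a deleting variant would replace the `Files.move` call with `Files.delete`. Either way, the listing cost of the next round depends only on files that have not been committed yet.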
