[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve
scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r785259770

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java ##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
    .getLatestFileSlices(partitionPath)
    .filter(slice -> !fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
    .map(s -> {
+   // We can think that the latest data is in the latest delta log file, so we sort it from large

Review comment:
In addition, I changed the order in which the delta log files are read, to avoid rewriting data as far as possible. HoodieRecordPayload#preCombine will still execute and select the correct data.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
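To make the preCombine claim above concrete, here is a minimal sketch in plain Java. `PreCombineOrderSketch`, `Payload`, `fold`, and the tie-break rule are hypothetical stand-ins for illustration, not Hudi classes or the behavior of any particular Hudi payload. Under the assumed rule, the read order does not change the winner when ordering values are distinct, but it does when two records carry the same ordering value, which may be a case worth covering in a unit test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a record payload; this is not a Hudi class.
final class PreCombineOrderSketch {

  static final class Payload {
    final long orderingVal;
    final String value;

    Payload(long orderingVal, String value) {
      this.orderingVal = orderingVal;
      this.value = value;
    }

    // Assumed rule: keep the previously seen payload only if its ordering
    // value is strictly larger; on a tie the payload read later wins.
    Payload preCombine(Payload previous) {
      return previous.orderingVal > this.orderingVal ? previous : this;
    }
  }

  // Folds payloads for one record key in the order they are read and
  // returns the value of the surviving payload.
  static String fold(List<Payload> inReadOrder) {
    Payload current = null;
    for (Payload next : inReadOrder) {
      current = (current == null) ? next : next.preCombine(current);
    }
    return current == null ? null : current.value;
  }

  static List<Payload> reversed(List<Payload> in) {
    List<Payload> copy = new ArrayList<>(in);
    Collections.reverse(copy);
    return copy;
  }

  // Distinct ordering values: the winner should not depend on read order.
  static List<Payload> distinctOrdering() {
    return Arrays.asList(new Payload(1, "v1"), new Payload(3, "v3"), new Payload(2, "v2"));
  }

  // Equal ordering values: in this sketch the winner depends on read order.
  static List<Payload> tiedOrdering() {
    return Arrays.asList(new Payload(5, "first-written"), new Payload(5, "second-written"));
  }

  public static void main(String[] args) {
    System.out.println(fold(distinctOrdering()) + " / " + fold(reversed(distinctOrdering())));
    System.out.println(fold(tiedOrdering()) + " / " + fold(reversed(tiedOrdering())));
  }
}
```

Whether a tie can occur in practice, and how it is resolved, depends on the payload implementation configured for the table; the sketch only shows why that case deserves attention when the read order is reversed.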
[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve
scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r785259289

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java ##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
    .getLatestFileSlices(partitionPath)
    .filter(slice -> !fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
    .map(s -> {
+   // We can think that the latest data is in the latest delta log file, so we sort it from large

Review comment:
> I think you are assuming the later writes in the log always overwrites the earlier ones? this is not true always.

In the compaction plan generation phase, I only changed the order in which the delta log files are read. I have run this change in our internal production environment for a month, and no data exceptions have occurred. I am not sure how to test this part. Can you give me some suggestions?
[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve
scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r785259224

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java ##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
    .getLatestFileSlices(partitionPath)
    .filter(slice -> !fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
    .map(s -> {
+   // We can think that the latest data is in the latest delta log file, so we sort it from large

Review comment:
> I think you are assuming the later writes in the log always overwrites the earlier ones? this is not true always.

In the compaction plan generation phase, I only changed the order in which the delta log files are read. I have run this change in our internal production environment for a month (with clustering, cleaning, and compaction all inline), and no data exceptions have occurred. I am not sure how to test this part. Can you give me some suggestions?
[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve
scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r772869290

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java ##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
    .getLatestFileSlices(partitionPath)
    .filter(slice -> !fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
    .map(s -> {
+   // We can think that the latest data is in the latest delta log file, so we sort it from large

Review comment:
> Have a clarification on the first fix. Could you add some UTs for this?

OK, I'll try to add some UTs.
[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve
scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r772868883

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java ##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
    .getLatestFileSlices(partitionPath)
    .filter(slice -> !fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
    .map(s -> {
+   // We can think that the latest data is in the latest delta log file, so we sort it from large

Review comment:
You're right, but in most cases the newest data is in the latest delta log file, so we sort the log files from largest to smallest by instant time. That way the program avoids updating data already in the ExternalSpillableMap, which saves compaction time. What do you think?
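The effect described in this comment can be sketched in plain Java under simplified assumptions. `ReverseLogReadSketch`, `Rec`, `LogFile`, and the helper methods are hypothetical names for illustration; a `HashMap` stands in for the ExternalSpillableMap, and "keep the record with the strictly larger ordering value" stands in for the real merge logic. The sketch counts how often an entry already in the map has to be replaced when log files are read oldest first versus newest first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not Hudi classes.
final class ReverseLogReadSketch {

  static final class Rec {
    final String key;
    final long orderingVal;
    final String value;

    Rec(String key, long orderingVal, String value) {
      this.key = key;
      this.orderingVal = orderingVal;
      this.value = value;
    }
  }

  static final class LogFile {
    final long instantTime;
    final List<Rec> records;

    LogFile(long instantTime, List<Rec> records) {
      this.instantTime = instantTime;
      this.records = records;
    }
  }

  // Merges the log files in the given order into the map and returns how
  // many times an entry already in the map had to be replaced.
  static int merge(List<LogFile> logs, Map<String, Rec> merged) {
    int overwrites = 0;
    for (LogFile log : logs) {
      for (Rec incoming : log.records) {
        Rec existing = merged.get(incoming.key);
        if (existing == null) {
          merged.put(incoming.key, incoming);
        } else if (incoming.orderingVal > existing.orderingVal) {
          merged.put(incoming.key, incoming);
          overwrites++;
        }
      }
    }
    return overwrites;
  }

  static int countOverwrites(List<LogFile> logs) {
    return merge(logs, new HashMap<>());
  }

  static String mergedValue(List<LogFile> logs, String key) {
    Map<String, Rec> merged = new HashMap<>();
    merge(logs, merged);
    Rec rec = merged.get(key);
    return rec == null ? null : rec.value;
  }

  // Sorts a copy of the log files from largest to smallest instant time.
  static List<LogFile> newestFirst(List<LogFile> logs) {
    List<LogFile> sorted = new ArrayList<>(logs);
    sorted.sort(Comparator.comparingLong((LogFile l) -> l.instantTime).reversed());
    return sorted;
  }

  // Three log files in which later instants carry larger ordering values.
  static List<LogFile> sample() {
    return Arrays.asList(
        new LogFile(1L, Arrays.asList(new Rec("a", 1, "a1"), new Rec("b", 1, "b1"))),
        new LogFile(2L, Arrays.asList(new Rec("a", 2, "a2"), new Rec("b", 2, "b2"))),
        new LogFile(3L, Arrays.asList(new Rec("a", 3, "a3"))));
  }

  public static void main(String[] args) {
    System.out.println("oldest first overwrites: " + countOverwrites(sample()));
    System.out.println("newest first overwrites: " + countOverwrites(newestFirst(sample())));
  }
}
```

With this sample data, both read orders end with the same values for keys `a` and `b`; only the number of map replacements differs. The saving holds only to the extent that newer log files really do carry the winning records, which is the assumption the reviewer questioned.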