[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve

2022-01-14 Thread GitBox


scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r785259770



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
 .getLatestFileSlices(partitionPath)
 .filter(slice -> 
!fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
 .map(s -> {
+  // We can think that the latest data is in the latest delta log 
file, so we sort it from large

Review comment:
   In addition, I changed the order in which the delta log files are read, so 
that record rewriting is avoided as much as possible. 
HoodieRecordPayload#preCombine is still executed and still selects the correct 
record.
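
   The claim above can be illustrated with a minimal sketch. It is not Hudi's 
merge code: `MergeOrderSketch`, `Rec`, `Result`, and `merge` are hypothetical 
names, a plain `HashMap` stands in for the spillable map, and the merge rule 
(keep the record with the larger ordering value) is an assumption that stands 
in for what a payload's preCombine might do. It also assumes ordering values 
are distinct per key; with ties, the read order would decide the winner.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Hudi code: shows that an ordering-value merge gives
// the same result in either read order, but replaces fewer map entries when
// the newest records are read first.
class MergeOrderSketch {

    // Hypothetical stand-in for a log record.
    static final class Rec {
        final String key;
        final long orderingVal; // assumed precombine/ordering field
        final String value;

        Rec(String key, long orderingVal, String value) {
            this.key = key;
            this.orderingVal = orderingVal;
            this.value = value;
        }
    }

    // Final merged state plus the number of entries that had to be replaced.
    static final class Result {
        final Map<String, Rec> merged = new HashMap<>();
        int replacements = 0;
    }

    // Merge records in the given read order, keeping the larger ordering value.
    static Result merge(List<Rec> recordsInReadOrder) {
        Result result = new Result();
        for (Rec incoming : recordsInReadOrder) {
            Rec existing = result.merged.get(incoming.key);
            if (existing == null) {
                // First time this key is seen: a plain insert.
                result.merged.put(incoming.key, incoming);
            } else if (incoming.orderingVal > existing.orderingVal) {
                // A stored record is overwritten: this is the cost being avoided.
                result.merged.put(incoming.key, incoming);
                result.replacements++;
            }
            // Otherwise the stored record already wins and nothing is written.
        }
        return result;
    }
}
```

   Under these assumptions, reading three versions of one key oldest-first 
replaces the stored entry twice, while reading them newest-first replaces it 
zero times, and both orders end with the same record.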




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve

2022-01-14 Thread GitBox


scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r785259289



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
 .getLatestFileSlices(partitionPath)
 .filter(slice -> 
!fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
 .map(s -> {
+  // We can think that the latest data is in the latest delta log 
file, so we sort it from large

Review comment:
   > I think you are assuming the later writes in the log always overwrites 
the earlier ones? this is not true always.
   
   In the compaction plan generation phase, I only changed the order in which 
the delta log files are read. I have run this change in our internal production 
environment for a month, and no data anomalies have occurred. I am not sure how 
to test this part, though. Can you give me some suggestions?








[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve

2022-01-14 Thread GitBox


scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r785259224



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
 .getLatestFileSlices(partitionPath)
 .filter(slice -> 
!fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
 .map(s -> {
+  // We can think that the latest data is in the latest delta log 
file, so we sort it from large

Review comment:
   > I think you are assuming the later writes in the log always overwrites 
the earlier ones? this is not true always.
   
   In the compaction plan generation phase, I only changed the order in which 
the delta log files are read. I have run this change in our internal production 
environment for a month (with clustering, cleaning, and compaction all running 
inline), and no data anomalies have occurred. I am not sure how to test this 
part, though. Can you give me some suggestions?








[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve

2021-12-20 Thread GitBox


scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r772869290



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
 .getLatestFileSlices(partitionPath)
 .filter(slice -> 
!fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
 .map(s -> {
+  // We can think that the latest data is in the latest delta log 
file, so we sort it from large

Review comment:
   > Have a clarification on the first fix. Could you add some UTs for this?
   
   OK, I'll try to add some UTs








[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve

2021-12-20 Thread GitBox


scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r772868883



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
 .getLatestFileSlices(partitionPath)
 .filter(slice -> 
!fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
 .map(s -> {
+  // We can think that the latest data is in the latest delta log 
file, so we sort it from large

Review comment:
   You're right, but in most cases the newest data is in the latest delta 
log, so we sort the log files from largest to smallest instant time. That way 
the program avoids updating records that are already in the 
ExternalSpillableMap, which saves compaction time. What do you think?
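
   The ordering described above could be sketched as follows. This is an 
illustration only: `LogFileOrderSketch`, `LogFile`, and `newestFirst` are 
hypothetical names rather than Hudi's `HoodieLogFile` API, and the sketch 
assumes instant times are fixed-width timestamp strings that compare correctly 
as plain strings, with a log version number used as a tie-breaker.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not Hudi code: order delta log files newest-first.
class LogFileOrderSketch {

    // Hypothetical stand-in for a delta log file.
    static final class LogFile {
        final String instantTime; // assumed fixed-width, e.g. "20220114093000"
        final int version;        // assumed log version within one instant

        LogFile(String instantTime, int version) {
            this.instantTime = instantTime;
            this.version = version;
        }
    }

    // Return a copy sorted from largest to smallest (instant time, version).
    static List<LogFile> newestFirst(List<LogFile> files) {
        List<LogFile> sorted = new ArrayList<>(files);
        sorted.sort(Comparator
            .comparing((LogFile f) -> f.instantTime)
            .thenComparingInt(f -> f.version)
            .reversed());
        return sorted;
    }
}
```

   The input list is copied rather than sorted in place, so callers that rely 
on the original order are not affected.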



