[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
boneanxs commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1260281685

> can you please create a Jira corresponding to your investigation and link it in here? So that it's easier to discover it

Yea, sure thing

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
boneanxs commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1258921709

> Is number of files in your tables before or after?

It's before; I've added the after numbers.

### Test1

Row enabled | Partition hour | total size | file num | total size (after) | file num (after) | runtime
-- | -- | -- | -- | -- | -- | --
true | dt=2022-09-22/hh=14 | 233.9 G | 1.3 K | 138.1 G | 47 | 753s
false | dt=2022-09-22/hh=22 | 209.6 G | 1.3 K | 123.6 G | 43 | 1008s

### Test2

Row enabled | Partition hour | total size | file num | total size (after) | file num (after) | runtime
-- | -- | -- | -- | -- | -- | --
true | 2022-09-19 | 70.9 G | 7.5 K | 55.8 G | 409 | 11h 7min
false | 2022-09-20 | 69.7 G | 7.3 K | 54.5 G | 397 | 11h 33min

Yea, the second test has many small files, but it still confuses me why writing files with an average size of ~250M is so slow. I'm still investigating why it happens.
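To make the write-stage gap between the two tests concrete, here is a quick back-of-the-envelope throughput check on the output numbers above. This is only a sanity-check sketch: it assumes "G" in the tables means GiB, and it divides by total job runtime rather than write-stage time alone.

```python
# Rough aggregate throughput estimate from the post-clustering numbers above.
# Assumption: "G" in the tables is GiB; runtime is the whole job, not just writes.

def throughput_mib_per_s(size_gib: float, runtime_s: int) -> float:
    """Return aggregate throughput in MiB/s."""
    return size_gib * 1024 / runtime_s

# Test1, row writer enabled: 138.1 G written in 753 s.
test1 = throughput_mib_per_s(138.1, 753)

# Test2, row writer enabled: 55.8 G written in 11 h 7 min.
test2 = throughput_mib_per_s(55.8, 11 * 3600 + 7 * 60)

print(f"Test1: {test1:.1f} MiB/s, Test2: {test2:.1f} MiB/s")
# Test1 comes out around 188 MiB/s, Test2 well under 2 MiB/s,
# which is consistent with the small-file overhead dominating Test2.
```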
boneanxs commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1257404461

### Test1

4 flat columns

```bash
--num-executors 64 \
--driver-memory 20g \
--driver-cores 1 \
--executor-memory 20g \  # rowEnable: 10g
--executor-cores 1 \
--class org.apache.hudi.utilities.HoodieClusteringJob \
$PWD/../hudi-utilities-slim-bundle_2.12-0.13.0-SNAPSHOT.jar \
--mode scheduleAndExecute \
--base-path $TABLEPATH \
--table-name $TABLENAME \
--spark-memory 20g \  # rowEnable: 10g
--parallelism 64 \
--hoodie-conf hoodie.clustering.async.enabled=true \
--hoodie-conf hoodie.clustering.async.max.commits=0 \
--hoodie-conf hoodie.clustering.plan.strategy.max.bytes.per.group=5368709120 \
--hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=6442450944 \
--hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=1073741824 \
--hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=1000 \
```

Row enabled | Partition hour | total size | file num | runtime
-- | -- | -- | -- | --
true | dt=2022-09-22/hh=14 | 233.9 G | 1.3 K | 753s
false | dt=2022-09-22/hh=22 | 209.6 G | 1.3 K | 1008s

### Test2

23 columns, 9 nested columns, using z-order

```bash
--conf 'spark.sql.parquet.columnarReaderBatchSize=2048' \
--conf 'spark.yarn.maxAppAttempts=1' \
--num-executors 32 \
--driver-memory 20g \
--driver-cores 1 \
--executor-memory 30g \
--executor-cores 1 \
--class org.apache.hudi.utilities.HoodieClusteringJob \
$PWD/../hudi-utilities-slim-bundle_2.12-0.13.0-SNAPSHOT.jar \
--mode scheduleAndExecute \
--base-path $TABLEPATH \
--table-name $TABLENAME \
--spark-memory 30g \
--parallelism 32 \
--hoodie-conf hoodie.clustering.async.enabled=true \
--hoodie-conf hoodie.clustering.async.max.commits=0 \
--hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=209715200 \
--hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=1073741824 \
--hoodie-conf hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy \
--hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \
--hoodie-conf hoodie.layout.optimize.enable=true \
--hoodie-conf hoodie.layout.optimize.strategy=z-order \
--hoodie-conf hoodie.clustering.plan.strategy.sort.columns=applicationId,sparkUser
```

Row enabled | Partition hour | total size | file num | runtime
-- | -- | -- | -- | --
true | 2022-09-19 | 70.9 G | 7.5 K | 11h 7min
false | 2022-09-20 | 69.7 G | 7.3 K | 11h 33min

Computing performance improved by 20% to 30%. The bottleneck of this job is writing data: both jobs spend approximately 10 hours in the writing stage.
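For context on the `hoodie.layout.optimize.strategy=z-order` setting used in Test2: z-ordering interleaves the bits of the sort columns so that rows close in the combined key are also close in every individual column, which is what lets clustering prune files on either column later. A minimal two-column illustration follows (a sketch of the general technique only; Hudi's actual implementation maps real column values to comparable integers first):

```python
def z_order_key(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y (x in even positions, y in odd)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bit i of x -> position 2i
        key |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> position 2i+1
    return key

# Sorting rows by the interleaved key keeps rows that are close in BOTH
# columns near each other, unlike a plain lexicographic sort on (x, y).
points = [(3, 7), (0, 0), (7, 7), (1, 1)]
points.sort(key=lambda p: z_order_key(*p))
# -> [(0, 0), (1, 1), (3, 7), (7, 7)]
```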
boneanxs commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1254932365

> Did you try to re-run your benchmark after the changes we've made? If so, can you please paste the results in here

Sure, will rerun the benchmark.
boneanxs commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1238883968

Hey @alexeykudinkin, addressed all comments, could you please review again?
boneanxs commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1214672107

The CI failure does not seem related to this PR.

Thanks to @voonhous, who tested 2 cases, clustering individual parquet files of ~500MB up to 10GB groups. After enabling `hoodie.clustering.as.row`, it gives us nearly a 30% performance improvement.

### Test 1

| clustering as row enabled | Partition hour | total size | runtime (min) |
| :-: | :-: | :-: | :-: |
| true | dt=2022-07-28/hh=23 | 2.0T | 76 |
| false | dt=2022-07-28/hh=00 | 2.0T | 123 |

### Test 2

| clustering as row enabled | Partition hour | total size | File Count | runtime (min) |
| :-: | :-: | :-: | :-: | :-: |
| true | dt=2022-07-28/hh=14 | 2.5T | 7792 | 92 |
| false | dt=2022-07-28/hh=15 | 2.5T | 7771 | 128 |

The spark configuration used:

```bash
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.rpc.askTimeout=600s' \
--conf 'spark.driver.extraJavaOptions=-Djava.util.concurrent.ForkJoinPool.common.parallelism=250' \
--conf 'spark.sql.parquet.columnarReaderBatchSize=1024' \
--conf 'spark.yarn.maxAppAttempts=1' \
--num-executors 64 \
--driver-memory 20g \
--driver-cores 1 \
--executor-memory 15g \
--executor-cores 2 \
--class org.apache.hudi.utilities.HoodieClusteringJob \
hudi-utilities-bundle_2.12-0.12.0-SNAPSHOT.jar \
--props hdfs://test/2022-07-24_clustering/clusteringjob_optimized.properties \
--mode scheduleAndExecute \
--base-path hdfs://test/test/hudi/voon_kafka_test__test_hudi_011_04/ \
--table-name rank_server_log_hudi_test_1h \
--spark-memory 15g \
--parallelism 32
```

clusteringjob.properties

```bash
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=2
hoodie.clustering.plan.strategy.max.bytes.per.group=10737418240
hoodie.clustering.plan.strategy.target.file.max.bytes=11811160064
hoodie.clustering.plan.strategy.small.file.limit=6442450944
hoodie.clustering.plan.strategy.max.num.groups=1
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
hoodie.clustering.plan.partition.filter.mode=SELECTED_PARTITIONS
hoodie.clustering.plan.strategy.cluster.begin.partition=dt=2022-07-28/hh=15
hoodie.clustering.plan.strategy.cluster.end.partition=dt=2022-07-28/hh=15
hoodie.clustering.plan.strategy.sort.columns=partition,offset
```

Gentle ping @xiarixiaoyao @XuQianJin-Stars @codope, could you help review this when you have time?
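To show how the sizing knobs in the properties file above interact, here is a deliberately simplified sketch of size-based clustering planning: files under the small-file limit are candidates, candidates are greedily packed into groups capped at `max.bytes.per.group`, and at most `max.num.groups` groups are planned per run. This is an illustration of the general idea only, not Hudi's actual `SparkSizeBasedClusteringPlanStrategy` (which considers file slices, partitions, and other state).

```python
# Simplified sketch of size-based clustering planning (illustrative only;
# Hudi's SparkSizeBasedClusteringPlanStrategy is considerably more involved).

def plan_groups(file_sizes, small_file_limit, max_bytes_per_group, max_num_groups):
    """Greedily pack small files into clustering groups capped by total bytes."""
    # Only files below the small-file limit are clustering candidates.
    candidates = [s for s in file_sizes if s < small_file_limit]
    groups, current, current_bytes = [], [], 0
    for size in candidates:
        if current and current_bytes + size > max_bytes_per_group:
            groups.append(current)           # group is full, start a new one
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    return groups[:max_num_groups]           # plan at most this many per run

# With the properties above: ~6 GiB small-file limit, 10 GiB per group, 1 group.
GIB = 1024 ** 3
groups = plan_groups([GIB // 2] * 30, 6 * GIB, 10 * GIB, 1)
# One planned group of twenty 0.5 GiB files, i.e. 10 GiB rewritten per run.
```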
boneanxs commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1201038722

@hudi-bot run azure
boneanxs commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1185103626

> @boneanxs my wechat 1037817390, let's discuss this pr in wechat first. i think we can launch this pr in 0.12

Sure
boneanxs commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1181254713

@hudi-bot run azure