[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-27 Thread GitBox


boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1260281685

   > can you please create a Jira corresponding to your investigation and link 
it in here? So that it's easier to discover it
   
   Yea, sure thing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-26 Thread GitBox


boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1258921709

   > Is number of files in your tables before or after?
   
   It's before; I've added the after-clustering numbers below.
   
   ### Test1
   
   Row enabled | Partition hour | total size | file num | total size(after) | file num(after) | runtime
   -- | -- | -- | -- | -- | -- | --
   true | dt=2022-09-22/hh=14 | 233.9 G | 1.3 K | 138.1 G | 47 | 753s
   false | dt=2022-09-22/hh=22 | 209.6 G | 1.3 K | 123.6 G | 43  | 1008s
   
   ### Test2
   
   Row enabled | Partition hour | total size | file num | total size(after) | file num(after) | runtime
   -- | -- | -- | -- | -- | -- | --
   true | 2022-09-19 | 70.9 G | 7.5 K | 55.8G | 409 | 11h 7min
   false | 2022-09-20 | 69.7 G | 7.3 K | 54.5G | 397 |  11h 33min
   
   Yes, the second test has many small files, but it still confuses me why it is so slow to write files whose average size is 250M. I'm still investigating why this happens.
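
As a rough cross-check, the "after" columns of the two tables can be turned into an average output file size per run. This is only a sketch: it assumes "G" means GiB, which the tables do not state, and the function name is illustrative. Simple averages from table totals may differ from sizes observed on disk.

```python
# Average output file size per run, from the "after" columns of the tables.
# Assumption: "G" is read as GiB (1 GiB = 1024 MiB); the source does not say.

def avg_file_size_mib(total_gib: float, file_count: int) -> float:
    """Return the mean file size in MiB for a total size given in GiB."""
    return total_gib * 1024 / file_count

# (total size after clustering in GiB, file count after clustering)
runs = {
    "Test1 row=true": (138.1, 47),
    "Test1 row=false": (123.6, 43),
    "Test2 row=true": (55.8, 409),
    "Test2 row=false": (54.5, 397),
}

for name, (total_gib, files) in runs.items():
    print(f"{name}: {avg_file_size_mib(total_gib, files):.0f} MiB per file")
```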
   
   





[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-25 Thread GitBox


boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1257404461

   ### Test1
   4 flat columns
   ```bash
   # Arguments passed to spark-submit (the launcher command itself is omitted).
   # With the row writer enabled, --executor-memory and --spark-memory were 10g instead of 20g.
   --num-executors 64 \
   --driver-memory 20g \
   --driver-cores 1 \
   --executor-memory 20g \
   --executor-cores 1 \
   --class org.apache.hudi.utilities.HoodieClusteringJob \
   $PWD/../hudi-utilities-slim-bundle_2.12-0.13.0-SNAPSHOT.jar \
   --mode scheduleAndExecute \
   --base-path $TABLEPATH \
   --table-name $TABLENAME \
   --spark-memory 20g \
   --parallelism 64 \
   --hoodie-conf hoodie.clustering.async.enabled=true \
   --hoodie-conf hoodie.clustering.async.max.commits=0 \
   --hoodie-conf hoodie.clustering.plan.strategy.max.bytes.per.group=5368709120 \
   --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=6442450944 \
   --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=1073741824 \
   --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=1000
   ```
   
   Row enabled | Partition hour | total size | file num | runtime
   -- | -- | -- | -- | --
   true | dt=2022-09-22/hh=14 | 233.9 G | 1.3 K | 753s
   false | dt=2022-09-22/hh=22 | 209.6 G | 1.3 K| 1008s
   
   ### Test2
   23 columns, 9 nested columns, using z-order
   ```bash
   --conf 'spark.sql.parquet.columnarReaderBatchSize=2048' \
   --conf 'spark.yarn.maxAppAttempts=1' \
   --num-executors 32 \
   --driver-memory 20g \
   --driver-cores 1 \
   --executor-memory 30g \
   --executor-cores 1 \
   --class org.apache.hudi.utilities.HoodieClusteringJob \
   $PWD/../hudi-utilities-slim-bundle_2.12-0.13.0-SNAPSHOT.jar \
   --mode scheduleAndExecute \
   --base-path $TABLEPATH \
   --table-name $TABLENAME \
   --spark-memory 30g \
   --parallelism 32 \
   --hoodie-conf hoodie.clustering.async.enabled=true \
   --hoodie-conf hoodie.clustering.async.max.commits=0 \
   --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=209715200 \
   --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=1073741824 \
   --hoodie-conf hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy \
   --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \
   --hoodie-conf hoodie.layout.optimize.enable=true \
   --hoodie-conf hoodie.layout.optimize.strategy=z-order \
   --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=applicationId,sparkUser
   ```
   
   Row enabled | Partition hour | total size | file num | runtime
   -- | -- | -- | -- | --
   true | 2022-09-19 | 70.9 G | 7.5 K | 11h 7min
   false | 2022-09-20 | 69.7 G | 7.3 K| 11h 33min
   
   Compute performance improved by 20% to 30%. The bottleneck of this job is writing data: both runs take approximately 10 hours in the write stage.
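
For reference, the end-to-end runtimes in the two tables can be converted into relative reductions. This is a minimal sketch with illustrative helper names; note that the two runs in each test cover different partitions, so the comparison is approximate.

```python
# Relative end-to-end runtime reduction of the row-writer run versus the baseline.

def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    """Convert an h/min/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds

def reduction_pct(baseline_s: float, row_writer_s: float) -> float:
    """Percentage by which the row-writer runtime is shorter than the baseline."""
    return (baseline_s - row_writer_s) / baseline_s * 100

# Test1: 1008s (row writer disabled) vs 753s (row writer enabled)
test1 = reduction_pct(to_seconds(seconds=1008), to_seconds(seconds=753))

# Test2: 11h 33min (disabled) vs 11h 7min (enabled)
test2 = reduction_pct(to_seconds(hours=11, minutes=33), to_seconds(hours=11, minutes=7))

print(f"Test1 runtime reduction: {test1:.1f}%")
print(f"Test2 runtime reduction: {test2:.1f}%")
```

The small end-to-end gain in Test2 is consistent with the write stage dominating the runtime there.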





[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-22 Thread GitBox


boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1254932365

   > Did you try to re-run your benchmark after the changes we've made? If so, 
can you please paste the results in here
   
   Sure, will rerun the benchmark





[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-06 Thread GitBox


boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1238883968

   Hey @alexeykudinkin, I've addressed all comments. Could you please review again?





[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-08-15 Thread GitBox


boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1214672107

   The CI failure seems unrelated to this PR.
   
   Thanks to @voonhous, who tested 2 cases, clustering individual parquet files of ~500MB into groups of up to 10GB.
   
   After enabling `hoodie.clustering.as.row`, we see a nearly 30% performance improvement.
   
   ### Test 1
   | clustering as row enabled |Partition hour| total size | runtime(min) |
   | :-: | :-: | :-: | :-: |
   |true|dt=2022-07-28/hh=23|2.0T|76|
   |false|dt=2022-07-28/hh=00|2.0T|123|
   
   ### Test 2
   | clustering as row enabled |Partition hour| total size | File Count | runtime(min) |
   | :-: | :-: | :-: | :-: | :-: |
   |true|dt=2022-07-28/hh=14|2.5T|7792|92|
   |false|dt=2022-07-28/hh=15|2.5T|7771|128|
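
The same numbers can also be read as throughput. The sketch below is illustrative only: it assumes "T" means TiB, which the tables do not state, and the helper name is made up for this example.

```python
# Clustering throughput in GiB per minute, from the total size and runtime columns.
# Assumption: "T" is read as TiB (1 TiB = 1024 GiB); the source does not say.

def gib_per_minute(total_tib: float, runtime_min: float) -> float:
    """Return data processed per minute in GiB."""
    return total_tib * 1024 / runtime_min

# (label, total size in TiB, runtime in minutes)
runs = [
    ("Test 1 row=true", 2.0, 76),
    ("Test 1 row=false", 2.0, 123),
    ("Test 2 row=true", 2.5, 92),
    ("Test 2 row=false", 2.5, 128),
]

for label, total_tib, runtime_min in runs:
    print(f"{label}: {gib_per_minute(total_tib, runtime_min):.1f} GiB/min")
```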
   
   The Spark configuration used:
   
   ```bash
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
   --conf 'spark.rpc.askTimeout=600s' \
   --conf 'spark.driver.extraJavaOptions=-Djava.util.concurrent.ForkJoinPool.common.parallelism=250' \
   --conf 'spark.sql.parquet.columnarReaderBatchSize=1024' \
   --conf 'spark.yarn.maxAppAttempts=1' \
   --num-executors 64 \
   --driver-memory 20g \
   --driver-cores 1 \
   --executor-memory 15g \
   --executor-cores 2 \
   --class org.apache.hudi.utilities.HoodieClusteringJob \
   hudi-utilities-bundle_2.12-0.12.0-SNAPSHOT.jar \
   --props hdfs://test/2022-07-24_clustering/clusteringjob_optimized.properties \
   --mode scheduleAndExecute \
   --base-path hdfs://test/test/hudi/voon_kafka_test__test_hudi_011_04/ \
   --table-name rank_server_log_hudi_test_1h \
   --spark-memory 15g \
   --parallelism 32
   ```
   
   clusteringjob.properties
   
   ```bash
   hoodie.clustering.async.enabled=true
   hoodie.clustering.async.max.commits=2
   hoodie.clustering.plan.strategy.max.bytes.per.group=10737418240
   hoodie.clustering.plan.strategy.target.file.max.bytes=11811160064
   hoodie.clustering.plan.strategy.small.file.limit=6442450944
   hoodie.clustering.plan.strategy.max.num.groups=1
   hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
   hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
   hoodie.clustering.plan.partition.filter.mode=SELECTED_PARTITIONS
   hoodie.clustering.plan.strategy.cluster.begin.partition=dt=2022-07-28/hh=15
   hoodie.clustering.plan.strategy.cluster.end.partition=dt=2022-07-28/hh=15
   hoodie.clustering.plan.strategy.sort.columns=partition,offset
   ```
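
The byte-valued settings above are easier to read in binary units. The following is a small sketch, not part of Hudi: `parse_properties` and `human_size` are hypothetical helpers written for this example, and only the three size-related keys from the file are included.

```python
# Decode byte-valued clustering settings into binary units (KiB, MiB, GiB).
# Both helpers are illustrative and not part of any Hudi API.

PROPERTIES = """\
hoodie.clustering.plan.strategy.max.bytes.per.group=10737418240
hoodie.clustering.plan.strategy.target.file.max.bytes=11811160064
hoodie.clustering.plan.strategy.small.file.limit=6442450944
"""

def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result

def human_size(num_bytes: int) -> str:
    """Format a byte count using the largest binary unit up to TiB."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:g} {unit}"
        size /= 1024
    return f"{size:g} TiB"

for key, value in parse_properties(PROPERTIES).items():
    print(f"{key} = {human_size(int(value))}")
```

Read this way, the group size is 10 GiB, the target file size is 11 GiB, and the small-file limit is 6 GiB.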
   
   Gentle ping @xiarixiaoyao @XuQianJin-Stars @codope, could you help review this when you have time?





[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-08-01 Thread GitBox


boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1201038722

   @hudi-bot run azure





[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-07-14 Thread GitBox


boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1185103626

   > @boneanxs my WeChat is 1037817390, let's discuss this PR on WeChat first. I think we can launch this PR in 0.12
   
   Sure





[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-07-11 Thread GitBox


boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1181254713

   @hudi-bot run azure

