Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-22 Thread via GitHub


danny0405 closed issue #10418: hoodie.bulkinsert.shuffle.parallelism Not activated
URL: https://github.com/apache/hudi/issues/10418


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-21 Thread via GitHub


KnightChess commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1903289437

   #10532 fixes `hoodie.bulkinsert.shuffle.parallelism` not taking effect when 
the source is deduplicated.





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-07 Thread via GitHub


KnightChess commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1880500598

   @zhangjw123321 I created an issue to track it: 
https://issues.apache.org/jira/browse/HUDI-7277





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-07 Thread via GitHub


zhangjw123321 commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1880457371

   I tried setting it, and the expected number of files is now generated.
   Thank you very much. @KnightChess @ad1happy2go 





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-07 Thread via GitHub


zhangjw123321 commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1880455240

   > @zhangjw123321 you can try setting it at spark-submit time with `--conf`, 
or in code with `sparkconf.set('xxx','yyy')`. That will hit the other branch, 
which does not use the parent RDD's partition count.
   > ![image](https://github.com/apache/hudi/assets/20125927/4b21cb55-3bd6-471e-92d4-e3dade5eafaf)
   
   I tried setting it, and the expected number of files is now generated.
   Thank you very much.





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-07 Thread via GitHub


zhangjw123321 commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1880454835

   > @zhangjw123321 you can try setting it at spark-submit time with `--conf`, 
or in code with `sparkconf.set('xxx','yyy')`. That will hit the other branch, 
which does not use the parent RDD's partition count.
   > ![image](https://github.com/apache/hudi/assets/20125927/4b21cb55-3bd6-471e-92d4-e3dade5eafaf)
   
   I tried setting it, and the expected number of files is now generated.
   Thank you very much.





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-05 Thread via GitHub


KnightChess commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1878465493

   @zhangjw123321 you can try setting it at spark-submit time with `--conf`, 
or in code with `sparkconf.set('xxx','yyy')`. That will hit the other branch, 
which does not use the parent RDD's partition count.
   
![image](https://github.com/apache/hudi/assets/20125927/4b21cb55-3bd6-471e-92d4-e3dade5eafaf)
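
One way to apply the suggestion above is to pass the setting when the job is launched instead of through a SQL `set` statement. The sketch below builds such a command line in Python. The helper name `with_conf` is hypothetical, the binary path is the one from the issue report, and the value 100 is the one used elsewhere in this thread.

```python
import shlex
from typing import Dict, List


def with_conf(binary: str, confs: Dict[str, str]) -> List[str]:
    """Return an argument list that passes each setting as a --conf key=value pair."""
    args = [binary]
    for key, value in confs.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


# Hypothetical usage: launch spark-sql with the parallelism fixed at submit time.
command = with_conf(
    "/opt/software/spark-3.2.1/bin/spark-sql",
    {"spark.default.parallelism": "100"},
)
print(shlex.join(command))
```

The other options from the original launch command (`--master`, `--queue`, the Hudi catalog and extension settings) would be appended in the same way.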
   





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-05 Thread via GitHub


zhangjw123321 commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1878422256

   > @zhangjw123321 I tested locally, and it looks like 
`spark.default.parallelism` does not take effect when set through SQL `set`. 
Can you set it when submitting the Spark job, e.g. with `--conf`? Before trying 
that: how many executor cores does this job have in total, 1 core? And does 
`ods.ods_company where dt='2023-12-15'` contain only 1 file?
   
   1. I use
   --executor-cores 10
   --num-executors 20
   or 
   --executor-cores 10
   --num-executors 10
   2. Yes, 1 file.





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-05 Thread via GitHub


KnightChess commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1878402401

   @zhangjw123321 I tested locally, and it looks like 
`spark.default.parallelism` does not take effect when set through SQL `set`. 
Can you set it when submitting the Spark job, e.g. with `--conf`? Before trying 
that: how many executor cores does this job have in total, 1 core? And does 
`ods.ods_company where dt='2023-12-15'` contain only 1 file?
   





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-05 Thread via GitHub


zhangjw123321 commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1878335425

   
![image](https://github.com/apache/hudi/assets/154970920/2d9d84ce-7c13-4264-a380-4396ab767d98)
   
![image](https://github.com/apache/hudi/assets/154970920/dc297f40-a49a-4023-95a4-3a57a48c9ac1)
   set hoodie.spark.sql.insert.into.operation=bulk_insert;
   set hoodie.bulkinsert.shuffle.parallelism=100;
   set spark.default.parallelism=100;
   set spark.sql.shuffle.partitions=100;
   After setting these parameters, the Hudi table on HDFS still has only 1 file.
   
   > @zhangjw123321 it looks like `hoodie.bulkinsert.shuffle.parallelism` does 
not work on non-partitioned tables in the code. Judging from the Spark UI, you 
may not have set `spark.default.parallelism`, so `reduceByKey` uses the parent 
RDD's partition count. Can you try `set spark.default.parallelism=100;`? I think 
it will reduce the parallelism in `stage 10` to 100.
   
   





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-04 Thread via GitHub


gtzysh commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1878168602

   I think `hoodie.bulkinsert.sort.mode` would be helpful here.





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-04 Thread via GitHub


KnightChess commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1877175659

   @zhangjw123321 it looks like `hoodie.bulkinsert.shuffle.parallelism` does 
not work on non-partitioned tables in the code. Judging from the Spark UI, you 
may not have set `spark.default.parallelism`, so `reduceByKey` uses the parent 
RDD's partition count. Can you try `set spark.default.parallelism=100;`? I think 
it will reduce the parallelism in `stage 10` to 100.
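
A minimal sketch of the behaviour described above, written as plain Python rather than Spark code. It models, in simplified form, how Spark chooses the partition count for a shuffle such as `reduceByKey` when no explicit count is passed: the value of `spark.default.parallelism` if it is set, otherwise the largest partition count among the parent RDDs. The function name `default_num_partitions` is hypothetical, and the model assumes that no parent RDD already carries a partitioner.

```python
from typing import Mapping, Sequence


def default_num_partitions(
    conf: Mapping[str, str], parent_partition_counts: Sequence[int]
) -> int:
    """Simplified model of the shuffle partition count chosen by default.

    Assumption: no parent RDD already has a partitioner, so only the
    configuration and the parents' partition counts matter.
    """
    if not parent_partition_counts:
        raise ValueError("at least one parent RDD is required")
    configured = conf.get("spark.default.parallelism")
    if configured is not None:
        # An explicit setting wins over the parents' partition counts.
        return int(configured)
    # Otherwise the shuffle inherits the largest parent partition count.
    return max(parent_partition_counts)


# Without the setting, the shuffle keeps the parent's partition count.
print(default_num_partitions({}, [10000]))
# With the setting, the configured value is used instead.
print(default_num_partitions({"spark.default.parallelism": "100"}, [10000]))
```

Under these assumptions the first call yields 10000 and the second yields 100, which is the reduction in `stage 10` suggested in the comment.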





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-02 Thread via GitHub


zhangjw123321 commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1874749412

   
   
   I downloaded the source from https://dlcdn.apache.org/hudi/0.14.0/hudi-0.14.0.src.tgz 
and built hudi-spark3.2-bundle_2.12-0.14.0.jar with Maven.





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-02 Thread via GitHub


zhangjw123321 commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1874749317

   Which stage is deduplicating records? Other than the configuration above, 
no other configuration was set manually.





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-02 Thread via GitHub


ad1happy2go commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1874091034

   @zhangjw123321 It's going into record deduplication. With the default 
configs, bulk insert doesn't dedup. Are you setting any other configs? 





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-02 Thread via GitHub


zhangjw123321 commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1873836778

   
![image](https://github.com/apache/hudi/assets/154970920/084e9134-9356-4c15-b489-4a420cfcbec2)
   





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2024-01-02 Thread via GitHub


zhangjw123321 commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1873811228

   
![image](https://github.com/apache/hudi/assets/154970920/fa708f0b-f7ad-45a7-8cb3-ddf538504668)
   
![image](https://github.com/apache/hudi/assets/154970920/6c9264cb-4821-4af5-b95b-619ce07c826c)
   





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2023-12-29 Thread via GitHub


ad1happy2go commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1871871612

   @zhangjw123321 This setting controls the number of partitions after the 
shuffle stage. Can you share the Spark DAG? How many stages is it creating?





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2023-12-27 Thread via GitHub


zhangjw123321 commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1870779093

   First of all, thank you very much for your reply.
   This parameter is described under the advanced configurations shown here, 
but it has no effect after I set it. I would expect that, with the parameter 
configured normally, the number of files in the Hudi table matches the 
parallelism given by `hoodie.bulkinsert.shuffle.parallelism`. What is the 
reason this parameter has no effect?
   
![image](https://github.com/apache/hudi/assets/154970920/758b3e03-6a28-4ff8-8ea7-b2b4d77f6c83)
   





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2023-12-27 Thread via GitHub


zhangjw123321 commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1870777405

   
![image](https://github.com/apache/hudi/assets/154970920/b622352a-4680-471d-b825-34535cc0a126)
   





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2023-12-27 Thread via GitHub


zhangjw123321 commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1870775372

   This dataset was imported from MySQL into the Hive table using Sqoop with `-m 1`.





Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2023-12-27 Thread via GitHub


ad1happy2go commented on issue #10418:
URL: https://github.com/apache/hudi/issues/10418#issuecomment-1870274575

   @zhangjw123321 The number of input partitions (Spark tasks) is derived from 
the input dataset. How many tasks are being created, and what is the source of 
the dataset?





[I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]

2023-12-27 Thread via GitHub


zhangjw123321 opened a new issue, #10418:
URL: https://github.com/apache/hudi/issues/10418

   **Describe the problem you faced**
   
   1. The source table (ods.ods_company) has 1w (10,000) files.
   2. `set hoodie.bulkinsert.shuffle.parallelism=100` has no effect.
   3. After inserting into the Hudi table, the Hudi table also has 1w files, 
   so `hoodie.bulkinsert.shuffle.parallelism` has no effect.
   The expected number is 100 files, not 1w files.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1./opt/software/spark-3.2.1/bin/spark-sql \
   --master yarn --conf spark.ui.port=4049 \
   --conf spark.ui.showConsoleProgress=true \
   --conf spark.hadoop.hive.cli.print.header=true \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
   --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
   --queue root.hdfs \
   --driver-memory 5g \
   --executor-memory 20g \
   --executor-cores 10 \
   --num-executors 20
   2.CREATE  TABLE IF NOT EXISTS hudi_ods.ods_company(
   id bigint,
   *
   )using hudi
   tblproperties (
   type = 'cow',
   primaryKey = 'id',
   preCombineField = 'dt'
   )
   3.
   set hoodie.spark.sql.insert.into.operation=bulk_insert;
   set hoodie.bulkinsert.shuffle.parallelism=100;
   4.
   insert into table hudi_ods.ods_company
   select * from ods.ods_company where dt='2023-12-15';
   
   **Environment Description**
   
   * Hudi version :0.14
   
   * Spark version :3.2
   
   * Hive version :2.3.1
   
   * Hadoop version :2.10
   
   * Storage (HDFS/S3/GCS..) :HDFS
   
   * Running on Docker? (yes/no) :no
   
   
   
   

