Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
danny0405 closed issue #10418: hoodie.bulkinsert.shuffle.parallelism Not activated URL: https://github.com/apache/hudi/issues/10418 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
KnightChess commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1903289437 #10532 fixes `hoodie.bulkinsert.shuffle.parallelism` not taking effect when the source is deduplicated.
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
KnightChess commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1880500598 @zhangjw123321 I created an issue to track it: https://issues.apache.org/jira/browse/HUDI-7277
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1880457371 I tried that setting, and the expected number of files is now generated. Thank you very much. @KnightChess @ad1happy2go
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1880455240
> @zhangjw123321 you can try setting it at spark-submit time with --conf, or in code with sparkconf.set('xxx','yyy'); that will match the other branch, which does not use the parent RDD partition size ![image](https://github.com/apache/hudi/assets/20125927/4b21cb55-3bd6-471e-92d4-e3dade5eafaf)

I tried that setting, and the expected number of files is now generated. Thank you very much.
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1880454835
> @zhangjw123321 you can try setting it at spark-submit time with --conf, or in code with sparkconf.set('xxx','yyy'); that will match the other branch, which does not use the parent RDD partition size ![image](https://github.com/apache/hudi/assets/20125927/4b21cb55-3bd6-471e-92d4-e3dade5eafaf)

I tried that setting, and the expected number of files is now generated. Thank you very much.
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
KnightChess commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1878465493 @zhangjw123321 you can try setting it at spark-submit time with `--conf`, or in code with `sparkconf.set('xxx','yyy')`; that will match the other branch, which does not use the parent RDD size ![image](https://github.com/apache/hudi/assets/20125927/4b21cb55-3bd6-471e-92d4-e3dade5eafaf)
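The suggestion above, supplying the setting when the job is launched rather than with a SQL `set`, could be sketched roughly as follows. This is only an illustration: the helper `spark_conf_flags` is hypothetical, the base command is abbreviated from the reproduction steps in the issue, and using `spark.default.parallelism=100` in place of the `'xxx','yyy'` placeholders is an assumption based on the values tried elsewhere in this thread.

```python
import shlex


def spark_conf_flags(conf):
    """Turn a dict of Spark settings into repeated `--conf key=value` arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Base command abbreviated from the issue's reproduction steps.
base_cmd = ["/opt/software/spark-3.2.1/bin/spark-sql", "--master", "yarn"]

# Assumption: the setting discussed in this thread, supplied at launch time.
launch_conf = {"spark.default.parallelism": 100}

cmd = base_cmd + spark_conf_flags(launch_conf)

# Print the command line instead of executing it.
print(shlex.join(cmd))
```

The script only prints the command line, so it can be inspected before being run by hand.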
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1878422256
> @zhangjw123321 I tested locally, and it looks like `spark.default.parallelism` does not take effect when set via SQL `set`. Can you set it when submitting the Spark job, e.g. with --conf? Before trying it: how many executor cores does this job have in total when it runs? 1 core? Or does ods.ods_company where dt='2023-12-15' have 1 file?

1. I use --executor-cores 10 --num-executors 20, or --executor-cores 10 --num-executors 10.
2. Yes, 1 file.
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
KnightChess commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1878402401 @zhangjw123321 I tested locally, and it looks like `spark.default.parallelism` does not take effect when set via SQL `set`. Can you set it when submitting the Spark job, e.g. with --conf? Before trying it: how many executor cores does this job have in total when it runs? 1 core? Or does ods.ods_company where dt='2023-12-15' have 1 file?
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1878335425
![image](https://github.com/apache/hudi/assets/154970920/2d9d84ce-7c13-4264-a380-4396ab767d98) ![image](https://github.com/apache/hudi/assets/154970920/dc297f40-a49a-4023-95a4-3a57a48c9ac1)
set hoodie.spark.sql.insert.into.operation=bulk_insert;
set hoodie.bulkinsert.shuffle.parallelism=100;
set spark.default.parallelism=100;
set spark.sql.shuffle.partitions=100;
After setting these parameters, the Hudi table on HDFS still has 1 file.

> @zhangjw123321 it looks like `hoodie.bulkinsert.shuffle.parallelism` does not work on non-partitioned tables in the code. Judging from the Spark UI, maybe you have not set `spark.default.parallelism`, so `reduceByKey` will use the parent RDD's partition count. Can you try `set spark.default.parallelism=100;`? I think it will reduce the parallelism in `stage 10` to 100.
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
gtzysh commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1878168602 I think hoodie.bulkinsert.sort.mode is helpful
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
KnightChess commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1877175659 @zhangjw123321 it looks like `hoodie.bulkinsert.shuffle.parallelism` does not work on non-partitioned tables in the code. Judging from the Spark UI, maybe you have not set `spark.default.parallelism`, so `reduceByKey` will use the parent RDD's partition count. Can you try `set spark.default.parallelism=100;`? I think it will reduce the parallelism in `stage 10` to 100.
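The behavior described above can be illustrated with a small model. This is a simplified sketch, not Hudi or Spark source code: the function `reduce_by_key_partitions` and its arguments are hypothetical, the partition counts are made up, and it assumes that no explicit partition count is passed to `reduceByKey` and that no parent RDD already carries a partitioner.

```python
def reduce_by_key_partitions(parent_partition_counts, spark_conf):
    """Model how many partitions a shuffle gets when none is requested.

    parent_partition_counts: partition counts of the parent RDDs.
    spark_conf: settings the SparkContext was started with (a plain dict here).
    """
    if "spark.default.parallelism" in spark_conf:
        # An explicitly configured default parallelism wins.
        return int(spark_conf["spark.default.parallelism"])
    # Otherwise fall back to the largest parent partition count.
    return max(parent_partition_counts)


# Hypothetical numbers, for illustration only.
print(reduce_by_key_partitions([2000], {}))                                    # 2000
print(reduce_by_key_partitions([2000], {"spark.default.parallelism": "100"}))  # 100
```

In this model the configuration is whatever the job was started with; other comments in this thread report that a SQL `set` of `spark.default.parallelism` did not appear to take effect.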
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1874749412 I downloaded the source from this link, https://dlcdn.apache.org/hudi/0.14.0/hudi-0.14.0.src.tgz, and built hudi-spark3.2-bundle_2.12-0.14.0.jar with Maven.
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1874749317 Which stage is removing duplicate records? Other than the configuration above, I have not manually set any other configuration.
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
ad1happy2go commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1874091034 @zhangjw123321 It's going into deduplicating records. With the default configs, bulk insert doesn't dedup. Are you setting any other configs?
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1873836778 ![image](https://github.com/apache/hudi/assets/154970920/084e9134-9356-4c15-b489-4a420cfcbec2)
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1873811228 ![image](https://github.com/apache/hudi/assets/154970920/fa708f0b-f7ad-45a7-8cb3-ddf538504668) ![image](https://github.com/apache/hudi/assets/154970920/6c9264cb-4821-4af5-b95b-619ce07c826c)
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
ad1happy2go commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1871871612 @zhangjw123321 This setting is for the number of partitions after the shuffle stage. Can you show the Spark DAG? How many stages is it creating?
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1870779093 First of all, thank you very much for your reply. The advanced configuration is described here, but after I set this parameter it has no effect. I would expect that, with the parameter configured normally, the number of files in the Hudi table matches the parallelism given by `hoodie.bulkinsert.shuffle.parallelism`. What is the reason this parameter does not take effect? ![image](https://github.com/apache/hudi/assets/154970920/758b3e03-6a28-4ff8-8ea7-b2b4d77f6c83)
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1870777405 ![image](https://github.com/apache/hudi/assets/154970920/b622352a-4680-471d-b825-34535cc0a126)
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1870775372 This dataset was imported from MySQL into the Hive table using sqoop with `-m 1`.
Re: [I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
ad1happy2go commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1870274575 @zhangjw123321 The number of input partitions/Spark tasks is derived from the input dataset. How many tasks are being created, and what is the source of the dataset?
[I] hoodie.bulkinsert.shuffle.parallelism Not activated [hudi]
zhangjw123321 opened a new issue, #10418: URL: https://github.com/apache/hudi/issues/10418

**Describe the problem you faced**

1. The source table (ods.ods_company) has 1w files.
2. Setting hoodie.bulkinsert.shuffle.parallelism=100 does not take effect.
3. After inserting into the Hudi table, the Hudi table has 1w files, so hoodie.bulkinsert.shuffle.parallelism was not applied. The correct number is 100 files, not 1w files.

**To Reproduce**

Steps to reproduce the behavior:

1. /opt/software/spark-3.2.1/bin/spark-sql \
   --master yarn --conf spark.ui.port=4049 \
   --conf spark.ui.showConsoleProgress=true \
   --conf spark.hadoop.hive.cli.print.header=true \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
   --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
   --queue root.hdfs \
   --driver-memory 5g \
   --executor-memory 20g \
   --executor-cores 10 \
   --num-executors 20
2. CREATE TABLE IF NOT EXISTS hudi_ods.ods_company( id bigint, * ) using hudi tblproperties ( type = 'cow', primaryKey = 'id', preCombineField = 'dt' )
3. set hoodie.spark.sql.insert.into.operation=bulk_insert; set hoodie.bulkinsert.shuffle.parallelism=100;
4. insert into table hudi_ods.ods_company select * from ods.ods_company where dt='2023-12-15';

**Environment Description**

* Hudi version : 0.14
* Spark version : 3.2
* Hive version : 2.3.1
* Hadoop version : 2.10
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no