Re: [I] UPSERTs are taking time [hudi]
darlatrade commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1906993415 Yes that works -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
darlatrade commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1895937269 RECORD_INDEX is not working with bulk_insert. How do we handle loading the initial history? It is taking forever to load. Any recommendations for using bulk_insert to load 500GB as the initial load?
ad1happy2go commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1894910852 @darlatrade Did the suggestion work? Do you need any other help here?
ad1happy2go commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1812124131 @darlatrade You need to increase [hoodie.metadata.record.index.min.filegroup.count](https://hudi.apache.org/docs/configurations/#hoodiemetadatarecordindexminfilegroupcount) to a higher number, depending on the amount of data you have. Let us know if it helps. Thanks.
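To make the suggestion above concrete, here is a small sizing sketch. The per-record index size and per-file-group target used below are illustrative assumptions, not official Hudi guidance; only the config key name comes from the comment above.

```python
# Hypothetical sizing helper for
# hoodie.metadata.record.index.min.filegroup.count.
# The ~48 bytes/record and 1 GB/file-group figures are assumptions
# chosen for illustration; tune them against your own metadata table.
def estimate_rli_filegroup_count(num_records,
                                 bytes_per_record=48,
                                 max_filegroup_bytes=1 << 30):
    index_bytes = num_records * bytes_per_record
    # Ceiling division so each RLI file group stays under the target size;
    # never go below the default of 10.
    return max(10, -(-index_bytes // max_filegroup_bytes))

# ~3.2B records, as reported later in this issue:
count = estimate_rli_filegroup_count(3_210_407_531)
hudi_options = {
    "hoodie.metadata.record.index.min.filegroup.count": str(count),
}
```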
darlatrade commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1805844791 I am trying to initialize a new table with RLI. I need to load the history first, which has 3210407531 records and 520GB of data. The Spark context is shutting down when loading this much data. Also, the number of objects is huge, as in this screenshot: https://github.com/apache/hudi/assets/109939327/34b21e41-5631-498a-a15f-d3d71ba728c3 **Hoodie config:**
```
"className": "org.apache.hudi",
"hoodie.table.name": tgt_tbl,
"hoodie.datasource.write.recordkey.field": "id",
"hoodie.datasource.write.precombine.field": "eff_fm_cent_tz",
"hoodie.datasource.write.operation": "upsert",
"hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.partitionpath.field": "year,month",
"hoodie.datasource.hive_sync.support_timestamp": "true",
"hoodie.datasource.hive_sync.enable": "true",
"hoodie.datasource.hive_sync.assume_date_partitioning": "false",
"hoodie.datasource.hive_sync.table": tgt_tbl,
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.hive_sync.mode": "hms",
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.write.hive_style_partitioning": "true",
"hoodie.bulkinsert.shuffle.parallelism": hudi_Insert_parallelism,
"hoodie.index.type": "RECORD_INDEX",
"hoodie.metadata.record.index.enable": "true",
"hoodie.metadata.enable": "true"
```
Error: https://github.com/apache/hudi/assets/109939327/ec7c203d-b720-427a-91e0-c4e7a43615a0 Can you suggest what parameters need to be used to load this data? I need to load the history first before starting deltas.
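For the initial-history problem described above, a common pattern is to do the first write with `bulk_insert` and only then switch the same options to `upsert` for the deltas. The sketch below is an illustration under that assumption: the option keys are standard Hudi write configs taken from the thread, while the parallelism value is a made-up placeholder.

```python
# Sketch: load the 520 GB history with bulk_insert first, then reuse
# the same base options with upsert for incremental writes.
# The parallelism value below is an illustrative assumption.
base_options = {
    "hoodie.table.name": "tgt_tbl",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "eff_fm_cent_tz",
    "hoodie.datasource.write.partitionpath.field": "year,month",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.metadata.record.index.enable": "true",
    "hoodie.metadata.enable": "true",
}

initial_load = dict(base_options, **{
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.bulkinsert.shuffle.parallelism": "2000",  # placeholder value
})
incremental = dict(base_options, **{
    "hoodie.datasource.write.operation": "upsert",
})

# Usage (Spark):
# df.write.format("hudi").options(**initial_load).mode("append").save(path)
```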
soumilshah1995 commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1804545703 With Apache Hudi 0.14 you can leverage RLI for faster UPSERTs:
```
'hoodie.metadata.record.index.enable': 'true',
'hoodie.index.type': 'RECORD_INDEX'
```
Sample code can be found at https://soumilshah1995.blogspot.com/2023/10/apache-hudi-014-announces.html
darlatrade commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1804498900 What are the Hadoop configs to be considered to load 500GB of data into monthly partitions with RLI on 0.14?
nsivabalan commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1803119793 And yes, upgrading to 0.14.0, you can leverage RLI, and that should definitely boost your index and write latencies.
nsivabalan commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1803119446 Yeah. As I suggested before, you may want to try a MOR table, and try using the SIMPLE index. In 0.10.1 Hudi uses the bloom index, and for random keys it might incur some unnecessary overhead.
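A minimal sketch of that suggestion, for someone staying on 0.10.x: a MOR table with the SIMPLE index, which avoids bloom-filter lookups for random keys. The parallelism value is an illustrative assumption.

```python
# Sketch of the MOR + SIMPLE index suggestion above.
# The parallelism value is a placeholder, not a recommendation.
mor_simple_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.index.type": "SIMPLE",
    "hoodie.simple.index.parallelism": "200",  # placeholder value
}

# Usage (Spark):
# df.write.format("hudi").options(**mor_simple_options).mode("append").save(path)
```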
soumilshah1995 commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1802964892 Can you use the new RLI? https://www.linkedin.com/pulse/upsert-performance-evaluation-hudi-014-spark-341-record-soumil-shah-oupre%3FtrackingId=PeKhUkGNTkuSD1VRqoI3rw%253D%253D/?trackingId=PeKhUkGNTkuSD1VRqoI3rw%3D%3D
darlatrade commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1797388578 @nsivabalan Any inputs on this?
darlatrade commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793304296 Here is how "id" is derived:
```
df.withColumn("id", concat("evnt_cent_tz", lit("_"), md5(concat("key_col1", "key_col2", "key_col3", "evnt_cent_tz"))))
```
Sample values from the table: https://github.com/apache/hudi/assets/109939327/ec0655cf-766f-4ec5-93d9-a857aaf9d910
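For illustration, the same key shape can be reproduced in plain Python with hashlib. The column names match the Spark expression above; the sample values are made up. Note the key is prefixed with the event timestamp, which is what makes it partially time-ordered.

```python
import hashlib

# Plain-Python equivalent of the Spark expression above:
#   id = evnt_cent_tz + "_" + md5(key_col1 || key_col2 || key_col3 || evnt_cent_tz)
def derive_id(key_col1, key_col2, key_col3, evnt_cent_tz):
    digest = hashlib.md5(
        (key_col1 + key_col2 + key_col3 + evnt_cent_tz).encode("utf-8")
    ).hexdigest()
    return f"{evnt_cent_tz}_{digest}"

# Made-up sample values:
rid = derive_id("a", "b", "c", "2023-01-01 00:00:00")
```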
nsivabalan commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793285715 Got it. May I know what your record key comprises? I see it is "id", but is it a random id, or does it refer to timestamp-based keys? If the values are timestamp-based, we could trigger clustering on the record key, so that your updates are confined to fewer file groups per partition (but a large percentage of records within each file group) instead of updating a very small percentage across a large number of file groups.
darlatrade commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793259099 @nsivabalan
1. Size of the table and number of file objects in the root folder of the table: https://github.com/apache/hudi/assets/109939327/bdee4b13-f9d5-4edf-9e69-92d13a24fe79
2. Yes, it is a COW table.
4. Yes, your calculation on file groups is correct. We can think of MOR for future upgrades; we may not be able to switch right now.
5. Sure, I will remove those configs and rerun.
nsivabalan commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793247811 If my understanding of your pipeline/workload is wrong, let's sync up in the Hudi OSS workspace; we can see what's going on.
nsivabalan commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793247596 Hey @darlatrade, can you help with some more info?
1. What is the size of the table?
2. I assume it is a COW table.
3. Based on your stats, it looks like we have a minimum of 60 file groups per partition, and we are updating 12 partitions, which comes to 720 file groups. So this will involve rewriting 720 parquet files, for which Hudi might spin up 720 tasks. With a COW table, it is known that updating a very small percentage of data spread across a lot of file groups can incur some overhead. If this matches your workload, and you prefer faster write times, you may want to try a MOR table.
4. Also, I see you are on 0.10.1. Some of the configs you have shared may not be applicable in 0.10, so you may want to remove them:
```
"hoodie.metadata.index.bloom.filter.enable": "true",
"hoodie.metadata.index.bloom.filter.parallelism": 100,
"hoodie.metadata.index.bloom.filter.column.list": "id",
"hoodie.bloom.index.use.metadata": "true",
"hoodie.metadata.index.column.stats.enable": "true",
"hoodie.metadata.index.column.stats.column.list": "col1,col2,col3",
"hoodie.enable.data.skipping": "true"
```
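The file-group arithmetic in point 3 can be checked directly (the 60-per-partition and 12-partition figures come from the comment above):

```python
# File groups rewritten per commit, per the estimate above:
file_groups_per_partition = 60
partitions_touched = 12
rewritten = file_groups_per_partition * partitions_touched
# A COW upsert rewrites every touched parquet file, so Spark will
# schedule on the order of `rewritten` write tasks for this commit.
```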
darlatrade commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793007675 Thanks for the quick reply, @vinothchandar. We completed most of our testing on 0.10.1 and may not be able to upgrade soon, but I can at least try 0.14 and test this table if that helps. Here is the stages screenshot: https://github.com/apache/hudi/assets/109939327/d45eda0f-611e-47af-ac02-ebd6781848e1
vinothchandar commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1792986756 @darlatrade Just to weed anything out, is it easy for you to try this table on the 0.14 version in a test/staging environment? Do you have a Spark Stages UI screenshot? We can see from there how the input amplifies across the different stages.
darlatrade commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1792891849 The commit file has 16745 lines. I have month-level partitions, and the last commit touched almost one year of partitions (12). We maintain 3 years, i.e. 36 partitions (12 per year). It looks like 700 file groups were updated (found in the "fileIdAndRelativePaths" section).
ad1happy2go commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1792819209 You can try opening this commit file and seeing how many file groups are being updated as part of this commit. How many partitions do you have in your table?
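A sketch of that inspection in plain Python. The JSON below is synthetic and heavily trimmed; real `.commit` metadata contains many more fields, but the "fileIdAndRelativePaths" map referenced in this thread associates each touched file group id with its written file, so its size gives the file-group count.

```python
import json

# Synthetic, trimmed stand-in for a Hudi .commit file (the real file
# has many more fields; only "fileIdAndRelativePaths" is used here).
commit_json = """
{
  "fileIdAndRelativePaths": {
    "fg-001": "year=2023/month=01/fg-001_0-1-2_20231101.parquet",
    "fg-002": "year=2023/month=02/fg-002_0-3-4_20231101.parquet"
  }
}
"""

commit = json.loads(commit_json)
# Each key is a file group that this commit rewrote:
file_groups_touched = len(commit["fileIdAndRelativePaths"])
# Each value's directory prefix is the partition path:
partitions_touched = {p.rsplit("/", 1)[0]
                      for p in commit["fileIdAndRelativePaths"].values()}
```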
darlatrade commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1792681479 Thanks for the reply. Here is the stage detail; I am not sure where to look for the exact size: https://github.com/apache/hudi/assets/109939327/14db67f3-aebd-4ed4-b2f0-48ab5398171d I see in .hoodie that it is inserting 700KB: ![](https://user-images.githubusercontent.com/109939327/280098820-10d083b4-a223-4dba-9411-33e3bd1bae76.jpg)
ad1happy2go commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1792587002 @darlatrade As I see it is taking time in "Doing partition and writing data", it probably means your incremental load is touching a lot of file groups, so Hudi has to rewrite a lot of parquet files since it is a COW table. Can you check how much data got written by this stage in the Spark UI?