Re: [I] UPSERTs are taking time [hudi]

2024-01-23 Thread via GitHub


darlatrade commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1906993415

   Yes that works





Re: [I] UPSERTs are taking time [hudi]

2024-01-17 Thread via GitHub


darlatrade commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1895937269

   RECORD_INDEX is not working with bulk_upsert. How do we handle loading the initial history? It's taking forever to load. Any recommendations for using bulk_upsert to load 500 GB for the initial load?





Re: [I] UPSERTs are taking time [hudi]

2024-01-16 Thread via GitHub


ad1happy2go commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1894910852

   @darlatrade Did the suggestion work? Do you need any other help here?





Re: [I] UPSERTs are taking time [hudi]

2023-11-15 Thread via GitHub


ad1happy2go commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1812124131

   @darlatrade You need to increase [hoodie.metadata.record.index.min.filegroup.count](https://hudi.apache.org/docs/configurations/#hoodiemetadatarecordindexminfilegroupcount) to a higher number depending upon the amount of data you have. Let us know if it helps. Thanks.
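   For illustration, a minimal sketch of how that option could be passed along with the other record-index settings in a PySpark write; the SparkSession setup, source path, target path, and the value 100 are placeholders, not recommendations:
   
   ```
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.appName("hudi-rli-init").getOrCreate()
   df = spark.read.parquet("s3://bucket/staging/history/")  # placeholder source
   
   hudi_options = {
       "hoodie.table.name": "tgt_tbl",
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.metadata.enable": "true",
       "hoodie.metadata.record.index.enable": "true",
       "hoodie.index.type": "RECORD_INDEX",
       # Spread the record index over more metadata file groups for a large table;
       # pick this number based on your total record count (100 is only an example).
       "hoodie.metadata.record.index.min.filegroup.count": "100",
   }
   
   (df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save("s3://bucket/warehouse/tgt_tbl"))  # placeholder target path
   ```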





Re: [I] UPSERTs are taking time [hudi]

2023-11-10 Thread via GitHub


darlatrade commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1805844791

   I am trying to initialize a new table with RLI. I need to load the history first, which has 3,210,407,531 records and 520 GB of data.
   
   The Spark context is shutting down while loading this much data. Also, the number of objects is huge, as in the screenshot below:
   (screenshot: https://github.com/apache/hudi/assets/109939327/34b21e41-5631-498a-a15f-d3d71ba728c3)
   
   **Hoodie config:**
   "className": "org.apache.hudi",
   "hoodie.table.name": tgt_tbl,
   "hoodie.datasource.write.recordkey.field": "id",
   "hoodie.datasource.write.precombine.field": "eff_fm_cent_tz",
   "hoodie.datasource.write.operation": "upsert",
   "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
   "hoodie.datasource.write.partitionpath.field": "year,month",
   "hoodie.datasource.hive_sync.support_timestamp": "true",
   "hoodie.datasource.hive_sync.enable": "true",
   "hoodie.datasource.hive_sync.assume_date_partitioning": "false",
   "hoodie.datasource.hive_sync.table": tgt_tbl,
   "hoodie.datasource.hive_sync.use_jdbc": "false",
   "hoodie.datasource.hive_sync.mode": "hms",
   "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
   "hoodie.datasource.write.hive_style_partitioning": "true",
   "hoodie.bulkinsert.shuffle.parallelism": hudi_Insert_parallelism,
   "hoodie.index.type": "RECORD_INDEX",
   "hoodie.metadata.record.index.enable": "true",
   "hoodie.metadata.enable": "true"
   
   Error:
   (screenshot: https://github.com/apache/hudi/assets/109939327/ec7c203d-b720-427a-91e0-c4e7a43615a0)
   
   Can you suggest what parameters should be used to load this data? I need to load the history first before starting deltas.





Re: [I] UPSERTs are taking time [hudi]

2023-11-09 Thread via GitHub


soumilshah1995 commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1804545703

   Well, with Apache Hudi 0.14 you can leverage RLI for faster UPSERTs:
   
   ```
   'hoodie.metadata.record.index.enable': 'true',
   'hoodie.index.type': 'RECORD_INDEX'
   ```
   
   Sample code can be found at https://soumilshah1995.blogspot.com/2023/10/apache-hudi-014-announces.html
   





Re: [I] UPSERTs are taking time [hudi]

2023-11-09 Thread via GitHub


darlatrade commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1804498900

   What are the Hadoop configs to be considered for loading 500 GB of data into monthly partitions with RLI on 0.14?





Re: [I] UPSERTs are taking time [hudi]

2023-11-08 Thread via GitHub


nsivabalan commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1803119793

   And yeah, upgrading to 0.14.0 you can leverage RLI, which should definitely boost your index and write latencies.





Re: [I] UPSERTs are taking time [hudi]

2023-11-08 Thread via GitHub


nsivabalan commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1803119446

   Yeah. As I suggested before, you may want to try a MOR table, and try using the SIMPLE index. In 0.10.1 Hudi uses the bloom index, and for random keys it might incur some unnecessary overhead.
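   
   For reference, a rough sketch of the write options that suggestion maps to on 0.10.x; the table name is a placeholder, and note that the table type is fixed at creation, so switching an existing table to MOR generally means rewriting it:
   
   ```
   # Hypothetical option overrides for trying MOR plus the SIMPLE index (Hudi 0.10.x).
   hudi_options = {
       "hoodie.table.name": "tgt_tbl",  # placeholder
       # MOR defers merging to compaction instead of rewriting parquet files on every upsert.
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       # SIMPLE index joins incoming keys against existing file slices, avoiding
       # bloom-filter lookups that can be wasteful for random keys.
       "hoodie.index.type": "SIMPLE",
   }
   ```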
   





Re: [I] UPSERTs are taking time [hudi]

2023-11-08 Thread via GitHub


soumilshah1995 commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1802964892

   Can you use the new RLI?
   
   
https://www.linkedin.com/pulse/upsert-performance-evaluation-hudi-014-spark-341-record-soumil-shah-oupre%3FtrackingId=PeKhUkGNTkuSD1VRqoI3rw%253D%253D/?trackingId=PeKhUkGNTkuSD1VRqoI3rw%3D%3D





Re: [I] UPSERTs are taking time [hudi]

2023-11-06 Thread via GitHub


darlatrade commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1797388578

   @nsivabalan Any inputs on this?





Re: [I] UPSERTs are taking time [hudi]

2023-11-03 Thread via GitHub


darlatrade commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793304296

   Here is how "id" is derived:
   
   from pyspark.sql.functions import concat, lit, md5
   df = df.withColumn("id", concat("evnt_cent_tz", lit("_"), md5(concat("key_col1", "key_col2", "key_col3", "evnt_cent_tz"))))
   
   Sample values from the table:
   (screenshot: https://github.com/apache/hudi/assets/109939327/ec0655cf-766f-4ec5-93d9-a857aaf9d910)
   





Re: [I] UPSERTs are taking time [hudi]

2023-11-03 Thread via GitHub


nsivabalan commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793285715

   Got it.
   May I know what your record key comprises? I see it is "id", but is it a random id or does it refer to some timestamp-based key? If it is timestamp-based, we could trigger clustering on the record key, so chances are your updates would be confined to fewer file groups per partition (but a large percentage of records within each file group) instead of updating a very small percentage of records across a large number of file groups.
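   
   For context, a rough sketch of the clustering options this refers to, assuming the record key column "id" would be the sort key; the commit frequency is just an example value:
   
   ```
   # Hypothetical inline-clustering settings to co-locate records by key.
   clustering_options = {
       "hoodie.clustering.inline": "true",
       # Trigger clustering every N commits (example value).
       "hoodie.clustering.inline.max.commits": "4",
       # Sort data files by the record key so updates land in fewer file groups.
       "hoodie.clustering.plan.strategy.sort.columns": "id",
   }
   ```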
   
   





Re: [I] UPSERTs are taking time [hudi]

2023-11-03 Thread via GitHub


darlatrade commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793259099

   @nsivabalan 
   
   1. Size of the table and number of file objects in the root folder of the table:
   
   (screenshot: https://github.com/apache/hudi/assets/109939327/bdee4b13-f9d5-4edf-9e69-92d13a24fe79)
   
   2. Yes, it's a COW table.
   
   4. Yes, your calculation on file groups is correct. We can think of MOR for future upgrades; we may not be able to switch right now.
   5. Sure, I will remove those configs and run again.





Re: [I] UPSERTs are taking time [hudi]

2023-11-03 Thread via GitHub


nsivabalan commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793247811

   If my understanding of your pipeline/workload is wrong, let's sync up in the Hudi OSS workspace and we can see what's going on.
   





Re: [I] UPSERTs are taking time [hudi]

2023-11-03 Thread via GitHub


nsivabalan commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793247596

   Hey @darlatrade:
   Can you help with some more info?
   1. What's the size of the table?
   2. I assume it's a COW table.
   3. Based on your stats, it looks like there are at least 60 file groups per partition and you are updating 12 partitions, which comes to about 720 file groups. So the write involves rewriting ~720 parquet files, for which Hudi might spin up ~720 tasks. With a COW table, it is known that updating a very small percentage of data spread across a lot of file groups might result in some overhead.
   
   If this matches your workload, and if you prefer faster write times, maybe you can try a MOR table.
   
   4. Also, I see you are on 0.10.1. Some of the configs you have shared may not be applicable in 0.10, so you may want to remove them just in case:
   ```
   "hoodie.metadata.index.bloom.filter.enable": "true",
   "hoodie.metadata.index.bloom.filter.parallelism": 100,
   "hoodie.metadata.index.bloom.filter.column.list": "id",
   "hoodie.bloom.index.use.metadata": "true",
   "hoodie.metadata.index.column.stats.enable": "true",
   "hoodie.metadata.index.column.stats.column.list": "col1,col2,col3",
   "hoodie.enable.data.skipping": "true"
   ``` 
   
   





Re: [I] UPSERTs are taking time [hudi]

2023-11-03 Thread via GitHub


darlatrade commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793007675

   Thanks for the quick reply @vinothchandar.
   We completed most of our testing on 0.10.1 and may not be able to upgrade soon, but at least I can try 0.14 and test this table if that helps.
   
   Here is the stages screenshot:
   (screenshot: https://github.com/apache/hudi/assets/109939327/d45eda0f-611e-47af-ac02-ebd6781848e1)
   





Re: [I] UPSERTs are taking time [hudi]

2023-11-03 Thread via GitHub


vinothchandar commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1792986756

   @darlatrade Just to weed anything out, is it easy for you to try this table on the 0.14 version in a test/staging environment?
   
   Do you have a Spark Stages UI screenshot? We can see from there how the input amplifies across the different stages.





Re: [I] UPSERTs are taking time [hudi]

2023-11-03 Thread via GitHub


darlatrade commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1792891849

   The commit file has 16745 lines.
   I have month-level partitions and the last commit touched almost one year's worth (12) of partitions. We are maintaining 3 years, i.e. 36 partitions (12 per year). It looks like 700 file groups were updated (found in the "fileIdAndRelativePaths" section).
   
   





Re: [I] UPSERTs are taking time [hudi]

2023-11-03 Thread via GitHub


ad1happy2go commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1792819209

   You can try opening this commit file and see how many file groups are being updated as part of the commit. How many partitions do you have in your table?
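   
   If it helps, a rough sketch of counting the touched file groups from a commit file copied locally; the filename is hypothetical, and "fileIdAndRelativePaths" is the section @darlatrade refers to in this thread:
   
   ```
   import json
   
   # Hypothetical commit filename copied out of the table's .hoodie/ folder.
   with open("20231102190000000.commit") as f:
       commit = json.load(f)
   
   # Each entry maps a file group id to the file this commit (re)wrote for it.
   print("file groups written:", len(commit.get("fileIdAndRelativePaths", {})))
   
   # Write stats are also grouped per partition.
   print("partitions touched:", len(commit.get("partitionToWriteStats", {})))
   ```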





Re: [I] UPSERTs are taking time [hudi]

2023-11-03 Thread via GitHub


darlatrade commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1792681479

   Thanks for the reply.
   
   Here is the stage detail. Not sure where to look for the exact size.
   (screenshot: https://github.com/apache/hudi/assets/109939327/14db67f3-aebd-4ed4-b2f0-48ab5398171d)
   
   I see in .hoodie that it's inserting 700KB:
   
   ![](https://user-images.githubusercontent.com/109939327/280098820-10d083b4-a223-4dba-9411-33e3bd1bae76.jpg)
   
   





Re: [I] UPSERTs are taking time [hudi]

2023-11-03 Thread via GitHub


ad1happy2go commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1792587002

   @darlatrade Since it is taking time in the "Doing partition and writing data" stage, your incremental load is probably touching a lot of file groups, so Hudi has to rewrite a lot of parquet files because it is a COW table. Can you check how much data got written by this stage in the Spark UI?

