vinothchandar commented on issue #1498: Migrating parquet table to hudi issue [SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1498#issuecomment-612709170

A few clarifications:

> For initial bench-marking we generate standard tpch data

Is it possible to share the data generation tool with us, or point us to a way of reproducing this ourselves locally? We can move much faster if we are able to repro this on our end.

> Schema for lineitem

What are your record key and partition key? (I was unable to spot this in the snippet above.) If you have a monotonically increasing key (say, a timestamp prefix) and already know the partition to which an incremental record (IIUC this means incoming writes) belongs, then upsert performance will be optimal.

> We have 2 versions of generated updates, one which touches only the last quarter of the year and another one generated randomly to touch most of the parquet parts during updates.

If you have a workload that will touch every file, then you could use #1402, which is being built. Bloom filter checking will lead to opening up all the files in that scenario anyway.

> Currently we generate no duplicates for the base table and increments.

By default, upsert will also de-dupe the increments once. So if no duplicates is the norm, you can set `hoodie.combine.before.upsert=false` to avoid an extra shuffle.

> As you can see the data has been shuffled to disk twice, applying the sorting twice as well.

Is this without overriding the user-defined partitioner? By the way, the two jobs you see are how a Spark sort works: the first job does reservoir sampling to compute the partition ranges, and the second one actually sorts.

> Eagerly persist the input RDD before the bulk-insert, which uses the same sorting provided before the bulk-insert.

bulk_insert was designed to do an initial sort and write data without incurring the large memory overheads associated with caching. The Spark cache is an LRU, so it will thrash a fair bit if you start spilling due to lack of memory.
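To make the configs above concrete, here is a minimal sketch of how the record key, partition path, and `hoodie.combine.before.upsert` settings are passed as Spark datasource write options. The DataFrame name, table path, and TPC-H column choices (`l_orderkey`, `l_shipdate`) are my assumptions for illustration, not something confirmed in this thread:

```scala
// Sketch only: assumes a SparkSession and a `lineitemDf` DataFrame are in scope,
// and that l_orderkey / l_shipdate are suitable record/partition keys (hypothetical).
lineitemDf.write
  .format("hudi")
  .option("hoodie.datasource.write.recordkey.field", "l_orderkey")      // record key
  .option("hoodie.datasource.write.partitionpath.field", "l_shipdate")  // partition key
  .option("hoodie.datasource.write.operation", "upsert")
  // increments are known to be duplicate-free, so skip the pre-combine shuffle
  .option("hoodie.combine.before.upsert", "false")
  .option("hoodie.table.name", "lineitem")
  .mode("append")
  .save("/tmp/hudi/lineitem")
```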
I would not recommend trying this.

> Note: this approach impact the Upsert time significantly specially if you didn't apply any sorting to the data, it might be because the upsert operation touched most of the parquet parts.

Yes, you are right. Sorting gives you a dataset that is initially ordered by keys, and if you have ordered keys, hudi will preserve this and extract upsert performance by filtering out files whose key ranges don't match during indexing. At Uber, when we moved all the tables to hudi, we found this one-time sort well worth the initial cost; it repaid itself many times over the course of a quarter.

> it is actually a narrow transformation after the sorting operation.

It's the action that triggers the actual parquet writing. So the 30-odd seconds you see is the actual cost of writing the data.

> If I get your suggestion right, would you suggest to initially load the table using upsert or insert operation for the whole table instead of bulk-insert?

No. bulk_insert + sorting is what I recommend (with good key design) for a large-scale deployment like the one you are describing. If you don't want to convert all the data, @bvaradar & @umehrot2 will add the ability to seamlessly bootstrap the data into hudi without rewriting in the next release. (You will also have the ability to do a one-time bulk_insert for the last N partitions to get the upsert performance benefits, as discussed above.)

I would be happy to jump on a call with you folks and get this moving along. I am also very excited to work with a user like yourself and push the perf aspects of the project forward.
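For the recommended one-time sorted initial load, a hedged sketch of what a bulk_insert might look like follows; the parallelism value, path, and column names are assumptions for illustration:

```scala
// Sketch only: one-time initial load via bulk_insert, which sorts records by key
// before writing, avoiding the memory overhead of caching. Assumes `fullDf` holds
// the full historical table; column names and path are hypothetical.
fullDf.write
  .format("hudi")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.recordkey.field", "l_orderkey")
  .option("hoodie.datasource.write.partitionpath.field", "l_shipdate")
  // tune to target file sizes for the initial load (illustrative value)
  .option("hoodie.bulkinsert.shuffle.parallelism", "1500")
  .option("hoodie.table.name", "lineitem")
  .mode("overwrite")
  .save("/tmp/hudi/lineitem")
```

Subsequent incremental writes would then use `upsert` mode in append, benefiting from the key ordering established here via range-based file filtering during indexing.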
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services