vinothchandar commented on issue #1498: Migrating parquet table to hudi issue [SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1498#issuecomment-612709170
 
 
   A few clarifications: 
   
   >For initial benchmarking we generate standard TPC-H data 
   
   Is it possible to share the data generation tool with us, or point us to a way to reproduce this locally? We can move much faster if we are able to repro this ourselves.. 
   
   >Schema for lineitem
   
   What's your record key and partition key? (I was unable to spot this in the snippet above).. If you have a monotonically increasing key (say, a timestamp prefix) and already know the partition to which an increment record (IIUC this means incoming writes) belongs, then upsert performance will be optimal.
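
   For reference, here's a minimal sketch of wiring those keys into a Spark datasource write; the field names (`l_orderkey`, `l_shipdate`, `l_commitdate`) are hypothetical stand-ins for whatever your lineitem schema actually uses:

   ```scala
   // Sketch of a Hudi upsert via the Spark datasource; field names are
   // hypothetical placeholders, the config keys are the standard ones.
   import org.apache.spark.sql.SaveMode

   df.write.format("org.apache.hudi")
     .option("hoodie.datasource.write.recordkey.field", "l_orderkey")     // record key
     .option("hoodie.datasource.write.partitionpath.field", "l_shipdate") // partition key
     .option("hoodie.datasource.write.precombine.field", "l_commitdate")  // pick latest among dupes
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.table.name", "lineitem")
     .mode(SaveMode.Append)
     .save("/path/to/lineitem")
   ```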
   
    
   >We have 2 versions of generated updates, one which touches only the last 
quarter of the year and another one generated randomly to touch most of the 
parquet parts during updates.
   
   If you have a workload that will touch every file, then you could use #1402, which is being built. Bloom filter checking will lead to opening up all the files in that scenario anyway... 
   
   >Currently we generate no duplicates for the base table and increments.
   
   By default, upsert will also de-dupe the increments once.. So if this is the norm, you can set `hoodie.combine.before.upsert=false` to avoid an extra shuffle. 
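
   For illustration, that's a single option on the same hypothetical write as above:

   ```scala
   // If the increments are guaranteed duplicate-free, skip the
   // pre-combine pass and save one shuffle:
   import org.apache.spark.sql.SaveMode

   df.write.format("org.apache.hudi")
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.combine.before.upsert", "false")
     // ... record key / partition path / table name options as before
     .mode(SaveMode.Append)
     .save("/path/to/lineitem")
   ```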
   
   >As you can see, the data has been shuffled to disk twice, applying the sort twice as well.
   
   So this is without overriding the user-defined partitioner? BTW, the two jobs you see are how Spark's sort works: the first job does reservoir sampling to compute the range boundaries, and the second one actually sorts.. 
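
   If you want to see the two jobs for yourself, here's a toy repro (nothing Hudi-specific; paths are placeholders):

   ```scala
   // Job 1: reservoir-sample the input to compute range boundaries
   // (Spark's RangePartitioner); job 2: shuffle into ranges and sort.
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().master("local[*]").appName("sort-jobs").getOrCreate()
   import spark.implicits._

   spark.range(0L, 10000000L)
     .map(i => (i % 997, i))
     .toDF("key", "value")
     .sort("key")
     .write.mode("overwrite").parquet("/tmp/sort-jobs")
   ```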
   
   > Eagerly persist the input RDD before the bulk-insert, which uses the same 
sorting provided before the bulk-insert.
   
   bulk_insert was designed to do an initial sort and write the data without incurring the large memory overheads associated with caching.. The Spark cache is an LRU, so it will thrash a fair bit if you start spilling due to lack of memory. I would not recommend trying this.. 
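
   Instead of persist-then-write, I'd hand the raw input straight to bulk_insert and let it do the sort; a sketch (the parallelism value is a placeholder you'd tune to your cluster):

   ```scala
   // One-time sorted load via bulk_insert; no caching of the input needed.
   import org.apache.spark.sql.SaveMode

   df.write.format("org.apache.hudi")
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .option("hoodie.bulkinsert.shuffle.parallelism", "1500") // placeholder value
     // ... record key / partition path / table name options as before
     .mode(SaveMode.Append)
     .save("/path/to/lineitem")
   ```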
   
   > Note: this approach impacts the upsert time significantly, especially if you didn't apply any sorting to the data; it might be because the upsert operation touched most of the parquet parts.
   
   Yes.. you are right.. sorting gives you a dataset that is initially ordered by keys, and if you have ordered keys, Hudi will preserve this and extract upsert performance by filtering out files not in range during indexing.. At Uber, when we moved all the tables to Hudi, we found this one-time sort well worth the initial cost.. It repaid itself many times over the course of a quarter. 
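
   To make the key design concrete, here's a hypothetical example that prefixes the record key with a timestamp (field names made up); with keys like this, the bloom index's range pruning (`hoodie.bloom.index.prune.by.ranges`, on by default) can skip files whose key range cannot contain an incoming record:

   ```scala
   // Build a sortable record key: timestamp prefix + unique id.
   import org.apache.spark.sql.functions.{col, concat, date_format, lit}

   val keyed = df.withColumn("row_key",
     concat(date_format(col("l_shipdate"), "yyyyMMdd"), lit("_"), col("l_orderkey")))
   // then point hoodie.datasource.write.recordkey.field at "row_key"
   ```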
   
   
   >it is actually a narrow transformation after the sorting operation.
   
   It's the action that triggers the actual parquet writing. So the 30-odd seconds you see is the actual cost of writing the data.. 
   
   >If I get your suggestion right, would you suggest to initially load the 
table using upsert or insert operation for the whole table instead of 
bulk-insert?
   
   No.. bulk_insert + sorting is what I recommend (with good key design) for a large-scale deployment like the one you are describing.. If you don't want to convert all the data, @bvaradar & @umehrot2 are adding the ability to seamlessly bootstrap existing data into Hudi without rewriting it in the next release.. (You'll also be able to do a one-time bulk_insert for the last N partitions to get the upsert performance benefits we discussed above)..
   
   I would be happy to jump on a call with you folks and get this moving along.. I am also very excited to work with a user like yourself and move the perf aspects of the project forward.. 
   
   
   
   
   
   
