bwu2 edited a comment on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-585512613
 
 
   @vinothchandar Thanks for taking the time to reply!
   
   Let me describe the simplest example of this problem on a tiny COW data set: 
create a data frame with 4m rows and one column holding the values 1, 2, 3, ..., 4m. 
Bulk insert that into Hudi (using the one column as the 
`recordkey`). This takes ~1 minute to run and the data size is about 30MB. Now 
upsert the same data frame into the table a second time. This takes >2 hours to 
run.
   
   Alternatively, if we upsert a new data frame with values 4000001 to 8m (still 
4m rows upserted, but none matching an existing key), this takes ~1 minute to run.
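
   The steps above can be sketched roughly as follows in PySpark. This is a minimal sketch, not the exact script used: the column name `key`, the table name `repro_table`, and the helper names are hypothetical, and reusing the key column as the precombine field is an assumption. The partitioning and key-generator settings are not stated here, so they are left to be passed in through `extra`.

```python
ROWS = 4_000_000
RECORD_KEY = "key"  # hypothetical name for the single column


def key_range(batch):
    """Return (start, end_exclusive): batch 0 is 1..4m, batch 1 is 4000001..8m."""
    start = batch * ROWS + 1
    return start, start + ROWS


def hudi_options(operation, table_name="repro_table", extra=None):
    """Build the Hudi write options for 'bulk_insert' or 'upsert'."""
    opts = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": RECORD_KEY,
        # Assumption: with only one column, it doubles as the precombine field.
        "hoodie.datasource.write.precombine.field": RECORD_KEY,
        "hoodie.datasource.write.operation": operation,
    }
    # Partition path / key generator settings are not given in the comment;
    # supply them here if your setup needs them.
    opts.update(extra or {})
    return opts


def write_batch(spark, base_path, batch, operation, mode, extra=None):
    """Write one 4m-row batch of consecutive keys to the Hudi table."""
    start, end = key_range(batch)
    df = spark.range(start, end).withColumnRenamed("id", RECORD_KEY)
    (df.write.format("org.apache.hudi")
       .options(**hudi_options(operation, extra=extra))
       .mode(mode)
       .save(base_path))


def run_repro(spark, base_path, second_batch=0, extra=None):
    """second_batch=0 re-upserts the same keys; second_batch=1 upserts new keys."""
    write_batch(spark, base_path, 0, "bulk_insert", "overwrite", extra)
    write_batch(spark, base_path, second_batch, "upsert", "append", extra)
```

   Calling `run_repro(spark, path, second_batch=0)` corresponds to the slow case (same keys upserted again), and `second_batch=1` to the fast case (all new keys).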
   
   To answer your other queries: 
   * almost all of the time is spent in the {{HoodieSparkSqlWriter}} job and, 
within that job, in the {{count at HoodieSparkSqlWriter.scala}} stage (the 
BloomIndex parts run quickly).
   * it seems highly unlikely to be a resource-constraint issue with such a small 
example.
   
   Shall I raise a Jira for this? Or is this the expected behavior for such a 
workload?
