xiarixiaoyao commented on pull request #2673:
URL: https://github.com/apache/hudi/pull/2673#issuecomment-799905602


   @nsivabalan Yes. Because of company information-security policy, I 
cannot paste screenshots of the test results or the heap dump.
   Before the fix
   env: 50 executors, each with 4 cores and 8 GB of memory
   step1: merge(df, 800  , "hudikey", "testOOM", 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert")   time cost: 616s
   step2: merge(df, 800  , "hudikey", "testOOM1", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 710s
   step3: merge(df, 800  , "hudikey", "testOOM2", 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 676s
   step4: merge(df, 800  , "hudikey", "testOOM3", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 1077s
   step5: merge(df, 800  , "hudikey", "testOOM4", 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 1154s
   step6: merge(df, 800  , "hudikey", "testOOM5", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 2055s 
(some executors hit OOM)
   
   
   Dump analysis: more than 90 percent of memory is consumed by cached 
RDDs.
   
   After the fix (same env)
   step1: merge(df, 800  , "hudikey", "testOOM", 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert")   time cost: 632s
   step2: merge(df, 800  , "hudikey", "testOOM1", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 710s
   step3: merge(df, 800  , "hudikey", "testOOM2", 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 698s
   step4: merge(df, 800  , "hudikey", "testOOM3", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 723s
   step5: merge(df, 800  , "hudikey", "testOOM4", 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 616s
   step6: merge(df, 800  , "hudikey", "testOOM5", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 703s
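
To make the trend easier to see, here is a small sketch that puts the two runs side by side. The numbers are the time costs reported above; the object and value names are only illustrative. Before the fix the cost grows from step to step (total 6288s, step6 at 2055s); after the fix it stays between 616s and 723s (total 4082s).

```scala
object TimingComparison {
  // Time cost in seconds for step1..step6, copied from the runs above.
  val before: Vector[Int] = Vector(616, 710, 676, 1077, 1154, 2055)
  val after: Vector[Int]  = Vector(632, 710, 698, 723, 616, 703)

  // Ratio after/before per step; values below 1.0 mean the fixed build was faster.
  val ratios: Vector[Double] =
    before.zip(after).map { case (b, a) => a.toDouble / b }

  def main(args: Array[String]): Unit = {
    ratios.zipWithIndex.foreach { case (r, i) =>
      println(f"step${i + 1}: before=${before(i)}s after=${after(i)}s ratio=$r%.2f")
    }
    println(s"total: before=${before.sum}s after=${after.sum}s")
  }
}
```

The first three steps are roughly equal in both runs, so the fix does not appear to add overhead; the gap only opens once cached data has accumulated on the executors.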
   
   One last point: when we cache RDDs, we should unpersist them promptly 
once they are no longer needed. Spark can evict cached RDDs automatically, but 
when that happens is not deterministic.
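
As an illustration of the point above (this is not the code changed in this PR), one way to guarantee a timely unpersist is to scope the cache with try/finally. `CacheScope` and `withCached` are hypothetical names; `persist`, `unpersist` and `StorageLevel` are the standard Spark RDD API.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object CacheScope {
  // Hypothetical helper: persist `rdd`, run `body`, and always unpersist
  // afterwards, even if `body` throws.
  // Assumptions: `rdd` is not already persisted at a different storage level
  // (Spark rejects changing the level), and `body` runs its actions before
  // returning rather than returning a lazy RDD derived from the cached one.
  def withCached[T, R](rdd: RDD[T],
                       level: StorageLevel = StorageLevel.MEMORY_AND_DISK)
                      (body: RDD[T] => R): R = {
    val cached = rdd.persist(level)
    try body(cached)
    finally cached.unpersist(blocking = true) // free executor memory right away
  }
}
```

Usage, assuming an existing `SparkContext` named `sc`:

```scala
val evens = CacheScope.withCached(sc.parallelize(1 to 1000)) { r =>
  r.count()                      // first action materializes the cache
  r.filter(_ % 2 == 0).count()   // second action reuses it
}
```

With `blocking = true` the call waits until the blocks are removed, so the next write in a loop like the one tested above starts with that memory released.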


