[GitHub] [hudi] xiarixiaoyao commented on pull request #2673: [HUDI-1688] Uncache Rdd once write operation is complete

2021-03-17 Thread GitBox


xiarixiaoyao commented on pull request #2673:
URL: https://github.com/apache/hudi/pull/2673#issuecomment-801556436


   @rubenssoto Spark will uncache RDDs automatically in an LRU fashion, so if your program has enough memory there is no problem. In my program, Hudi cached too much data in memory, and the program hit OOM before Spark automatically cleaned the cached RDDs.
   @vinothchandar No leak of any sort occurs; it is simply that Spark's automatic cleaning does not keep up. I tested this in my environment: even if we do not use blocking=true, Spark can uncache the RDD in a very short time.
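   For reference, a minimal sketch of the explicit path (illustrative names and data, not from this PR): unpersist(blocking = false) returns immediately and the executors still drop the blocks shortly after, so the write path does not need a blocking call.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Illustrative only: explicit unpersist vs. waiting for Spark's LRU eviction.
val spark = SparkSession.builder().appName("uncache-sketch").getOrCreate()

val rdd = spark.sparkContext
  .parallelize(1 to 1000000)
  .persist(StorageLevel.MEMORY_AND_DISK)

rdd.count()   // stand-in for the work that actually consumes the cached RDD

// Non-blocking: the call returns right away; the blocks are still removed very quickly.
rdd.unpersist(blocking = false)

// Blocking: only needed if the caller must know the memory is already freed.
// rdd.unpersist(blocking = true)
```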
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on pull request #2673: [HUDI-1688] Uncache Rdd once write operation is complete

2021-03-15 Thread GitBox


xiarixiaoyao commented on pull request #2673:
URL: https://github.com/apache/hudi/pull/2673#issuecomment-799905602


   @nsivabalan  yes; due to company information security policy, I cannot paste screenshots of the test results or the heap dump. The steps and timings are below (a sketch of the assumed merge helper follows the two step lists).
   before fix
   env: 50 executors, each with 4 cores and 8 GB memory
   step1: merge(df, 800, "hudikey", "testOOM", DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 616s
   step2: merge(df, 800, "hudikey", "testOOM1", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 710s
   step3: merge(df, 800, "hudikey", "testOOM2", DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 676s
   step4: merge(df, 800, "hudikey", "testOOM3", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 1077s
   step5: merge(df, 800, "hudikey", "testOOM4", DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 1154s
   step6: merge(df, 800, "hudikey", "testOOM5", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 2055s (some executors OOM)
   
   
   Analyzing the heap dump, we found that more than 90 percent of the memory was consumed by cached RDDs.
   
   after fix
   step1: merge(df, 800, "hudikey", "testOOM", DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 632s
   step2: merge(df, 800, "hudikey", "testOOM1", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 710s
   step3: merge(df, 800, "hudikey", "testOOM2", DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 698s
   step4: merge(df, 800, "hudikey", "testOOM3", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 723s
   step5: merge(df, 800, "hudikey", "testOOM4", DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 616s
   step6: merge(df, 800, "hudikey", "testOOM5", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert")  time cost: 703s
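   For readers unfamiliar with the test helper above, a hedged sketch of what merge(df, 800, "hudikey", "testOOM", ...) plausibly does (the real helper, the precombine field, and the base path are not shown in this thread; the argument meanings below are assumptions):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical reconstruction of the test helper used in the step lists above.
// Assumed argument meanings: parallelism = 800 (shuffle parallelism),
// recordKeyField = "hudikey", tableName = "testOOM*", tableType = COW/MOR value,
// op = write operation ("insert").
def merge(df: DataFrame, parallelism: Int, recordKeyField: String,
          tableName: String, tableType: String, op: String): Unit = {
  df.write.format("hudi")
    .option("hoodie.datasource.write.table.type", tableType)
    .option("hoodie.datasource.write.operation", op)
    .option("hoodie.datasource.write.recordkey.field", recordKeyField)
    .option("hoodie.datasource.write.precombine.field", recordKeyField) // assumed
    .option("hoodie.table.name", tableName)
    .option("hoodie.insert.shuffle.parallelism", parallelism.toString)
    .option("hoodie.upsert.shuffle.parallelism", parallelism.toString)
    .mode(SaveMode.Append)
    .save(s"/tmp/$tableName") // assumed base path
}
```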
   
   One last point: when we cache RDDs, we should uncache them promptly once they are no longer used. Spark can uncache RDDs automatically, but the timing of that process is not deterministic.


