xiarixiaoyao commented on pull request #2673:
URL: https://github.com/apache/hudi/pull/2673#issuecomment-799905602
@nsivabalan Yes. Because of company information-security rules, I cannot
paste screenshots of the test results or the dump.
Before the fix
Environment: 50 executors, each with 4 cores and 8 GB of memory
step1: merge(df, 800, "hudikey", "testOOM",
  DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 616s
step2: merge(df, 800, "hudikey", "testOOM1",
  DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 710s
step3: merge(df, 800, "hudikey", "testOOM2",
  DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 676s
step4: merge(df, 800, "hudikey", "testOOM3",
  DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 1077s
step5: merge(df, 800, "hudikey", "testOOM4",
  DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 1154s
step6: merge(df, 800, "hudikey", "testOOM5",
  DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 2055s
(some executors hit OOM)
Dump analysis: more than 90 percent of the memory was consumed by cached
RDDs.
After the fix (same environment)
step1: merge(df, 800, "hudikey", "testOOM",
  DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 632s
step2: merge(df, 800, "hudikey", "testOOM1",
  DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 710s
step3: merge(df, 800, "hudikey", "testOOM2",
  DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 698s
step4: merge(df, 800, "hudikey", "testOOM3",
  DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 723s
step5: merge(df, 800, "hudikey", "testOOM4",
  DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 616s
step6: merge(df, 800, "hudikey", "testOOM5",
  DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 703s
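
To make the two runs easier to compare, here is a small sketch that computes
the before/after ratio for each step. The object and value names are made up
for illustration; the only inputs are the twelve timings listed above.

```scala
object TimingComparison {
  // Per-step write times in seconds, copied from the two runs above.
  val before: Seq[Int] = Seq(616, 710, 676, 1077, 1154, 2055)
  val after: Seq[Int]  = Seq(632, 710, 698, 723, 616, 703)

  // A ratio above 1 means the step was slower before the fix.
  def ratios: Seq[Double] =
    before.zip(after).map { case (b, a) => b.toDouble / a }

  def main(args: Array[String]): Unit = {
    ratios.zipWithIndex.foreach { case (r, i) =>
      println(f"step${i + 1}: before/after = $r%.2f")
    }
    println(s"total before = ${before.sum}s, total after = ${after.sum}s")
  }
}
```

The first three steps are within a few percent of each other in both runs.
The gap opens from step4 onward, and the totals are 6288s before the fix
versus 4082s after it.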
One last point: when we cache RDDs, we should unpersist them promptly once
they are no longer used. Spark can uncache RDDs automatically, but when that
happens is not deterministic.
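
As a rough illustration of that point (not the actual change in this PR),
the sketch below caches an RDD, reuses it for two actions, and then releases
it explicitly in a finally block. The object name, the local master setting
and the sample data are assumptions made for the example; persist, unpersist
and getPersistentRDDs are standard Spark APIs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  // Returns (record count, whether the RDD is still registered as cached).
  def run(): (Long, Boolean) = {
    val spark = SparkSession.builder()
      .appName("unpersist-sketch")
      .master("local[2]") // assumption: local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical stand-in for an intermediate RDD cached during a write.
    val cached = sc.parallelize(1 to 1000)
      .map(i => (i % 8, i))
      .persist(StorageLevel.MEMORY_AND_DISK)

    try {
      // Two actions reuse the cached data.
      val total = cached.count()
      cached.keys.distinct().count()

      // Release the blocks now instead of waiting for Spark's own cleanup.
      cached.unpersist(blocking = true)

      // If the RDD were still listed here, it would still hold storage memory.
      (total, sc.getPersistentRDDs.contains(cached.id))
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (total, stillCached) = run()
    println(s"records=$total stillCached=$stillCached")
  }
}
```

Calling unpersist as soon as the last consumer finishes keeps the memory
footprint bounded when many writes run back to back in one application, as
in the six steps above.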