xiarixiaoyao commented on pull request #2673: URL: https://github.com/apache/hudi/pull/2673#issuecomment-799905602
@nsivabalan Yes. Because of company information-security rules, I cannot paste screenshots of the test results or the heap dump, so here is a text summary instead.

before fix
env: (executor 4 core 8G)*50
step1: merge(df, 800, "hudikey", "testOOM", DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 616s
step2: merge(df, 800, "hudikey", "testOOM1", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 710s
step3: merge(df, 800, "hudikey", "testOOM2", DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 676s
step4: merge(df, 800, "hudikey", "testOOM3", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 1077s
step5: merge(df, 800, "hudikey", "testOOM4", DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 1154s
step6: merge(df, 800, "hudikey", "testOOM5", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 2055s (some executors hit OOM)

Dump analysis: more than 90 percent of the memory is consumed by cached RDDs.

after fix
step1: merge(df, 800, "hudikey", "testOOM", DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 632s
step2: merge(df, 800, "hudikey", "testOOM1", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 710s
step3: merge(df, 800, "hudikey", "testOOM2", DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 698s
step4: merge(df, 800, "hudikey", "testOOM3", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 723s
step5: merge(df, 800, "hudikey", "testOOM4", DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 616s
step6: merge(df, 800, "hudikey", "testOOM5", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert") time cost: 703s

One last point: when we cache RDDs, we should uncache them promptly once they are no longer used. Spark can uncache RDDs automatically, but when that happens is unpredictable.
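To illustrate the last point, here is a minimal Scala sketch of releasing a cached RDD explicitly instead of waiting for Spark to clean it up. It is not the change made in this pull request: `withCached` and `CachedRddCleanup` are hypothetical names invented for the example, and the local SparkSession and sample data are assumptions. Only `persist`, `unpersist`, and `getPersistentRDDs` are existing Spark APIs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachedRddCleanup {

  // Hypothetical helper (not part of Hudi or Spark): keep `rdd` cached only
  // while `body` runs, then free its blocks even if `body` throws.
  def withCached[T, R](rdd: RDD[T])(body: RDD[T] => R): R = {
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    try body(rdd)
    finally rdd.unpersist(blocking = true) // wait until the blocks are removed
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just to make the sketch runnable.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("cached-rdd-cleanup")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val records = sc.parallelize(1 to 1000).map(i => (s"key-$i", i))

      // Two actions reuse the cached data; the cache is dropped afterwards.
      val (count, total) = withCached(records) { cached =>
        (cached.count(), cached.map(_._2.toLong).reduce(_ + _))
      }
      println(s"count=$count total=$total")

      // After the helper returns, no RDD should still be marked as persistent.
      println(s"RDDs still cached: ${sc.getPersistentRDDs.size}")
    } finally {
      spark.stop()
    }
  }
}
```

Wrapping the work in `try`/`finally` means the cached blocks are freed on every code path, so repeated writes in one long-running application do not accumulate cached data on the executors.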