LantaoJin opened a new pull request #29869: URL: https://github.com/apache/spark/pull/29869
### What changes were proposed in this pull request?
Add a configuration `LISTENER_BUS_ALLOW_EXTERNAL_ACCUMULATORS_ENTER_EVENT` and update the external accumulators (those whose names do not start with `InternalAccumulator.METRICS_PREFIX`) before they enter `eventProcessLoop`. This prevents the driver from holding heavy accumulators in the event loop for a long time, so their memory can be released in a young GC.

### Why are the changes needed?
We use Spark + Delta Lake, and recently we found that our Spark driver hit a very heavy full-GC problem when users submitted a MERGE INTO query. The driver held over 100GB of memory (depending on how large the max heap size is set) that could never be collected. A heap dump revealed the root cause.

<img width="839" alt="Screen Shot 2020-09-25 at 11 35 01 AM" src="https://user-images.githubusercontent.com/1853780/94225514-c3159380-ff27-11ea-9b29-22bb4fa77533.png">

As the heap dump above shows, Delta uses a `SetAccumulator` to record the names of touched files:

```scala
// Accumulator to collect all the distinct touched files
val touchedFilesAccum = new SetAccumulator[String]()
spark.sparkContext.register(touchedFilesAccum, TOUCHED_FILES_ACCUM_NAME)

// UDF to record the names of touched files and add them to the accumulator
val recordTouchedFileName = udf { (fileName: String) => {
  touchedFilesAccum.add(fileName)
  1
}}.asNondeterministic()
```

In a big query, each task may hold thousands of file names, and if a stage contains tens of thousands of tasks, the DAGScheduler may hold millions of `CompletionEvent`s. Each `CompletionEvent` holds thousands of file names in its `accumUpdates`. All accumulator objects are delivered to the event loop via Spark listener events, and even a full GC cannot release the memory.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a UT
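The filtering idea described above can be sketched as follows. This is only an illustration of the approach, not the PR's actual patch: the names `AccumUpdate`, `stripExternalAccums`, and `MetricsPrefix` are hypothetical stand-ins (in Spark, `InternalAccumulator.METRICS_PREFIX` is `"internal.metrics."`).

```scala
// Sketch: before a completion event enters the scheduler event loop, drop the
// heavy payloads of external accumulators, i.e. those whose names do not start
// with the internal-metrics prefix. All names here are illustrative.
object ExternalAccumFilter {
  // Stand-in for InternalAccumulator.METRICS_PREFIX ("internal.metrics.")
  val MetricsPrefix = "internal.metrics."

  // Minimal stand-in for an accumulator update carried by a CompletionEvent.
  final case class AccumUpdate(name: Option[String], payload: Seq[String])

  // Keep internal metric updates as-is; replace external accumulator payloads
  // with an empty one so the queued event no longer pins large objects.
  def stripExternalAccums(updates: Seq[AccumUpdate]): Seq[AccumUpdate] =
    updates.map { u =>
      if (u.name.exists(_.startsWith(MetricsPrefix))) u
      else u.copy(payload = Seq.empty)
    }
}
```

With this shape, an update such as `AccumUpdate(Some("touchedFiles"), Seq("part-0001", ...))` would have its payload emptied before queueing, while `AccumUpdate(Some("internal.metrics.executorRunTime"), ...)` would pass through unchanged; the external accumulator's values would need to be merged into the driver-side accumulator first, which is the part this sketch elides.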