liupengcheng created SPARK-31202: ------------------------------------ Summary: Improve SizeEstimator for AppendOnlyMap Key: SPARK-31202 URL: https://issues.apache.org/jira/browse/SPARK-31202 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.2, 3.0.0 Reporter: liupengcheng
Currently, spark's memory management depends on the size estimation for execution and storage. In our real cluster, users always meet the issue OOM due to the inaccurate size estimation for ` AppendOnlyMap`, that's because spark stores KV in an Array[AnyRef] in `AppendOnlyMap` for memory locality, and this value can be CompactBuffer[_] or Array[CompactBuffer[_]] for transformation like cogroup/join/groupBy, but current `SizeEstimator` will still treat this special array as an normal array, so in many cases, we noticed a great bias between the estimated size and the acutal memory consuption. So we improved this in xiaomi: 1. Improve the estimation for AppendOnlyMap when the value type is CompactBuffer 2. Respect jvm gc stats to decide whether to spilling when doing sort/agg In this jira, I propose to solve the first part which is improving the estimation for `AppendOnlyMap` when the value type is CompactBuffer -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org