GitHub user mateiz opened a pull request: https://github.com/apache/spark/pull/1555
SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy & cogroup

JIRA: https://issues.apache.org/jira/browse/SPARK-2657

Our current code uses ArrayBuffers for each group of values in groupBy, as well as for the key's elements in CoGroupedRDD. ArrayBuffers have a lot of overhead if there are few values in them, which is likely to happen in cases such as join. In particular, they have a pointer to an Object[] of size 16 by default, which is 24 bytes for the array header + 128 for the pointers in there, plus at least 32 for the ArrayBuffer data structure.

This patch replaces the per-group buffers with a CompactBuffer class that can store up to 2 elements more efficiently (in fields of itself) and acts like an ArrayBuffer beyond that. For a key's elements in CoGroupedRDD, we use an Array of CompactBuffers instead of an ArrayBuffer of ArrayBuffers.

There are some changes throughout the code to deal with CoGroupedRDD returning an Array instead. We could also decide not to do that, but CoGroupedRDD is a `@DeveloperApi`, so I think it's okay to change it here.
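To illustrate the idea, here is a minimal sketch of a buffer that keeps its first two elements in its own fields and only allocates backing storage from the third element onward. This is not the CompactBuffer class from the patch: the name `SmallBuffer`, its fields, and the choice of an ArrayBuffer for the overflow storage are assumptions made for illustration. By the figures quoted above, a default ArrayBuffer costs at least 24 + 128 + 32 = 184 bytes even when it holds one or two values, which is the case this layout targets.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch, not the class added by this pull request.
// The first two elements live in fields of the object itself; a backing
// ArrayBuffer is allocated only once a third element is appended.
final class SmallBuffer[T] extends Iterable[T] {
  private var element0: T = null.asInstanceOf[T]
  private var element1: T = null.asInstanceOf[T]
  private var curSize = 0
  // Stays null for buffers holding at most two elements.
  private var overflow: ArrayBuffer[T] = null

  // Positional access: indices 0 and 1 read the fields, the rest read overflow.
  def apply(i: Int): T = {
    if (i < 0 || i >= curSize) throw new IndexOutOfBoundsException(i.toString)
    if (i == 0) element0
    else if (i == 1) element1
    else overflow(i - 2)
  }

  // Append one value, allocating the overflow storage lazily.
  def +=(value: T): this.type = {
    if (curSize == 0) {
      element0 = value
    } else if (curSize == 1) {
      element1 = value
    } else {
      if (overflow == null) overflow = new ArrayBuffer[T]()
      overflow += value
    }
    curSize += 1
    this
  }

  override def size: Int = curSize

  // Iterate in insertion order by walking the indices.
  def iterator: Iterator[T] = new Iterator[T] {
    private var pos = 0
    def hasNext: Boolean = pos < curSize
    def next(): T = {
      if (!hasNext) throw new NoSuchElementException("next on empty iterator")
      val value = SmallBuffer.this.apply(pos)
      pos += 1
      value
    }
  }
}
```

With this layout, a group that ends up with one or two values (the common case in a join) never allocates an array at all, while larger groups fall back to ordinary growable storage.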
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mateiz/spark compact-groupby

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1555.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1555

----
commit 10f0de1ee86563b5bec6c8f1270a8198d6449393
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T22:36:45Z

    A CompactBuffer that's more memory-efficient than ArrayBuffer for small buffers

commit ed577ab3fa50de0ed1bd21eae43013ffa6dac51c
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T22:37:31Z

    Use CompactBuffer in groupByKey

commit 9b4c6e811159857c075528dab02f6c4db7688dde
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T23:05:14Z

    Use CompactBuffer in CoGroupedRDD

commit 775110fa6124e090c0aeed6baf7a408be3f30f9a
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T23:17:12Z

    Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers

    CoGroupedRDD is a @DeveloperApi but this seemed worthwhile.

commit 197cde8dccb4c7dee1c9e6e9460b221988083d9b
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T23:41:27Z

    Make CompactBuffer extend Seq to make its toSeq more efficient

----