GitHub user mateiz opened a pull request:

    https://github.com/apache/spark/pull/1555

    SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy & 
cogroup

    JIRA: https://issues.apache.org/jira/browse/SPARK-2657
    
    Our current code uses an ArrayBuffer for each group of values in groupBy, as 
well as for each key's elements in CoGroupedRDD. ArrayBuffers carry a lot of 
overhead when they hold only a few values, which is likely in cases such as 
join. In particular, each one points to an Object[] of size 16 by default, 
which costs 24 bytes for the array header + 128 for the pointers in it, plus 
at least 32 for the ArrayBuffer object itself, so at least 184 bytes per 
group. This patch replaces the per-group buffers with a CompactBuffer class 
that stores up to 2 elements more efficiently (in fields of itself) and acts 
like an ArrayBuffer beyond that. For a key's elements in CoGroupedRDD, we use 
an Array of CompactBuffers instead of an ArrayBuffer of ArrayBuffers.
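
To make the idea concrete, here is a minimal sketch of the overhead arithmetic and of a buffer that keeps its first two elements in its own fields. It is an illustration only, not the class from this patch: the names `ArrayBufferOverhead` and `CompactBufferSketch` are hypothetical, the 8-byte pointer size is inferred from the 128-bytes-for-16-slots figure above, and falling back to a plain ArrayBuffer for the third element onward is a simplification of "acts like an ArrayBuffer beyond that".

```scala
import scala.collection.mutable.ArrayBuffer

// Byte counts quoted in the description; pointerBytes is inferred (128 / 16).
object ArrayBufferOverhead {
  val arrayHeaderBytes = 24
  val defaultCapacity = 16
  val pointerBytes = 8
  val arrayBufferObjectBytes = 32 // "at least 32" in the description

  // Lower bound on the footprint of one ArrayBuffer, before storing anything.
  val minTotalBytes: Int =
    arrayHeaderBytes + defaultCapacity * pointerBytes + arrayBufferObjectBytes
}

// Hypothetical sketch: the first two elements live in fields of the buffer
// itself, so small groups never allocate a backing array.
final class CompactBufferSketch[T] {
  private var element0: T = null.asInstanceOf[T]
  private var element1: T = null.asInstanceOf[T]
  private var overflow: ArrayBuffer[T] = null // allocated on the 3rd element
  private var curSize = 0

  def size: Int = curSize

  def +=(value: T): this.type = {
    curSize match {
      case 0 => element0 = value
      case 1 => element1 = value
      case _ =>
        if (overflow == null) overflow = new ArrayBuffer[T]()
        overflow += value
    }
    curSize += 1
    this
  }

  def apply(i: Int): T = {
    if (i < 0 || i >= curSize) throw new IndexOutOfBoundsException(i.toString)
    if (i == 0) element0
    else if (i == 1) element1
    else overflow(i - 2)
  }

  def iterator: Iterator[T] = Iterator.range(0, curSize).map(i => apply(i))
}
```

With one or two values per key, which is the join case described above, such a buffer costs a single small object instead of an ArrayBuffer plus a 16-slot array.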
    
    A few changes throughout the code deal with CoGroupedRDD now returning an 
Array (`(K, Array[Iterable[_]])`). We could also decide not to do that, but 
CoGroupedRDD is a `@DeveloperApi`, so I think it's okay to change it here.
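
For callers, the visible effect is the record shape. The sketch below uses plain Scala with no Spark dependency to show how code consuming a `(K, Array[Iterable[_]])` record might pair up values for a two-way join. `CoGroupShape` and `joinPairs` are hypothetical names, and the convention that the array holds one entry per parent, in parent order, is an assumption made for this illustration.

```scala
object CoGroupShape {
  // Assumed layout: one Iterable per parent RDD, holding that parent's
  // values for the key.
  type CoGrouped[K] = (K, Array[Iterable[_]])

  // Hypothetical helper: inner-join style pairing for exactly two parents.
  // The casts are needed because the element type is erased to Iterable[_].
  def joinPairs[K, V, W](record: CoGrouped[K]): Seq[(K, (V, W))] = {
    val (key, groups) = record
    val left = groups(0).asInstanceOf[Iterable[V]]
    val right = groups(1).asInstanceOf[Iterable[W]]
    (for (v <- left; w <- right) yield (key, (v, w))).toSeq
  }
}
```

Indexing into the Array directly avoids wrapping each group in another collection, which is the motivation given in the "avoid wrappers" commit.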

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mateiz/spark compact-groupby

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1555.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1555
    
----
commit 10f0de1ee86563b5bec6c8f1270a8198d6449393
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T22:36:45Z

    A CompactBuffer that's more memory-efficient than ArrayBuffer for small 
buffers

commit ed577ab3fa50de0ed1bd21eae43013ffa6dac51c
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T22:37:31Z

    Use CompactBuffer in groupByKey

commit 9b4c6e811159857c075528dab02f6c4db7688dde
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T23:05:14Z

    Use CompactBuffer in CoGroupedRDD

commit 775110fa6124e090c0aeed6baf7a408be3f30f9a
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T23:17:12Z

    Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers
    
    CoGroupedRDD is a @DeveloperApi but this seemed worthwhile.

commit 197cde8dccb4c7dee1c9e6e9460b221988083d9b
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T23:41:27Z

    Make CompactBuffer extend Seq to make its toSeq more efficient

----


