[ https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999117#comment-14999117 ]

Daniel Lemire commented on SPARK-11583:
---------------------------------------

> When a set is nearly full, RoaringBitmap does not automatically invert the 
> bits in order to minimize space. 

The Roaring implementation in Lucene inverts bits to minimize space, as 
described here:

https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/util/RoaringDocIdSet.java

The RoaringBitmap library which we produced does not. However, it does something 
similar upon request.

You might want to try...

     x.flip(0, 1001);     // invert the bits in [0, 1001)
     x.runOptimize();     // convert to run-length containers where smaller
     x.getSizeInBytes();  // now reports a much smaller size

The call to runOptimize should significantly reduce memory usage in this case. 


The intention is that users should call "runOptimize" once their bitmap has 
been created and is no longer expected to change frequently. So "runOptimize" 
should always be called prior to serialization.
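To illustrate the underlying idea, here is a hand-rolled sketch using plain java.util.BitSet (this is NOT the RoaringBitmap API; the class name InvertedSet is invented for illustration): a nearly-full set can be stored as the complement of its missing elements, which is the kind of space saving that runOptimize achieves automatically via run-length containers.

```java
import java.util.BitSet;

// Sketch: store either the set or its complement, whichever has fewer
// set bits. A set with 1000 of 1001 bits set is stored as a single bit.
public final class InvertedSet {
    final BitSet bits;      // either the set itself or its complement
    final boolean inverted; // true if 'bits' holds the complement
    final int universe;     // elements are drawn from [0, universe)

    public InvertedSet(BitSet set, int universe) {
        BitSet complement = (BitSet) set.clone();
        complement.flip(0, universe);
        this.inverted = complement.cardinality() < set.cardinality();
        this.bits = inverted ? complement : (BitSet) set.clone();
        this.universe = universe;
    }

    public boolean contains(int i) {
        return bits.get(i) != inverted; // XOR undoes the inversion
    }

    public int storedCardinality() {
        return bits.cardinality(); // bits actually kept in memory
    }
}
```

With 1000 of 1001 bits set, only the single missing bit ends up stored.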


> Make MapStatus use less memory usage
> ------------------------------------
>
>                 Key: SPARK-11583
>                 URL: https://issues.apache.org/jira/browse/SPARK-11583
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>            Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a Spark job that contains quite a lot of tasks, 20% seems a drop in the 
> ocean. Essentially, BitSet uses a long[]; for example, a BitSet of 200k bits 
> is backed by a long[3125].
> So we can use a HashSet[Int] to store reduceIds instead (when non-empty 
> blocks are dense, store the reduceIds of empty blocks; when sparse, store 
> the non-empty ones):
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks.
> For sparse cases: if HashSet[Int](numEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks.
> Sparse case (299/300 blocks are empty):
> sc.makeRDD(1 to 30000, 3000).groupBy(x=>x).top(5)
> Dense case (no block is empty):
> sc.makeRDD(1 to 9000000, 3000).groupBy(x=>x).top(5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
