[ https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999185#comment-14999185 ]

Imran Rashid commented on SPARK-11583:
--------------------------------------

thanks [~lemire].  So I guess that means that option 2, going back to 
RoaringBitmap, would really just require us to insert a call to 
{{runOptimize}}; no need for us to worry about whether or not to flip the 
bits ourselves.
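
Concretely, all I have in mind is something like the sketch below (a minimal 
illustration of the RoaringBitmap calls, not the actual MapStatus code; the 
block pattern and counts are made up):

{code:scala}
import org.roaringbitmap.RoaringBitmap

// made-up pattern: all but one in every 300 blocks are non-empty
val nonEmptyBlockIds = (0 until 30000).filter(_ % 300 != 0)

// track non-empty blocks as before; roaring picks array vs. bitmap
// containers on its own as bits are added
val bits = new RoaringBitmap()
nonEmptyBlockIds.foreach(i => bits.add(i))

// the one extra call: compress dense/regular regions into run
// containers before the MapStatus gets serialized
bits.runOptimize()

println(s"serialized size: ${bits.serializedSizeInBytes()} bytes")
{code}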

For the case of relatively full blocks, does the 20% overhead seem like a 
reasonable amount to you?  That sounds super-high to me.  If I understand 
correctly, the worst case for extra overhead is when all the blocks are 
dense?  Furthermore, it seems like roaring will actually save space unless 
the maximum element is exactly {{2^n - 1}}.  Otherwise, the roaring bitmap 
will still be smaller because it can make the final block smaller.  (Well, I 
guess the condition isn't "exactly", it's a bit more subtle, but without 
going into too many specifics ...)  I actually tried with 2^17 - 1, and a 
{{java.util.BitSet}} saved less than 1% over roaring.
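
(For reference, that comparison can be reproduced with something along these 
lines; it counts only the raw bit storage, not JVM object overhead, and the 
setup is illustrative rather than my exact test:)

{code:scala}
import org.roaringbitmap.RoaringBitmap

val n = (1 << 17) - 1                         // maximum element just under 2^17

// java.util.BitSet: one long per 64 bits, rounded up
val bs = new java.util.BitSet()
bs.set(0, n)                                  // fully dense: all of [0, n) set
val bitSetBytes = bs.toLongArray.length * 8   // 2048 longs = 16384 bytes

// RoaringBitmap over the same range, added bit by bit so the containers
// stay plain bitmap containers (no run compression yet)
val rb = new RoaringBitmap()
(0 until n).foreach(i => rb.add(i))
val roaringBytes = rb.serializedSizeInBytes() // two ~8 KB bitmap containers + header

// the BitSet wins here, but by well under 1%
println(s"BitSet: $bitSetBytes bytes, Roaring: $roaringBytes bytes")

// on perfectly contiguous data like this, runOptimize collapses each
// container to a single run and roaring becomes far smaller than the BitSet
rb.runOptimize()
println(s"Roaring after runOptimize: ${rb.serializedSizeInBytes()} bytes")
{code}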

So I'm now more inclined to stick with RoaringBitmap and let it do its job 
(just making sure we use it correctly by adding a call to {{runOptimize}}).

> Make MapStatus use less memory usage
> -------------------------------------
>
>                 Key: SPARK-11583
>                 URL: https://issues.apache.org/jira/browse/SPARK-11583
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>            Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using a BitSet can save ≈20% memory compared to RoaringBitmap. 
> For a Spark job that contains quite a lot of tasks, though, 20% seems like a 
> drop in the ocean. 
> Essentially, a BitSet is backed by a long[]; for example, a BitSet over 200k 
> blocks is a long[3125], i.e. 25,000 bytes.
> So we can use a HashSet[Int] to store reduceIds instead (when non-empty 
> blocks are dense, store the reduceIds of the empty blocks; when they are 
> sparse, store the non-empty ones):
> For dense cases (most blocks non-empty): if HashSet[Int](numEmptyBlocks).size 
> < BitSet[totalBlockNum], use MapStatusTrackingEmptyBlocks (track the few 
> empty blocks).
> For sparse cases (most blocks empty): if HashSet[Int](numNonEmptyBlocks).size 
> < BitSet[totalBlockNum], use MapStatusTrackingNoEmptyBlocks (track the few 
> non-empty blocks).
> A sparse case (299/300 blocks empty):
> sc.makeRDD(1 to 30000, 3000).groupBy(x=>x).top(5)
> A dense case (no block is empty):
> sc.makeRDD(1 to 9000000, 3000).groupBy(x=>x).top(5)
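> A rough sketch of the proposed selection logic (illustrative only; the class 
> names are the proposed ones above, and the per-entry HashSet cost is a rough 
> JVM estimate, not a measured value):
> {code:scala}
> // pick the cheapest representation for one MapStatus
> def chooseTracker(totalBlocks: Int, numNonEmpty: Int): String = {
>   val numEmpty    = totalBlocks - numNonEmpty
>   val bitSetBytes = ((totalBlocks + 63) / 64) * 8L  // the long[] behind a BitSet
>   val perEntry    = 48L                             // rough bytes per boxed Int in a HashSet
>   if (numEmpty * perEntry < bitSetBytes)            // dense: few empty blocks, track those
>     "MapStatusTrackingEmptyBlocks"
>   else if (numNonEmpty * perEntry < bitSetBytes)    // sparse: few non-empty blocks, track those
>     "MapStatusTrackingNoEmptyBlocks"
>   else
>     "BitSet-based MapStatus"                        // neither HashSet beats the plain BitSet
> }
> {code}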


