[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999142#comment-14999142
 ] 

Daniel Lemire commented on SPARK-11583:
---------------------------------------

[~Qin Yao] wrote: 

"RoaringBitmap is the same as the BitSet we now use in HighlyCompressedMapStatus, 
but takes about 20% more memory than BitSet. Neither is compressed in-memory. 
According to the annotations of the former Roaring-based 
HighlyCompressedMapStatus, it can be compressed during serialization, not 
in-memory."


I think that's a misunderstanding.

Lucene and Apache Kylin use Roaring for in-memory bitmaps, and it saves a ton 
of memory. Druid uses them for memory-mapped bitmaps, and it compresses well.

If you do flips, then it is possible that Roaring ends up being inefficient. 
Lucene has one approach to that; in RoaringBitmap, we offer the "runOptimize" 
function.

> Make MapStatus use less memory
> ------------------------------
>
>                 Key: SPARK-11583
>                 URL: https://issues.apache.org/jira/browse/SPARK-11583
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>            Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save about 20% memory compared to RoaringBitmap. 
> For a Spark job that contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses a long[]; for example, a BitSet over 200k bits is a long[3125].
> So we can use a HashSet[Int] to store reduceIds instead (when non-empty blocks are 
> dense, store the reduceIds of the empty blocks; when sparse, store the non-empty ones).
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size < 
> BitSet[totalBlockNum], use MapStatusTrackingNoEmptyBlocks.
> For sparse cases: if HashSet[Int](numEmptyBlocks).size < 
> BitSet[totalBlockNum], use MapStatusTrackingEmptyBlocks.
> Sparse case (299/300 blocks are empty):
> sc.makeRDD(1 to 30000, 3000).groupBy(x => x).top(5)
> Dense case (no block is empty):
> sc.makeRDD(1 to 9000000, 3000).groupBy(x => x).top(5)
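The sizing tradeoff in the description above can be sketched in plain Scala. This is a hypothetical illustration, not Spark's actual MapStatus code: the object name MapStatusSizing and both helper functions are invented here. It shows the fixed cost of a bit set (one long per 64 blocks, so 200k blocks -> long[3125]) versus storing only whichever side of the empty/non-empty split is smaller.

```scala
import scala.collection.immutable.HashSet

// Hypothetical sizing sketch (names invented; not Spark's MapStatus code).
object MapStatusSizing {
  // A long[]-backed bit set over `totalBlocks` bits needs this many longs,
  // regardless of how many bits are actually set.
  def bitSetLongs(totalBlocks: Int): Int = (totalBlocks + 63) / 64

  // A HashSet[Int] only pays per stored element, so track whichever side
  // (empty or non-empty reduceIds) has fewer members.
  def smallerSide(emptyIds: HashSet[Int], nonEmptyIds: HashSet[Int]): HashSet[Int] =
    if (emptyIds.size <= nonEmptyIds.size) emptyIds else nonEmptyIds
}
```

For example, MapStatusSizing.bitSetLongs(200000) gives 3125, matching the BitSet[200k] = long[3125] figure in the description, while a nearly-full (or nearly-empty) block map only needs a small HashSet for the minority side.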



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
