[ https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003084#comment-15003084 ]

Imran Rashid commented on SPARK-11583:
--------------------------------------

[~lemire] I thought Kent Yao's analysis was pretty reasonable (with a few 
additional suggestions I'll make below).  But it sounds like that isn't what 
you were expecting -- can you suggest what else you have in mind?  I think it 
would be better to do the same thing directly on the bitsets, without the 
additional complexity from Spark in the middle, but in general that is what I 
was looking for.  The truth of the matter is that, since Spark is meant to be 
very general purpose, we really can't say anything specific about what the 
data in these bitsets will be.  They might be very sparse; they might be very 
dense; there isn't any particular reason to suspect they'd be organized into 
runs (more than you'd expect by random chance).  I understand that one format 
won't be better in all cases, but what I'm looking for is the format which 
works well for a wide range of loads, *and* that doesn't carry too big a 
penalty in the worst case.  E.g., if the most memory-intensive cases use 20% 
more memory in roaring than in a plain bitset, that is strong evidence against 
roaring, even if it's much better in other cases.  OTOH, if it's 1% in those 
cases, and 50% better in others, then we should probably still go with roaring. 
 From the analysis I did above, it was less than 1% overhead for alternating 
bits, which I thought would be the worst case.  Are there other cases you think 
are important to look at in particular?
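
For concreteness, here is the kind of direct measurement I have in mind for 
the alternating-bit pattern -- just a sketch against the 0.5.x RoaringBitmap 
API, with the sizes and patterns open to debate:

{code:scala}
import org.roaringbitmap.RoaringBitmap

// Compare a plain long[]-backed bitset against a RoaringBitmap for n
// blocks with alternating bits (the presumed worst case for roaring).
val n = 200000

// A plain bitset over n bits is just ceil(n/64) longs, 8 bytes each.
val plainBitsetBytes = ((n + 63) / 64) * 8L

val roaring = new RoaringBitmap()
(0 until n by 2).foreach(roaring.add)  // set every other bit
roaring.runOptimize()                  // switch to run containers where cheaper

println(s"plain bitset:   $plainBitsetBytes bytes")
println(s"roaring bitmap: ${roaring.serializedSizeInBytes()} bytes")
{code}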

[~Qin Yao] thanks so much for that analysis.  That seems to agree with our 
expectations from the analysis above.  The one case in which the roaring bitmap 
is much worse can be explained by the lack of a call to {{runOptimize}}.  
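
(For anyone following along, the effect is easy to see in isolation -- again 
just a sketch, assuming the 0.5.x API:)

{code:scala}
import org.roaringbitmap.RoaringBitmap

// A single long run of set bits: before runOptimize it sits in ordinary
// array/bitmap containers; afterwards it collapses to a run container.
val rb = new RoaringBitmap()
(0 until 100000).foreach(rb.add)
val before = rb.serializedSizeInBytes()
rb.runOptimize()  // returns true if any container was converted
val after = rb.serializedSizeInBytes()
println(s"before: $before bytes, after: $after bytes")
{code}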

Can I ask you to take your analysis a step further?
(a) use RoaringBitmap 0.5.10
(b) I think we'd be better off if you just looked at the bitsets directly, 
rather than going through Spark.  For one thing, it's kind of confusing to 
talk about dense outputs from Spark leading to sparse bitmaps, because the 
code records the *empty* blocks in the bitmap.
(c) Given Daniel's explanation above of {{runOptimize}}, can you be sure to 
call {{runOptimize}} after adding all the bits to the roaring bitmap?  That 
would be the equivalent of adding it here in the Spark 1.5 code:
https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L197
(d) Just to eliminate any doubts -- can you check the memory usage from the 
roaring bitmap with the bits reversed?  I understand that it *should* be the 
same, but we might as well double check.
(e) can you add a case for alternating bits (and whatever else Daniel suggests 
as worst cases)?  A rough sketch of the kind of harness I'm picturing is below.
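
Something along these lines -- a sketch only, with the exact patterns and 
block counts up for discussion (it assumes the 0.5.x API):

{code:scala}
import org.roaringbitmap.RoaringBitmap

object BitmapSizeBench {
  // Bytes for a plain long[]-backed bitset over n bits.
  def plainBytes(n: Int): Long = ((n + 63) / 64) * 8L

  // Serialized roaring size for a given set of bits, with runOptimize
  // applied as in (c).
  def roaringBytes(bits: Seq[Int]): Int = {
    val rb = new RoaringBitmap()
    bits.foreach(rb.add)
    rb.runOptimize()
    rb.serializedSizeInBytes()
  }

  def main(args: Array[String]): Unit = {
    val n = 200000  // total blocks; adjust as needed
    val cases: Seq[(String, Seq[Int])] = Seq(
      "sparse (1 in 300 set)" -> (0 until n by 300),
      "dense (all set)"       -> (0 until n),
      "alternating"           -> (0 until n by 2),               // (e)
      "alternating, reversed" -> (0 until n by 2).map(n - 1 - _) // (d)
    )
    println(s"plain bitset: ${plainBytes(n)} bytes")
    for ((name, bits) <- cases) {
      println(f"$name%-22s roaring = ${roaringBytes(bits)}%d bytes")
    }
  }
}
{code}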

In my mind, that should settle things.  It would be great to have the code for 
that analysis attached here if possible, in case we ever revisit this.  I have 
every reason to believe it will confirm that we should be using Roaring (with 
a call to {{runOptimize}} added), but let's just make the case very clear.

> Make MapStatus use less memory
> ------------------------------
>
>                 Key: SPARK-11583
>                 URL: https://issues.apache.org/jira/browse/SPARK-11583
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>            Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitmap. 
> For a Spark job that contains quite a lot of tasks, 20% seems a drop in the 
> ocean. 
> Essentially, BitSet uses a long[]; for example, a BitSet over 200k bits is a 
> long[3125].
> So we can use a HashSet[Int] to store reduceIds (when non-empty blocks are 
> dense, store the reduceIds of the empty blocks; when they are sparse, store 
> the non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks.
> For sparse cases: if HashSet[Int](numEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks.
> Sparse case (299/300 blocks empty):
> sc.makeRDD(1 to 30000, 3000).groupBy(x=>x).top(5)
> Dense case (no block is empty):
> sc.makeRDD(1 to 9000000, 3000).groupBy(x=>x).top(5)
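
For reference, the tradeoff described in the quoted report can be 
sanity-checked with back-of-the-envelope numbers; the 40-bytes-per-entry 
figure below is an assumed JVM overhead for a boxed Int in a hash set, not a 
measurement:

{code:scala}
// A plain bitset costs ~n/8 bytes regardless of density, while a
// HashSet[Int] costs roughly a fixed number of bytes per stored id.
// The 40-byte figure is an assumed per-entry JVM overhead (boxed
// Integer plus table slot), not a measured value.
val bytesPerHashEntry = 40L

def bitsetBytes(totalBlocks: Int): Long = ((totalBlocks + 63) / 64) * 8L

val total = 200000
println(s"bitset: ${bitsetBytes(total)} bytes")  // 25000 bytes for 200k blocks
// The HashSet only wins while the stored side (empty or non-empty
// blocks, whichever is smaller) stays below roughly this many ids:
println(s"crossover: ~${bitsetBytes(total) / bytesPerHashEntry} ids")  // ~625
{code}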


