[ https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999848#comment-14999848 ]

Imran Rashid commented on SPARK-11583:
--------------------------------------

Hi [~Qin Yao] -- unfortunately that example is OOMing for a different reason.  
The scala shell is just trying to call {{toString}} on the roaring bitmap 200k 
times and build one enormous string from the results.  It works for the BitSet 
only because {{BitSet.toString()}} returns a short string -- the default 
object ID -- instead of a full list of the set bits.
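
You can see the difference directly in the shell.  A small illustration 
(assuming Spark's {{org.apache.spark.util.collection.BitSet}} and 
{{org.roaringbitmap.RoaringBitmap}} are both importable there):

{noformat}
import org.apache.spark.util.collection.BitSet
import org.roaringbitmap.RoaringBitmap

val bs = new BitSet(200000)
(0 until 200000).foreach(i => bs.set(i))
val rb = new RoaringBitmap()
(0 until 200000).foreach(i => rb.add(i))

bs.toString.length  // short: the default Object.toString (class name + hash)
rb.toString.length  // huge: lists every one of the 200000 set bits
{noformat}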

Also, that test doesn't really test creating 200k bitsets -- the array is just 
storing 200k references to the same object.  (You can see, in the toString of 
the bitset array, that the same object ID is repeated over and over.)  You need 
to do something more like

{noformat}
import org.apache.spark.util.collection.BitSet

// Build 200k *distinct* bitsets, each with all 200k bits set.
val ar = Array.fill(200000) {
  val bs = new BitSet(200000)
  for (i <- 0 to 199999) bs.set(i)
  bs
}
{noformat}

Similarly for the roaring bitmap.  Though there really isn't any need to wrap 
it in an array -- we are just interested in the memory footprint of one object, 
which you can measure without triggering an OOM, e.g. with {{jmap}}.  
(Incidentally, both versions lead to an OOM for me when you actually create 
200k different objects.)
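
For reference, a minimal sketch of the analogous test for the roaring bitmap 
(assuming {{org.roaringbitmap.RoaringBitmap}} is on the shell classpath):

{noformat}
import org.roaringbitmap.RoaringBitmap

// Build 200k *distinct* roaring bitmaps, each with 200k set bits.
val ar = Array.fill(200000) {
  val rb = new RoaringBitmap()
  for (i <- 0 to 199999) rb.add(i)
  rb
}
{noformat}

To measure a single object instead, build just one bitmap and run 
{{jmap -histo <shell-pid>}} against the shell's JVM: the per-class byte counts 
give you the (shallow) footprint without ever going OOM.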

> Make MapStatus use less memory usage
> ------------------------------------
>
>                 Key: SPARK-11583
>                 URL: https://issues.apache.org/jira/browse/SPARK-11583
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>            Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitmap. 
> For a Spark job containing quite a lot of tasks, 20% seems like a drop in 
> the ocean. Essentially, BitSet is backed by a long[]; for example, a BitSet 
> over 200k blocks is a long[3125] (200000 bits / 64 bits per long = 3125 longs).
> So we can use a HashSet[Int] to store reduce IDs instead: when the non-empty 
> blocks are dense, store the reduce IDs of the empty blocks; when they are 
> sparse, store the non-empty ones.
> For the dense case: if HashSet[Int](numEmptyBlocks).size < 
> BitSet[totalBlockNum].size, I use MapStatusTrackingEmptyBlocks.
> For the sparse case: if HashSet[Int](numNonEmptyBlocks).size < 
> BitSet[totalBlockNum].size, I use MapStatusTrackingNoEmptyBlocks.
> Sparse case (299/300 blocks are empty):
> sc.makeRDD(1 to 30000, 3000).groupBy(x=>x).top(5)
> Dense case (no block is empty):
> sc.makeRDD(1 to 9000000, 3000).groupBy(x=>x).top(5)
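> A rough sketch of this dense/sparse selection (a hypothetical helper with an 
> assumed per-entry HashSet cost, for illustration only -- not the actual patch):
> {noformat}
> // Hypothetical sketch: choose the cheaper representation for one MapStatus.
> def chooseTracking(totalBlocks: Int, numNonEmpty: Int): String = {
>   // e.g. 200000 bits -> 3125 longs -> 25000 bytes
>   val bitSetBytes = ((totalBlocks + 63) / 64) * 8L
>   val bytesPerHashSetEntry = 40L  // assumed rough JVM overhead per Int entry
>   val numEmpty = totalBlocks - numNonEmpty
>   if (numEmpty * bytesPerHashSetEntry < bitSetBytes) "track empty blocks (dense)"
>   else if (numNonEmpty * bytesPerHashSetEntry < bitSetBytes) "track non-empty blocks (sparse)"
>   else "fall back to BitSet"
> }
> {noformat}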


