[ https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999185#comment-14999185 ]
Imran Rashid commented on SPARK-11583:
--------------------------------------

Thanks [~lemire]. So I guess that means that option 2, going back to RoaringBitmap, would really just require us to insert a call to {{runOptimize}}; there is no need for us to worry about whether or not to flip the bits ourselves.

For the case of relatively full blocks, does the 20% overhead seem like a reasonable amount to you? That sounds super-high to me. If I understand correctly, the worst case for extra overhead is when all the blocks are dense?

Furthermore, it seems like roaring will actually save space unless the maximum element is exactly {{2^n - 1}}; otherwise, the roaring bitmap will still be smaller because it can make the final block smaller. (Well, I guess the condition isn't "exactly", it's a bit more subtle, but without going into too many specifics...) I actually tried with 2^17 - 1, and a {{java.util.BitSet}} saved less than 1% over roaring. So I'm now more inclined to stick with RoaringBitmap and let it do its job, and just make sure we use it correctly by adding a call to {{runOptimize}}. (A minimal sketch of both points follows the quoted description below.)

> Make MapStatus use less memory
> ------------------------------
>
>                 Key: SPARK-11583
>                 URL: https://issues.apache.org/jira/browse/SPARK-11583
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>            Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I said, using BitSet can save ≈20% memory compared to RoaringBitmap. For a Spark job that contains quite a lot of tasks, 20% is far from a drop in the ocean.
> Essentially, a BitSet is backed by a long[]; for example, a BitSet of 200k bits is a long[3125].
> So we can instead use a HashSet[Int] to store reduce IDs: when non-empty blocks are dense, store the reduce IDs of the empty blocks; when they are sparse, store the non-empty ones.
> For dense cases: if HashSet[Int](numEmptyBlocks).size < BitSet[totalBlockNum], use MapStatusTrackingEmptyBlocks.
> For sparse cases: if HashSet[Int](numNonEmptyBlocks).size < BitSet[totalBlockNum], use MapStatusTrackingNoEmptyBlocks.
>
> Sparse case, 299/300 blocks are empty:
> sc.makeRDD(1 to 30000, 3000).groupBy(x => x).top(5)
> Dense case, no block is empty:
> sc.makeRDD(1 to 9000000, 3000).groupBy(x => x).top(5)
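For concreteness, here is a minimal, self-contained sketch of the two points in the comment above: the {{runOptimize}} call, and the size comparison against {{java.util.BitSet}} at 2^17 - 1 blocks. This is illustrative code, not Spark's actual MapStatus implementation; it only assumes the RoaringBitmap library is on the classpath.

{code:scala}
import java.util.BitSet

import org.roaringbitmap.RoaringBitmap

// Minimal sketch (not Spark's MapStatus code): build a fully dense bitmap
// of block IDs, compress it with runOptimize, and compare its serialized
// size against a java.util.BitSet of the same length.
object RoaringVsBitSet {
  def main(args: Array[String]): Unit = {
    val numBlocks = (1 << 17) - 1 // 2^17 - 1, the size tried in the comment above

    // Dense case: mark every block as non-empty.
    val roaring = new RoaringBitmap()
    (0 until numBlocks).foreach(i => roaring.add(i))

    // Without this call, dense containers stay as plain bitmap containers;
    // with it, long runs of consecutive bits become run-length-encoded containers.
    roaring.runOptimize()

    val bitSet = new BitSet(numBlocks)
    bitSet.set(0, numBlocks) // set bits [0, numBlocks)

    println(s"roaring: ${roaring.serializedSizeInBytes()} bytes")
    println(s"BitSet:  ${bitSet.toLongArray.length * 8} bytes")
  }
}
{code}

With run-length encoding, the fully dense case collapses to a handful of runs, which is why the measured BitSet advantage was under 1%.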
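And for the dense/sparse selection rule proposed in the quoted description, a rough sketch of how the choice could be made. MapStatusTrackingEmptyBlocks and MapStatusTrackingNoEmptyBlocks are the names proposed there, not existing Spark classes, and the ~32 bytes-per-entry HashSet cost is an assumed ballpark for illustration, not a measurement.

{code:scala}
// Rough sketch of the proposed selection rule. The class names come from the
// description above and are not existing Spark classes; the per-entry HashSet
// cost is an assumption for illustration only.
object TrackerChoice {
  val BytesPerHashSetEntry = 32 // assumed rough cost of one Int entry in a HashSet

  def chooseTracker(totalBlocks: Int, numNonEmpty: Int): String = {
    val numEmpty = totalBlocks - numNonEmpty
    val bitSetBytes = (totalBlocks + 63) / 64 * 8 // long[] backing BitSet[totalBlockNum]

    if (numEmpty.toLong * BytesPerHashSetEntry < bitSetBytes) {
      "MapStatusTrackingEmptyBlocks" // dense: store only the few empty reduce IDs
    } else if (numNonEmpty.toLong * BytesPerHashSetEntry < bitSetBytes) {
      "MapStatusTrackingNoEmptyBlocks" // sparse: store only the few non-empty reduce IDs
    } else {
      "BitSet-backed MapStatus" // neither HashSet is small enough to win
    }
  }
}
{code}

Under these assumptions, the dense example above (no empty blocks) selects the empty-block tracker with an empty set, and the mostly-empty sparse example selects the non-empty-block tracker.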