[ https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000339#comment-15000339 ]
Daniel Lemire commented on SPARK-11583:
---------------------------------------

[~Qin Yao] [~aimran50] If you guys want to set up a documented benchmark with accompanying analysis, I'd be glad to help, if my help is needed. Ideally, a benchmark should be built on interesting or representative workloads so that it is as relevant as possible to real applications. No data structure, no matter which one, is best in all possible scenarios. There are certainly cases where an uncompressed BitSet is the best choice, others where a hash set is best, and so forth. What matters is what works for the applications at hand.

> Make MapStatus use less memory usage
> ------------------------------------
>
>                 Key: SPARK-11583
>                 URL: https://issues.apache.org/jira/browse/SPARK-11583
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>            Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I said, using a BitSet can save ≈20% memory compared to RoaringBitmap. For a Spark job containing quite a lot of tasks, 20% seems like a drop in the ocean.
> Essentially, a BitSet is backed by a long[]; for example, a BitSet over 200k blocks is a long[3125]. So we can instead use a HashSet[Int] to store reduce ids (when non-empty blocks are dense, store the reduce ids of the empty blocks; when they are sparse, store the non-empty ones):
> For dense cases: if HashSet[Int](numEmptyBlocks).size < BitSet[totalBlockNum], use MapStatusTrackingEmptyBlocks.
> For sparse cases: if HashSet[Int](numNonEmptyBlocks).size < BitSet[totalBlockNum], use MapStatusTrackingNoEmptyBlocks.
> Sparse case (about 299 of every 300 blocks are empty):
>   sc.makeRDD(1 to 30000, 3000).groupBy(x => x).top(5)
> Dense case (no block is empty):
>   sc.makeRDD(1 to 9000000, 3000).groupBy(x => x).top(5)
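
To make the selection rule above concrete, here is a minimal Scala sketch. The ~40-bytes-per-entry cost of a HashSet[Int] is an assumed ballpark (boxed Int plus hash-table overhead), and MapStatusSizeSketch/chooseTracker are illustrative names, not classes from the actual Spark patch:

object MapStatusSizeSketch {
  // Assumed per-entry cost of a scala.collection.mutable.HashSet[Int]:
  // a boxed Int plus table overhead, ballparked at ~40 bytes (an assumption,
  // not a measured figure).
  val bytesPerHashSetEntry: Long = 40L

  // A BitSet over n blocks is backed by a long[] of ceil(n / 64) words,
  // e.g. bitSetBytes(200000) == 3125 * 8 == 25000 bytes (the long[3125] above).
  def bitSetBytes(totalBlocks: Int): Long =
    ((totalBlocks + 63) / 64).toLong * 8L

  def hashSetBytes(numIds: Int): Long =
    numIds.toLong * bytesPerHashSetEntry

  // The dense/sparse rule: store whichever id set (empty or non-empty blocks)
  // is smaller, but only if that HashSet beats a plain BitSet over all blocks.
  def chooseTracker(totalBlocks: Int, numNonEmptyBlocks: Int): String = {
    val numEmptyBlocks = totalBlocks - numNonEmptyBlocks
    val dense = numNonEmptyBlocks >= numEmptyBlocks
    val storedIds = if (dense) numEmptyBlocks else numNonEmptyBlocks
    if (hashSetBytes(storedIds) < bitSetBytes(totalBlocks)) {
      if (dense) "MapStatusTrackingEmptyBlocks"   // dense: store empty ids
      else "MapStatusTrackingNoEmptyBlocks"       // sparse: store non-empty ids
    } else {
      "BitSet-backed MapStatus"
    }
  }

  def main(args: Array[String]): Unit = {
    println(chooseTracker(200000, 200000)) // dense, no empty blocks: 0 B < 25 KB
    println(chooseTracker(200000, 300))    // sparse: ~12 KB of ids < 25 KB BitSet
    println(chooseTracker(200000, 100000)) // half full: ~4 MB of ids, BitSet wins
  }
}

Which representation wins hinges on the real per-entry cost of the hash set on the JVM, which is exactly the kind of crossover a documented benchmark, as suggested above, would pin down.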