Github user squito commented on the pull request: https://github.com/apache/spark/pull/9661#issuecomment-156491055

@yaooqinn thanks for the additional analysis, but I would really prefer this be written up and attached to the JIRA for reference -- extended discussions in the PR are harder to keep track of later on. Please also include the code and the additional cases, in particular the worst-case analysis from alternating bits. If you would prefer that someone else put those details together (maybe me or @lemire), we can help, but it would be great if you could go that last extra mile :)

I don't think the analysis of the "uncontinuous" case and whether to track empty / non-empty blocks is correct. I wouldn't say that `runOptimize` "fails" in any of the cases -- there are times when it doesn't decrease the size, but that doesn't mean another approach would make the size significantly smaller. The key question is: does the memory usage change if we track empty or non-empty blocks? The main case we should be worried about is when the bitset is very full. We *might* be better off inverting all the bits. Here's a test of that, using a bitset of 200k elements where all the elements are set *except* for every 1000th element. We compare the size (a) after adding all elements, (b) after calling `runOptimize`, and (c) after flipping the bits.
Also, just to be extra-sure, I compare by directly adding the bits in their flipped state as well, to make sure that there isn't a size difference from calling `flip` at the end:

```scala
import org.roaringbitmap._

val n = 2e5.toInt
val orig = new RoaringBitmap()
val flip1 = new RoaringBitmap()
(0 until n).foreach { idx =>
  if (idx % 1000 == 0) {
    flip1.add(idx)
  } else {
    orig.add(idx)
  }
}
val flip2 = orig.clone()
flip2.flip(0, n)
val origOptimized = orig.clone()
origOptimized.runOptimize()

scala> orig.getSizeInBytes
res1: Int = 31374

scala> flip1.getSizeInBytes
res2: Int = 432

scala> flip2.getSizeInBytes
res4: Int = 432

scala> origOptimized.getSizeInBytes
res6: Int = 844
```

This shows that the flipped version is a bit better than the version with just `runOptimize` (as Daniel had indicated in earlier comments). However, the difference is pretty small. IMO that's not worth worrying much about.

There are two other reasons we *might* want to use the flipped bits: (1) memory used while building the bitset and (2) time taken to build the bitset. However, both of those reasons are much less important. And in either case, we could only improve things in all cases by making two passes through the array -- one to count the non-zeros and decide which version is sparser, and then a second pass to create the bitset. This is *way* less important, so I wouldn't really worry about it now. Certainly, if we just stick to always tracking the empty blocks, it's no worse than what we have now or what was in 1.5 and before.
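For reference, the two-pass idea could be sketched roughly like this. This is just a sketch, not what the PR implements: `buildSparser` is a hypothetical helper, and a plain `Set[Int]` stands in for `RoaringBitmap` so the snippet is self-contained (with the real library, the second pass would call `bitmap.add(idx)` on whichever side is sparser):

```scala
// Sketch of the two-pass approach: first count the set bits to decide
// which representation is sparser, then record only the sparser side.
// `Set[Int]` is a stand-in for RoaringBitmap; `buildSparser` is hypothetical.
def buildSparser(bits: Array[Boolean]): (Boolean, Set[Int]) = {
  // Pass 1: count non-zeros to decide which version is sparser.
  val setCount = bits.count(identity)
  val trackSetBits = setCount <= bits.length - setCount
  // Pass 2: record only the positions on the sparser side.
  val tracked = bits.indices.filter(i => bits(i) == trackSetBits).toSet
  (trackSetBits, tracked)
}

// Example: 200k bits, all set except every 1000th -- the unset side is sparser,
// so we end up tracking the 200 empty positions instead of the 199800 full ones.
val n = 200000
val bits = Array.tabulate(n)(i => i % 1000 != 0)
val (trackingSetBits, tracked) = buildSparser(bits)
// trackingSetBits == false; tracked.size == 200
```

The extra pass costs one more traversal of the input but bounds the tracked side at half the positions, which matches the "decide which version is sparser" idea above.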