GitHub user squito commented on the pull request:

    https://github.com/apache/spark/pull/9661#issuecomment-156491055
  
    @yaooqinn thanks for the additional analysis, but I would really prefer 
that this be written up and attached to the JIRA for reference -- extended 
discussions in the PR are harder to keep track of later on.  Also include the 
code and the additional cases, in particular the worst-case analysis from 
alternating bits.  If you would prefer that someone else put those details 
together (maybe me or @lemire), we can help, but it would be great if you could 
go that last extra mile :)
    
    I don't think the analysis of the "uncontinuous" case and whether to track 
empty / non-empty blocks is correct.  I wouldn't say that `runOptimize` "fails" 
in any of the cases -- there are times when it doesn't decrease the size, 
but that doesn't mean that another approach would make the size significantly 
smaller.  The key question is: does the memory usage change if we track empty 
or non-empty blocks?  The main case we should be worried about is when the 
bitset is very full.  We *might* be better off inverting all the bits.  Here's 
a test for that: take a bitset of 200k elements, where all the elements are 
set *except* for every 1000th element, and compare the size (a) after adding 
all elements, (b) after calling `runOptimize`, and (c) after flipping the bits.  
Just to be extra-sure, I also compare against directly adding the bits in 
their flipped state, so we know there isn't a size difference simply from 
calling `flip` at the end:
    
    ```scala
    import org.roaringbitmap._

    val n = 2e5.toInt
    // orig: every element set *except* multiples of 1000
    // flip1: only multiples of 1000 set (the flipped form, built directly)
    val orig = new RoaringBitmap()
    val flip1 = new RoaringBitmap()
    (0 until n).foreach { idx =>
      if (idx % 1000 == 0) {
        flip1.add(idx)
      } else {
        orig.add(idx)
      }
    }
    // flip2: the flipped form, obtained by flipping a copy of orig
    val flip2 = orig.clone()
    flip2.flip(0, n)
    // origOptimized: orig after run-length optimization
    val origOptimized = orig.clone()
    origOptimized.runOptimize()
    
    scala> orig.getSizeInBytes
    res1: Int = 31374
    
    scala> flip1.getSizeInBytes
    res2: Int = 432
    
    scala> flip2.getSizeInBytes
    res4: Int = 432
    
    scala> origOptimized.getSizeInBytes
    res6: Int = 844
    ```
    
    This shows that the flipped version is a bit better than the version with 
just `runOptimize` (as Daniel had indicated in earlier comments).  However, the 
difference is pretty small (844 vs. 432 bytes).  IMO that's not worth worrying 
much about.
    
    There are two other reasons we *might* want to use the flipped bits: (1) 
the memory used while building the bitset and (2) the time taken to build it.  
However, both of those reasons are much less important.  And for either one, 
the only way to improve things in all cases would be two passes through the 
array -- one to count the non-zeros and decide which representation is 
sparser, and a second to actually build the bitset.  This is *way* less 
important, so I wouldn't really worry about it now.  Certainly if we just 
stick to always tracking the empty blocks, it's no worse than what we have now 
or what was in 1.5 and before.
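    
    For concreteness, here's a minimal sketch of what that two-pass 
construction could look like.  This is purely illustrative -- 
`buildSparseBitmap` and the `Array[Boolean]` input are assumptions for the 
example, not anything in this PR:
    
    ```scala
    import org.roaringbitmap.RoaringBitmap
    
    // Hypothetical two-pass construction: first decide which representation
    // (set vs. unset) is sparser, then build a bitmap of only the rarer kind.
    // `blocks` is an illustrative Array[Boolean] marking non-empty blocks;
    // it is not the PR's actual data structure.
    def buildSparseBitmap(blocks: Array[Boolean]): (RoaringBitmap, Boolean) = {
      // Pass 1: count the set entries to see whether set or unset is rarer.
      val numSet = blocks.count(identity)
      val trackSet = numSet <= blocks.length - numSet
      // Pass 2: add only the indices of the rarer kind.
      val bitmap = new RoaringBitmap()
      blocks.indices.foreach { idx =>
        if (blocks(idx) == trackSet) bitmap.add(idx)
      }
      bitmap.runOptimize()
      // The flag records which kind of block the bitmap tracks.
      (bitmap, trackSet)
    }
    ```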

