Github user lemire commented on the pull request:

    https://github.com/apache/spark/pull/9243#issuecomment-150668521
  
    @rxin 
    
    There are definitely cases where using compressed bitmaps is 
wasteful, for example when the universe size is small: your bitmaps 
represent sets of integers from [0,n) where n is small (e.g., n=64 or n=128). 
    
    It is just generally true that compression is not always a good idea.
    
    The fact that you are able to use uncompressed BitSet and it does not blow 
up memory usage tells me that you might be in a scenario where compression is 
not useful.
    
    Techniques like Roaring or Concise do not make uncompressed BitSet 
obsolete. Rather, they are there to help when regular BitSets would fail you 
due to excessive memory usage.
    
    How can this happen? Suppose that you are trying to index a column 
containing 1000 distinct integer values. With one bit per distinct value, a 
BitSet uses 125 bytes per row just to index this column. With 10,000 
distinct values, you use over 1 kB per row just to index this one column. 
And so forth.
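    
    As a rough check of this arithmetic, here is a minimal Python sketch. The 
helper name `bitset_bytes_per_row` is hypothetical, and it assumes exactly one 
bit per distinct value with no other overhead. Real implementations such as 
`java.util.BitSet` round up to whole 64-bit words and add object overhead, so 
actual usage would be somewhat higher.

```python
def bitset_bytes_per_row(distinct_values: int) -> int:
    """Bytes for an uncompressed bitset holding one bit per distinct value.

    Assumption: raw bits only, rounded up to whole bytes; no object
    header or 64-bit word alignment is counted.
    """
    return (distinct_values + 7) // 8


if __name__ == "__main__":
    # Small universes (64, 128) versus the larger ones discussed above.
    for n in (64, 128, 1_000, 10_000):
        print(f"{n:>6} distinct values -> {bitset_bytes_per_row(n)} bytes per row")
```

    Under these assumptions, n=64 costs 8 bytes per row while n=10,000 costs 
1250 bytes per row, which is the range where compression starts to matter.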
    
    But if your BitSets are tiny, then compressing them could definitely be 
wasteful.


