[ 
https://issues.apache.org/jira/browse/KYLIN-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032984#comment-15032984
 ] 

Daniel Lemire commented on KYLIN-1186:
--------------------------------------

RoaringBitmap supports 32-bit *unsigned* integers, so it goes up to 4294967295. 
(Java does not have unsigned integers, but RoaringBitmap treats the signed 
integers as if they were unsigned.) This is convenient if you want to wrap 
RoaringBitmap objects into a larger data structure supporting wider integers. 
You can use RoaringBitmap to capture the least significant 32 bits, and use the 
rest of your data structure to support the most significant bits. Another way 
to see the problem is that you can partition your global space in 32-bit spaces 
and use one RoaringBitmap per 32-bit space.
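
As a rough illustration of the partitioning idea above, here is a minimal sketch of a 64-bit set that keeps one 32-bit container per value of the most significant 32 bits. The class name Wide64Set and its methods are hypothetical and are not part of the RoaringBitmap library. To keep the sketch runnable without dependencies, a HashSet<Integer> stands in for the per-partition container; the assumption is that a RoaringBitmap would take its place in real use.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical wrapper: a set of 64-bit values partitioned into 32-bit spaces.
final class Wide64Set {
    // Key: the most significant 32 bits of a value.
    // Value: the least significant 32 bits seen under that key.
    // HashSet<Integer> is only a stand-in here; one RoaringBitmap per key
    // is the container the comment above has in mind.
    private final Map<Integer, Set<Integer>> partitions = new HashMap<>();

    // Upper 32 bits, used to select the partition.
    private static int high(long value) {
        return (int) (value >>> 32);
    }

    // Lower 32 bits, stored as a signed int whose bit pattern is
    // interpreted as unsigned.
    private static int low(long value) {
        return (int) value;
    }

    void add(long value) {
        partitions.computeIfAbsent(high(value), k -> new HashSet<>()).add(low(value));
    }

    boolean contains(long value) {
        Set<Integer> partition = partitions.get(high(value));
        return partition != null && partition.contains(low(value));
    }

    // Total number of distinct 64-bit values: the sum over all partitions.
    long cardinality() {
        long total = 0;
        for (Set<Integer> partition : partitions.values()) {
            total += partition.size();
        }
        return total;
    }
}
```

Note that the low half of a value such as 4294967295 is stored as the Java int -1. This is the "signed treated as unsigned" convention mentioned above, and Integer.toUnsignedLong recovers the unsigned reading when needed.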

There is no plan to support long or BigInt in the RoaringBitmap library itself 
because there might not be a need for it. As described in the previous 
paragraph, the library already makes it easy to build an extension that 
supports wider integers.

Samy Chambi ([email protected]) wrote a few such wrappers and he is working 
on an eventual publication that would compare them. I will point him to this 
issue and he might follow up, or an interested party could get in touch with 
him and see if he is willing to share his code or collaborate.

Alternatively, if someone wanted to contribute a 64-bit or BigInt extension to 
the RoaringBitmap library, we are always interested in receiving pull 
requests, with the caveat that, in this case, it might be better to do the 
work in a separate package that uses RoaringBitmap instead.

> Support precise Count Distinct using bitmap
> -------------------------------------------
>
>                 Key: KYLIN-1186
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1186
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>    Affects Versions: v1.1
>            Reporter: Yerui Sun
>            Assignee: ZhouQianhao
>             Fix For: v2.0, 1.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> For now, Kylin only supports approximate count distinct via HyperLogLog.
> In our production scenario, there are strong requirements for precise count 
> distinct, mainly for columns of type int or bigint, such as user-id, 
> product-id, etc.
> Implementing precise count distinct for all types is difficult and 
> inefficient. However, supporting only int or bigint makes this much easier. 
> The values can be projected into a bitmap, which is easy to compress and 
> store, and easy to count.
> I've created a POC based on RoaringBitmap, which proved that this works. 
> There is some more work to be done:
> * RoaringBitmap only supports int, so a solution is needed to support bigint;
> * Add a new measure and codec, like HyperLogLogPlusCounter, to make it easy 
> to use;
> * Add the new measure to the web UI, and check whether the column type is int 
> or bigint;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
