[ 
https://issues.apache.org/jira/browse/KYLIN-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yerui Sun updated KYLIN-1186:
-----------------------------
    Attachment: KYLIN-1186-2.x-staging.2.patch

Attached the patch for 2.x-staging, finally...
However, I found two issues in 2.x-staging that blocked me for a long time 
while debugging:
* After running BuildCubeWithEngineTest, the inner join cubes 
test_kylin_cube_without_slr_empty and test_kylin_cube_with_slr_empty have only 
6000 rows, not 10000 rows. This makes the query fail, because the H2 query 
returns more results than the Kylin query;
* Merging a cube fails when a Bitmap measure comes after a TopN measure in the 
cube measure desc. The cause is the dict re-encoding of measures in 
MergeCuboidMapper, which makes deserialization in CuboidReducer fail. My guess 
is that the TopN measure is re-encoded with the new dict in MergeCuboidMapper, 
but still decoded with the old dict in CuboidReducer. BTW, I don't see the 
benefit of re-encoding: with additional debug logging, the re-encoded value 
size is equal to or bigger than the original value size.
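
To illustrate the suspected cause of the second issue, here is a minimal Python 
sketch. It is not Kylin code: the dictionaries, the encode/decode helpers and 
the sample values are all hypothetical. It only shows how an ID written with a 
merged dictionary can resolve to the wrong value, or fail outright, when it is 
read back with the old dictionary.

```python
# Hypothetical illustration only: not Kylin's dictionary implementation.

def build_dict(values):
    """Assign dense integer IDs to the sorted distinct values."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def encode(dictionary, value):
    """Look up the integer ID of a value."""
    return dictionary[value]

def decode(dictionary, code):
    """Reverse lookup; raises KeyError if the ID is unknown to this dictionary."""
    reverse = {i: v for v, i in dictionary.items()}
    return reverse[code]

# Old dictionary from a single segment, new dictionary built for the merge.
old_dict = build_dict(["banana", "cherry"])
new_dict = build_dict(["apple", "banana", "cherry"])

# The mapper side re-encodes with the new (merged) dictionary ...
code_banana = encode(new_dict, "banana")  # ID 1 in new_dict
code_cherry = encode(new_dict, "cherry")  # ID 2 in new_dict

# ... but the reducer side still decodes with the old dictionary.
wrong_value = decode(old_dict, code_banana)  # ID 1 means "cherry" in old_dict

try:
    decode(old_dict, code_cherry)  # ID 2 does not exist in old_dict
    failed = False
except KeyError:
    failed = True
```

In this toy setup the same ID either maps to a different value or cannot be 
decoded at all, which is consistent with a deserialization failure on the 
reducer side.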

[[email protected]], would you please take some time to check the two 
issues above? They really affect the stability of 2.x-staging.

> Support precise Count Distinct using bitmap
> -------------------------------------------
>
>                 Key: KYLIN-1186
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1186
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>    Affects Versions: v1.1
>            Reporter: Yerui Sun
>            Assignee: Yerui Sun
>             Fix For: v2.0, v1.3
>
>         Attachments: KYLIN-1186-1.x-staging.2.patch, 
> KYLIN-1186-1.x-staging.patch, KYLIN-1186-2.x-staging.2.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> For now, Kylin only supports approximate count distinct using HyperLogLog.
> In our production scenario, there are strong requirements for precise count 
> distinct, mainly for columns of type int or bigint, such as user-id, 
> product-id, etc.
> Implementing precise count distinct for all types is difficult and 
> inefficient. However, supporting only int or bigint makes this much easier. 
> The values can be projected into a bitmap, which is easy to compress and 
> store, and easy to count.
> I've created a POC based on RoaringBitmap, proving that this works. There's 
> some more work to be done:
> * RoaringBitmap only supports int, so a solution is needed to support bigint;
> * Add a new measure and codec, like HyperLogLogPlusCounter, to make it easy 
> to use;
> * Add the new measure to the web UI, and check whether the column type is 
> int or bigint;
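
The bitmap idea in the description above can be sketched in a few lines of 
Python. This is only an illustration under simplifying assumptions: a plain 
Python int stands in for the bitmap (uncompressed, non-negative values only), 
whereas the POC uses RoaringBitmap. The helper names to_bitmap, merge and 
count_distinct are hypothetical.

```python
# Hypothetical sketch: a Python int used as an uncompressed bitmap.
# A real implementation would use a compressed bitmap such as RoaringBitmap.

def to_bitmap(values):
    """Set bit v for every non-negative integer v in values."""
    bitmap = 0
    for v in values:
        if v < 0:
            raise ValueError("only non-negative integers are supported here")
        bitmap |= 1 << v
    return bitmap

def merge(a, b):
    """Union of two bitmaps; duplicates across inputs collapse naturally."""
    return a | b

def count_distinct(bitmap):
    """Exact distinct count = number of set bits."""
    return bin(bitmap).count("1")

# Two partial results, e.g. from two segments, with overlapping IDs.
segment_1 = to_bitmap([3, 7, 7, 42])
segment_2 = to_bitmap([7, 42, 100])

total = count_distinct(merge(segment_1, segment_2))  # distinct IDs: 3, 7, 42, 100
```

Because merging is a bitwise OR, partial results can be combined in any order 
and the count stays exact, unlike the HyperLogLog estimate.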



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
