Thanks, Shaofeng and Hongbin, for the explanations.

Abhilash, I’m the reporter and contributor of KYLIN-1186, and here are some 
thoughts on its design:

We indeed support only the Int type for now. Casting Long to Int can lose 
precision (Shaofeng removed the cast, and I agree with that). Int was chosen 
mainly because it is enough for most cases. 
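
To illustrate the precision loss, here is a minimal Java sketch. It is not 
the code from KYLIN-1186; the class and method names are made up for this 
example. It only shows that a narrowing cast keeps the low 32 bits, so two 
distinct Long values can collapse into the same Int and be counted once.

```java
// Hypothetical illustration class, not part of Kylin.
class LongToIntTruncation {
    // A narrowing cast keeps only the low 32 bits of the long.
    public static int truncate(long value) {
        return (int) value;
    }

    public static void main(String[] args) {
        long a = 1L;
        long b = (1L << 32) + 1L; // differs from a only above bit 31

        // Two distinct longs become the same int, so a distinct count
        // built on the truncated values would under-count.
        System.out.println(truncate(a) == truncate(b));

        // Math.toIntExact refuses to truncate silently.
        try {
            Math.toIntExact(b);
        } catch (ArithmeticException e) {
            System.out.println("value does not fit in an int");
        }
    }
}
```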

I thought about supporting all types, including String and Date, and 
concluded that it is difficult. One solution is to store all the values, 
which appears too costly. Another is to find a *precise* (collision-free) 
mapping from string to int, for example a dictionary (not a hash, because 
hashed values may collide). 
However, generating the dictionary is still difficult, especially when the 
cardinality is very high. I think KYLIN-1122 faces the same problem, so let’s 
see what solution KYLIN-1122 arrives at; maybe we can borrow something.
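
As a rough sketch of the dictionary idea (an in-memory toy, not Kylin’s 
dictionary and not the approach of KYLIN-1122; all names are hypothetical), 
each new value gets the next unused int id, so the mapping is collision-free 
by construction. The map holds every distinct value, which is exactly why 
this gets expensive at very high cardinality.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class, not part of Kylin.
class ValueDictionary {
    private final Map<String, Integer> ids = new HashMap<>();

    // Returns the existing id for a value, or assigns the next free one.
    public int idOf(String value) {
        Integer id = ids.get(value);
        if (id == null) {
            id = ids.size();
            ids.put(value, id);
        }
        return id;
    }

    public int size() {
        return ids.size();
    }

    public static void main(String[] args) {
        // A hash is not a precise mapping: these two strings collide.
        System.out.println("Aa".hashCode() == "BB".hashCode()); // prints true

        // The dictionary keeps them apart.
        ValueDictionary dict = new ValueDictionary();
        System.out.println(dict.idOf("Aa") != dict.idOf("BB")); // prints true
    }
}
```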

The reason for casting Long to Int was that the bitmap is based on 
RoaringBitmap, maintained by lemire (lem...@gmail.com), which supports only 
Integer. Extending it to Long is rather complicated, so I skipped that for now.
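
For readers unfamiliar with the approach, here is a minimal sketch of 
bitmap-based exact distinct counting. It uses java.util.BitSet from the JDK 
as a stand-in so the example is self-contained; it is not RoaringBitmap and 
not Kylin’s BitmapCounter, and the class name is hypothetical. Like the 
real feature, it accepts only int ids (BitSet additionally requires them to 
be non-negative).

```java
import java.util.BitSet;

// Hypothetical illustration class; BitSet stands in for a compressed bitmap.
class BitmapDistinctCounter {
    private final BitSet bits = new BitSet();

    // Marks an id as seen; adding the same id twice has no further effect.
    public void add(int id) {
        bits.set(id); // throws IndexOutOfBoundsException for negative ids
    }

    // Unions another counter into this one, e.g. when combining partial results.
    public void merge(BitmapDistinctCounter other) {
        bits.or(other.bits);
    }

    // The exact distinct count is the number of set bits.
    public int count() {
        return bits.cardinality();
    }

    public static void main(String[] args) {
        BitmapDistinctCounter a = new BitmapDistinctCounter();
        a.add(1);
        a.add(2);
        a.add(2);

        BitmapDistinctCounter b = new BitmapDistinctCounter();
        b.add(2);
        b.add(3);

        a.merge(b);
        System.out.println(a.count()); // prints 3
    }
}
```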

Overall, this feature only fits the common use case and certainly has room 
for improvement. Please let me know if you have any ideas; any comments are 
welcome.
 
 

> On Jan 28, 2016, at 22:33, ShaoFeng Shi <shaofeng...@apache.org> wrote:
> 
> I removed the code for the long type in BitmapCounter, as the cast produces
> wrong results (while the goal is to provide an accurate value). @Yerui, for
> your awareness: once we find a solution for long, we will add it back.
> 
> 2016-01-28 22:13 GMT+08:00 ShaoFeng Shi <shaofeng...@apache.org>:
> 
>> What's the cardinality of the dimension whose distinct values you want to
>> count? Integer's range is enough for most cases; if your case falls within
>> this range, you can try the bitmap with integer, but you need to map each
>> value to a unique id and use that within the bitmap. For example, if you
>> want to count distinct users, use the numeric user_id instead of the email
>> address. To support other data types, as Hongbin mentioned, the storage
>> cost is very high; we don't have that plan.
>> 
>> 
>> 
>> 
>> 
>> 2016-01-28 20:54 GMT+08:00 hongbin ma <mahong...@apache.org>:
>> 
>>> KYLIN-1186 <https://issues.apache.org/jira/browse/KYLIN-1186> is not a
>>> mature feature yet and it only supports integer
>>> we don't yet have plans to support any other forms of precise distinct
>>> count, as it is too expensive to pre-calculate
>>> 
>>> On Thu, Jan 28, 2016 at 6:56 PM, Abhilash L L <abhil...@infoworks.io>
>>> wrote:
>>> 
>>>> Thanks ShaoFeng Shi,
>>>> 
>>>> We might need for other data types as well
>>>> 
>>>> date & string
>>>> 
>>>> (eg, distinct count of dates of certain activity)
>>>> 
>>>> So in the rest call instead of hllc return type it should be bitmap for
>>>> int,tinyint etc ?
>>>> 
>>>> And we still send it as hllc for other data types ?
>>>> 
>>>> 
>>>> Also in one of the comments, it said we cast long to int..  wont we be
>>>> losing data due to truncation ?
>>>> 
>>>> 
>>>> Regards,
>>>> Abhilash
>>>> 
>>>> On Thu, Jan 28, 2016 at 3:43 PM, ShaoFeng Shi <shaofeng...@apache.org>
>>>> wrote:
>>>> 
>>>>> Does this match your case?
>>>>> https://issues.apache.org/jira/browse/KYLIN-1186
>>>>> 
>>>>> 2016-01-28 17:42 GMT+08:00 Abhilash L L <abhil...@infoworks.io>:
>>>>> 
>>>>>> +user ml
>>>>>> 
>>>>>> Regards,
>>>>>> Abhilash
>>>>>> 
>>>>>> On Thu, Jan 28, 2016 at 11:32 AM, Abhilash L L <abhil...@infoworks.io>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>>   Is there a way to ask Kylin for an exact distinct count? From what we
>>>>>>> understand, we can choose between hllc(10) and hllc(16).
>>>>>>> 
>>>>>>>   I understand that for every cuboid you will need to go through the
>>>>>>> whole data set again, but with the new cubing algo (2.x branch), should
>>>>>>> it be simpler to add?
>>>>>>> 
>>>>>>>   If it is not currently present, are there any plans to introduce this?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Abhilash
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Best regards,
>>>>> 
>>>>> Shaofeng Shi
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Regards,
>>> 
>>> *Bin Mahone | 马洪宾*
>>> Apache Kylin: http://kylin.io
>>> Github: https://github.com/binmahone
>>> 
>> 
>> 
>> 
>> --
>> Best regards,
>> 
>> Shaofeng Shi
>> 
>> 
> 
> 
> -- 
> Best regards,
> 
> Shaofeng Shi
