Hi, Sarnath,

What we want is a *precise* distinct count algorithm. As you said, a Bloom
filter is a *probabilistic* data structure, so we can’t get a precise result
from it.
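
To make the difference concrete, here is a minimal Python sketch. It is not
Kylin code: TinyBloom and both counting helpers are hypothetical names made up
for illustration, and the 64-bit filter is deliberately tiny so that false
positives show up. It only illustrates why counting "unseen" values through a
Bloom filter can undercount, while an exact count has to keep every value.

```python
import hashlib


class TinyBloom:
    """A deliberately small Bloom filter; the sizes are illustrative only."""

    def __init__(self, num_bits=64, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def might_contain(self, value):
        # True can be a false positive; False is always correct.
        return all((self.bits >> p) & 1 for p in self._positions(value))

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p


def bloom_distinct_count(values, bloom):
    """Count values the filter reports as unseen; false positives undercount."""
    count = 0
    for v in values:
        if not bloom.might_contain(v):
            bloom.add(v)
            count += 1
    return count


def exact_distinct_count(values):
    """Exact count: keeps every distinct value, which is the costly part."""
    return len(set(values))


# 200 distinct values, each appearing twice.
values = [f"user{i}@example.com" for i in range(200)] * 2
exact = exact_distinct_count(values)
approx = bloom_distinct_count(values, TinyBloom())
print(exact, approx)  # approx can never exceed exact
```

A larger filter would shrink the error but never remove it, which is why it
does not fit a *precise* count.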

Secondary index, or inverted index, is a big topic. As far as I know, we
haven’t decided how to leverage an inverted index in Kylin; maybe we could
discuss this in another thread.

> On Jan 29, 2016, at 08:16, Sarnath <stell...@gmail.com> wrote:
> 
> Just thinking out loud: For distinct count for complex data types, bloom
> filter can be considered after hashing them to some hash-code. Bloom is a
> probabilistic data structure that can handle the set-presence enquiries
> faster but with a tentative answer.
> Alternatively a secondary index for the column (or distinct values of that
> column) through Solr/ElasticSearch may also work.
> On Jan 29, 2016 2:41 AM, "Li Yang" <liy...@apache.org> wrote:
> 
>> If you can figure out a good mapping from date/string to int/long, then
>> the bitmap is a good solution. E.g. date maps to integer very well.
>> 
>> We expect the community will contribute more in this area.
>> 
>> 
>> On Friday, January 29, 2016, Yerui Sun <sunye...@gmail.com> wrote:
>> 
>>> Thanks shaofeng and hongbin for your explanations.
>>> 
>>> Abhilash, I’m the reporter and contributor of KYLIN-1186, and here are some
>>> thoughts about the design:
>>> 
>>> We indeed support only the Int type for now, and casting Long to Int may
>>> lose data through truncation (shaofeng removed the casting and I agreed
>>> with that). The main reason is that Int is enough for most cases.
>>> 
>>> I thought about supporting all types, including String and Date, and
>>> concluded that it is difficult. One solution is to store all the values,
>>> which appears too costly. Another is to find a *precise* mapping from
>>> string to int, for example a dictionary (not a hash, because hashed values
>>> may collide).
>>> However, generating the dictionary is still difficult, especially when the
>>> cardinality is very high. I think KYLIN-1122 faces the same problem, so
>>> let’s see what the solution is in KYLIN-1122; maybe we could borrow
>>> something.
>>> 
>>> The reason for casting Long to Int is that the bitmap is based on
>>> RoaringBitmap, which is maintained by lemire (lem...@gmail.com) and
>>> supports only Integer. Expanding it to Long is rather complicated, so I
>>> skipped that for now.
>>> 
>>> Overall, this feature fits only the common use case and definitely has
>>> room for improvement. Please let me know if you have any ideas; any
>>> comments are welcome.
>>> 
>>> 
>>> 
>>>> On Jan 28, 2016, at 22:33, ShaoFeng Shi <shaofeng...@apache.org> wrote:
>>>> 
>>>> I removed the code for the long type in BitmapCounter, as the casting
>>>> would give wrong results (while the goal is to provide an accurate value).
>>>> @Yerui, for your awareness: once we find a solution for long, we can add
>>>> it back.
>>>> 
>>>> 2016-01-28 22:13 GMT+08:00 ShaoFeng Shi <shaofeng...@apache.org>:
>>>> 
>>>>> What's the cardinality of the dimension whose distinct values you want
>>>>> to count? Integer's range is enough for most cases; if your case falls
>>>>> within that range, you can try the bitmap with integer, but you need to
>>>>> map each value to a unique id and use that id within the bitmap. For
>>>>> example, if you want to count distinct users, use the numeric user_id
>>>>> instead of the email address. As for supporting other data types, as
>>>>> Hongbin mentioned, the storage cost is very high, so we don't have that
>>>>> plan.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 2016-01-28 20:54 GMT+08:00 hongbin ma <mahong...@apache.org>:
>>>>> 
>>>>>> KYLIN-1186 <https://issues.apache.org/jira/browse/KYLIN-1186> is not a
>>>>>> mature feature yet, and it only supports integer.
>>>>>> We don't yet have plans to support any other forms of precise distinct
>>>>>> count, as it is too expensive to pre-calculate.
>>>>>> 
>>>>>> On Thu, Jan 28, 2016 at 6:56 PM, Abhilash L L <abhil...@infoworks.io>
>>>>>> wrote:
>>>>>> 
>>>>>>> Thanks ShaoFeng Shi,
>>>>>>> 
>>>>>>> We might need it for other data types as well:
>>>>>>> 
>>>>>>> date & string
>>>>>>> 
>>>>>>> (eg, distinct count of dates of certain activity)
>>>>>>> 
>>>>>>> So in the REST call, instead of hllc, the return type should be bitmap
>>>>>>> for int, tinyint, etc.?
>>>>>>> 
>>>>>>> And we still send it as hllc for other data types ?
>>>>>>> 
>>>>>>> 
>>>>>>> Also, one of the comments said we cast long to int. Won't we be
>>>>>>> losing data due to truncation?
>>>>>>> 
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Abhilash
>>>>>>> 
>>>>>>> On Thu, Jan 28, 2016 at 3:43 PM, ShaoFeng Shi <shaofeng...@apache.org>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Does this match your case?
>>>>>>>> https://issues.apache.org/jira/browse/KYLIN-1186
>>>>>>>> 
>>>>>>>> 2016-01-28 17:42 GMT+08:00 Abhilash L L <abhil...@infoworks.io>:
>>>>>>>> 
>>>>>>>>> +user ml
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Abhilash
>>>>>>>>> 
>>>>>>>>> On Thu, Jan 28, 2016 at 11:32 AM, Abhilash L L <abhil...@infoworks.io>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hello,
>>>>>>>>>> 
>>>>>>>>>>  Is there a way to ask Kylin for an exact distinct count? From what
>>>>>>>>>> we understand, we can choose between hllc(10) and hllc(16).
>>>>>>>>>> 
>>>>>>>>>>  I understand that for every cuboid you would need to go through the
>>>>>>>>>> whole data set again, but with the new cubing algo (2.x branch) it
>>>>>>>>>> should be simpler to add?
>>>>>>>>>> 
>>>>>>>>>>  If it is not currently present, are there any plans to introduce it?
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Abhilash
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> 
>>>>>>>> Shaofeng Shi
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Regards,
>>>>>> 
>>>>>> *Bin Mahone | 马洪宾*
>>>>>> Apache Kylin: http://kylin.io
>>>>>> Github: https://github.com/binmahone
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Best regards,
>>>>> 
>>>>> Shaofeng Shi
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Best regards,
>>>> 
>>>> Shaofeng Shi
>>> 
>>> 
>> 
