Hi Sarnath, what we want is a *precise* distinct count algorithm. As you said, Bloom is a *probabilistic* data structure, so we can't get a precise result from it.
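
To make the point concrete, here is a minimal sketch in Python. It is not Kylin code; the TinyBloom class, its sizes, and the sample values are all made up for illustration. It counts distinct values once with an exact set and once with a deliberately undersized Bloom filter. Because a Bloom filter can report a false "already seen", the filter-based count can only fall short of the true count, never exceed it.

```python
import hashlib


class TinyBloom:
    """A deliberately small Bloom filter (hypothetical, for illustration only)."""

    def __init__(self, num_bits=64, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def might_contain(self, value):
        # True means "possibly seen"; False means "definitely not seen".
        return all((self.bits >> p) & 1 for p in self._positions(value))

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p


def exact_distinct(values):
    # Exact answer: keep every distinct value.
    return len(set(values))


def bloom_distinct(values, bloom):
    # Count a value only when the filter says it is definitely new.
    count = 0
    for v in values:
        if not bloom.might_contain(v):
            bloom.add(v)
            count += 1
    return count


# 200 distinct values, each seen twice
values = [f"user{i}" for i in range(200)] * 2
exact = exact_distinct(values)
approx = bloom_distinct(values, TinyBloom())
print(exact, approx)
```

A larger filter shrinks the error but does not remove it, which is why it does not fit a precise count.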
Secondary index, or inverted index, is a big topic. As far as I know, we haven't decided how to leverage an inverted index in Kylin; maybe we could discuss this in another thread.

> On Jan 29, 2016, at 08:16, Sarnath <stell...@gmail.com> wrote:
>
> Just thinking out loud: For distinct count for complex data types, bloom
> filter can be considered after hashing them to some hash-code. Bloom is a
> probabilistic data structure that can handle the set-presence enquiries
> faster but with a tentative answer.
> Alternatively a secondary index for the column (or distinct values of that
> column) through Solr/ElasticSearch may also work.
> On Jan 29, 2016 2:41 AM, "Li Yang" <liy...@apache.org> wrote:
>
>> If you can figure out a good mapping from date/string to int/long, then
>> the bitmap is a good solution. E.g. date maps to integer very well.
>>
>> Expect the community will have more contributions in this area.
>>
>>
>> On Friday, January 29, 2016, Yerui Sun <sunye...@gmail.com> wrote:
>>
>>> Thanks shaofeng and hongbin for your explanations.
>>>
>>> Abhilash, I'm the reporter and contributor of KYLIN-1186, and here are some
>>> thoughts about the design:
>>>
>>> We indeed support only the Int type for now, and casting Long to Int may cause
>>> precision loss (shaofeng removed the casting and I agree with that); the
>>> main reason is that Int is enough for most cases.
>>>
>>> I thought about supporting all types, including String or Date, and the
>>> conclusion is that it's difficult. One solution is to store all the values,
>>> which appears too costly; another is to find a *precise*
>>> projection from string to int, for example a dict (not a hash, because the
>>> projection may conflict).
>>> However, generating the dict is still difficult, especially when the
>>> cardinality is very high. I think KYLIN-1122 is facing the same problem, so
>>> let's see what the solution in KYLIN-1122 is; maybe we could borrow
>>> something.
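
The dict-then-bitmap idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the KYLIN-1186 implementation: the DictBitmapCounter class is hypothetical, the dictionary is a plain in-memory Python dict, and a Python int stands in for a compressed bitmap such as RoaringBitmap.

```python
class DictBitmapCounter:
    """Exact distinct count: dictionary-encode values, then set one bit per id."""

    def __init__(self):
        self.dictionary = {}  # value -> dense integer id; ids never collide
        self.bitmap = 0       # a Python int standing in for a compressed bitmap

    def add(self, value):
        # Assign the next free id the first time a value is seen,
        # otherwise reuse the id already stored for that value.
        value_id = self.dictionary.setdefault(value, len(self.dictionary))
        self.bitmap |= 1 << value_id

    def count(self):
        # The number of set bits equals the number of distinct values added.
        return bin(self.bitmap).count("1")


counter = DictBitmapCounter()
for v in ["2016-01-28", "2016-01-29", "2016-01-28", "a@example.com"]:
    counter.add(v)
distinct = counter.count()
print(distinct)
```

Unlike a hash, the dictionary gives every distinct string its own id, so the count stays exact. The cost is that the dictionary itself grows with the cardinality, which is the difficulty mentioned above.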
>>>
>>> The reason for casting Long to Int is that the bitmap is based on RoaringBitmap,
>>> which is maintained by lemire (lem...@gmail.com) and supports only
>>> Integer. Expanding it to Long is kind of complicated, so I
>>> skipped that for now.
>>>
>>> Overall, this feature just fits the common use case, and absolutely has
>>> room for improvement. Please let me know if you have any ideas; any
>>> comment is welcome.
>>>
>>>> On Jan 28, 2016, at 22:33, ShaoFeng Shi <shaofeng...@apache.org> wrote:
>>>>
>>>> I removed the code for long type in BitmapCounter as the casting will get
>>>> things wrong (but the target is to provide an accurate value); @Yerui, for
>>>> your awareness; once we find the solution for long, we can add it back.
>>>>
>>>> 2016-01-28 22:13 GMT+08:00 ShaoFeng Shi <shaofeng...@apache.org>:
>>>>
>>>>> What's the cardinality of the dimension whose distinct values you want
>>>>> to count? Integer's range is enough for most cases; if your case is within
>>>>> this scope, you can try the bitmap with integer, but you need to map the
>>>>> value to a unique id and use that within the bitmap. For example, if you
>>>>> want to count distinct users, use the numeric user_id instead of the email
>>>>> address. To support other data types, as Hongbin mentioned, the storage
>>>>> cost is very high; we don't have that plan.
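
The truncation concern about casting long to int can be shown with a small Python sketch. It assumes the cast behaves like a standard Java narrowing conversion, which keeps only the low 32 bits; the to_int32 helper and the sample ids are hypothetical and not taken from Kylin.

```python
def to_int32(value):
    """Emulate a Java-style narrowing cast from long to int (keeps the low 32 bits)."""
    low = value & 0xFFFFFFFF
    # Reinterpret the low 32 bits as a signed 32-bit integer.
    return low - (1 << 32) if low >= (1 << 31) else low


# Four distinct 64-bit ids; two pairs share the same low 32 bits.
user_ids = [1, 1 + (1 << 32), 7, 7 + (1 << 33)]

exact = len(set(user_ids))
after_cast = len({to_int32(u) for u in user_ids})
print(exact, after_cast)
```

Distinct long values that differ only in their high bits collapse to the same int, so a bitmap built after the cast undercounts.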
>>>>>
>>>>> 2016-01-28 20:54 GMT+08:00 hongbin ma <mahong...@apache.org>:
>>>>>
>>>>>> KYLIN-1186 <https://issues.apache.org/jira/browse/KYLIN-1186> is not a
>>>>>> mature feature yet and it only supports integer
>>>>>> we don't yet have plans to support any other forms of precise distinct
>>>>>> count, as it is too expensive to pre-calculate
>>>>>>
>>>>>> On Thu, Jan 28, 2016 at 6:56 PM, Abhilash L L <abhil...@infoworks.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks ShaoFeng Shi,
>>>>>>>
>>>>>>> We might need for other data types as well
>>>>>>>
>>>>>>> date & string
>>>>>>>
>>>>>>> (eg, distinct count of dates of certain activity)
>>>>>>>
>>>>>>> So in the rest call instead of hllc return type it should be bitmap for
>>>>>>> int,tinyint etc ?
>>>>>>>
>>>>>>> And we still send it as hllc for other data types ?
>>>>>>>
>>>>>>>
>>>>>>> Also in one of the comments, it said we cast long to int.. wont we be
>>>>>>> losing data due to truncation ?
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Abhilash
>>>>>>>
>>>>>>> On Thu, Jan 28, 2016 at 3:43 PM, ShaoFeng Shi <shaofeng...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> is this matched your case?
>>>>>>>> https://issues.apache.org/jira/browse/KYLIN-1186
>>>>>>>>
>>>>>>>> 2016-01-28 17:42 GMT+08:00 Abhilash L L <abhil...@infoworks.io>:
>>>>>>>>
>>>>>>>>> +user ml
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Abhilash
>>>>>>>>>
>>>>>>>>> On Thu, Jan 28, 2016 at 11:32 AM, Abhilash L L <abhil...@infoworks.io>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Is there a way to ask Kylin to get exact distinct count ? From what we
>>>>>>>>>> understand, we can choose between hllc(10) to hllc(16)
>>>>>>>>>>
>>>>>>>>>> I understand that for every cuboid, you will need to go through the
>>>>>>>>>> whole data set again, but with the new cubing algo (2.x branch) should be
>>>>>>>>>> simpler to add ?
>>>>>>>>>>
>>>>>>>>>> If currently not present are there any plans to introduce this ?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Abhilash
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Shaofeng Shi
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>>
>>>>>> *Bin Mahone | 马洪宾*
>>>>>> Apache Kylin: http://kylin.io
>>>>>> Github: https://github.com/binmahone
>>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>>
>>>>> Shaofeng Shi
>>>>>
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>> Shaofeng Shi