Just thinking out loud: for distinct counts over complex data types, a Bloom filter could be considered after hashing the values to a hash code. A Bloom filter is a probabilistic data structure that answers set-membership queries quickly, but only approximately: it can report false positives, never false negatives. Alternatively, a secondary index on the column (or on its distinct values) in Solr/Elasticsearch may also work.

On Jan 29, 2016 2:41 AM, "Li Yang" <liy...@apache.org> wrote:
> If you can figure out a good mapping from date/string to int/long, then
> the bitmap is a good solution. E.g. a date maps to an integer very well.
>
> I expect the community will make more contributions in this area.
>
> On Friday, January 29, 2016, Yerui Sun <sunye...@gmail.com> wrote:
>
> > Thanks, Shaofeng and Hongbin, for your explanations.
> >
> > Abhilash, I'm the reporter and contributor of KYLIN-1186, and here is
> > some of the thinking behind the design:
> >
> > We indeed support only the Int type for now, and casting Long to Int may
> > lose precision (Shaofeng removed the cast and I agree with that). The
> > main reason is that Int is enough for most cases.
> >
> > I thought about supporting all types, including String and Date, and
> > concluded that it is difficult. One solution is to store all the values,
> > which appears too costly. Another is to find a *precise* mapping from
> > string to int, for example a dictionary (not a hash, because hashed
> > values may collide).
> > However, generating the dictionary is still difficult, especially when
> > the cardinality is very high. I think KYLIN-1122 faces the same problem,
> > so let's see what solution KYLIN-1122 arrives at; maybe we can borrow
> > something.
> >
> > The reason for casting Long to Int was that the bitmap is based on
> > RoaringBitmap, maintained by lemire (lem...@gmail.com), which supports
> > only Integer. Extending it to Long is somewhat complicated, so I skipped
> > that for now.
> >
> > Overall, this feature fits only the common use case and definitely has
> > room for improvement. Please let me know if you have any ideas; any
> > comments are welcome.
> >
> > > On Jan 28, 2016, at 22:33, ShaoFeng Shi <shaofeng...@apache.org> wrote:
> > >
> > > I removed the code for the long type in BitmapCounter, as the cast would
> > > give wrong results (whereas the goal is to provide an accurate value).
> > > @Yerui, for your awareness: once we find a solution for long, we will
> > > add it back.
> > >
> > > 2016-01-28 22:13 GMT+08:00 ShaoFeng Shi <shaofeng...@apache.org>:
> > >
> > >> What's the cardinality of the dimension whose distinct values you want
> > >> to count? Integer's range is enough for most cases; if your case falls
> > >> within it, you can try the bitmap with integer, but you need to map each
> > >> value to a unique id and use that id within the bitmap. For example, if
> > >> you want to count distinct users, use the numeric user_id instead of
> > >> the email address. As for supporting other data types, as Hongbin
> > >> mentioned, the storage cost is very high, so we don't have that plan.
> > >>
> > >> 2016-01-28 20:54 GMT+08:00 hongbin ma <mahong...@apache.org>:
> > >>
> > >>> KYLIN-1186 <https://issues.apache.org/jira/browse/KYLIN-1186> is not a
> > >>> mature feature yet, and it only supports integers.
> > >>> We don't yet have plans to support any other forms of precise distinct
> > >>> count, as it is too expensive to pre-calculate.
> > >>>
> > >>> On Thu, Jan 28, 2016 at 6:56 PM, Abhilash L L <abhil...@infoworks.io>
> > >>> wrote:
> > >>>
> > >>>> Thanks ShaoFeng Shi,
> > >>>>
> > >>>> We might need this for other data types as well:
> > >>>>
> > >>>> date & string
> > >>>>
> > >>>> (e.g., distinct count of the dates of a certain activity)
> > >>>>
> > >>>> So in the REST call, instead of hllc, the return type should be bitmap
> > >>>> for int, tinyint, etc.?
> > >>>>
> > >>>> And we still send it as hllc for other data types?
> > >>>>
> > >>>> Also, one of the comments said we cast long to int. Won't we be
> > >>>> losing data due to truncation?
> > >>>>
> > >>>> Regards,
> > >>>> Abhilash
> > >>>>
> > >>>> On Thu, Jan 28, 2016 at 3:43 PM, ShaoFeng Shi <shaofeng...@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>>> Does this match your case?
> > >>>>> https://issues.apache.org/jira/browse/KYLIN-1186
> > >>>>>
> > >>>>> 2016-01-28 17:42 GMT+08:00 Abhilash L L <abhil...@infoworks.io>:
> > >>>>>
> > >>>>>> +user ml
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Abhilash
> > >>>>>>
> > >>>>>> On Thu, Jan 28, 2016 at 11:32 AM, Abhilash L L <abhil...@infoworks.io>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> Is there a way to ask Kylin for an exact distinct count? From what
> > >>>>>>> we understand, we can choose between hllc(10) and hllc(16).
> > >>>>>>>
> > >>>>>>> I understand that for every cuboid you would need to go through the
> > >>>>>>> whole data set again, but with the new cubing algorithm (2.x branch)
> > >>>>>>> this should be simpler to add?
> > >>>>>>>
> > >>>>>>> If it is not currently present, are there any plans to introduce it?
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Abhilash
> > >>>>>
> > >>>>> --
> > >>>>> Best regards,
> > >>>>>
> > >>>>> Shaofeng Shi
> > >>>
> > >>> --
> > >>> Regards,
> > >>>
> > >>> *Bin Mahone | 马洪宾*
> > >>> Apache Kylin: http://kylin.io
> > >>> Github: https://github.com/binmahone
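
Coming back to the mapping idea discussed in this thread (dates map to integers naturally, strings need a dictionary rather than a hash), here is a minimal Python sketch of how an exact distinct count could work once every value has a unique integer id. It is not Kylin's BitmapCounter or the RoaringBitmap API: the names ExactDistinctCounter, date_to_id and Dictionary are hypothetical, and a plain Python int stands in for a compressed bitmap.

```python
from datetime import date


class ExactDistinctCounter:
    """Exact distinct count over non-negative integer ids.

    Assumption: a Python int is used as an uncompressed bitmap purely for
    illustration; a real system would use a compressed bitmap instead.
    """

    def __init__(self):
        self.bits = 0

    def add(self, value_id):
        if value_id < 0:
            raise ValueError("ids must be non-negative")
        self.bits |= 1 << value_id  # setting the same bit twice is a no-op

    def merge(self, other):
        # Bitwise OR keeps each id once, so merged counts stay exact.
        self.bits |= other.bits

    def count(self):
        return bin(self.bits).count("1")


def date_to_id(d, origin=date(1970, 1, 1)):
    """Map a date to days since a chosen origin (a collision-free mapping)."""
    return (d - origin).days


class Dictionary:
    """Assigns each distinct string the next free integer id (no collisions)."""

    def __init__(self):
        self.ids = {}

    def encode(self, s):
        # len(self.ids) is evaluated before insertion, so ids run 0, 1, 2, ...
        return self.ids.setdefault(s, len(self.ids))


# Dates: three rows, two distinct activity dates.
date_counter = ExactDistinctCounter()
for d in [date(2016, 1, 28), date(2016, 1, 29), date(2016, 1, 28)]:
    date_counter.add(date_to_id(d))
distinct_dates = date_counter.count()

# Strings: dictionary-encode first, then count.
dictionary = Dictionary()
email_counter = ExactDistinctCounter()
for email in ["a@example.com", "b@example.com", "a@example.com"]:
    email_counter.add(dictionary.encode(email))
distinct_emails = email_counter.count()

# Two partial counters (e.g. two data segments) merge without double counting.
seg1, seg2 = ExactDistinctCounter(), ExactDistinctCounter()
for i in (1, 2):
    seg1.add(i)
for i in (2, 3):
    seg2.add(i)
seg1.merge(seg2)
merged_distinct = seg1.count()
```

The dictionary is what makes the string case exact, and it is also where the cost mentioned above shows up: it has to hold every distinct value, which gets expensive at very high cardinality.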