Just thinking out loud: for distinct counts over complex data types, a
Bloom filter could be considered after hashing the values to a hash code.
A Bloom filter is a probabilistic data structure that answers
set-membership queries quickly, but only approximately: it can report
false positives, so a distinct count built on it would not be exact.
Alternatively, a secondary index on the column (or on the distinct values
of that column) through Solr/Elasticsearch may also work.
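
To make the Bloom filter idea above a bit more concrete, here is a minimal
Python sketch. It is not Kylin code: the class name, the SHA-256 double
hashing and the default sizes are all assumptions chosen for illustration.
A value is counted only when at least one of its bits was still unset, so
the result can undercount (on a false positive) but never overcounts.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative, not Kylin's code)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str):
        # Assumption: derive k positions from one SHA-256 digest by
        # double hashing (h1 + i * h2); any good hash family would do.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_if_absent(self, value: str) -> bool:
        """Set the value's bits; return True if it was definitely new."""
        is_new = False
        for pos in self._positions(value):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                is_new = True  # an unset bit proves the value was unseen
                self.bits[byte] |= mask
        return is_new


def approx_distinct_count(values, num_bits=1 << 20, num_hashes=7) -> int:
    """Count values the filter reports as new; a lower bound on the truth."""
    bloom = BloomFilter(num_bits, num_hashes)
    return sum(1 for v in values if bloom.add_if_absent(v))
```

Usage would be something like approx_distinct_count(column_values) over
the string or date values of a column. The memory is fixed up front by
num_bits, which is the attraction compared with storing every value, at
the cost of exactness.
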
On Jan 29, 2016 2:41 AM, "Li Yang" <liy...@apache.org> wrote:

> If you can figure out a good mapping from date/string to int/long, then
> the bitmap is a good solution. E.g. a date maps to an integer very well.
>
> We expect the community will make more contributions in this area.
>
>
> On Friday, January 29, 2016, Yerui Sun <sunye...@gmail.com> wrote:
>
> > Thanks, Shaofeng and Hongbin, for the explanations.
> >
> > Abhilash, I'm the reporter and contributor of KYLIN-1186, and here are
> > some thoughts on the design:
> >
> > We indeed support only the Int type for now, and casting Long to Int may
> > lose precision (Shaofeng removed the casting, and I agree with that). The
> > main reason is that Int is enough for most cases.
> >
> > I thought about supporting all types, including String and Date, and
> > concluded that it is difficult. One solution is to store all the values,
> > which appears too costly. Another is to find a *precise* mapping from
> > string to int, for example a dictionary (not a hash, because hashed
> > values may collide).
> > However, generating the dictionary is still difficult, especially when the
> > cardinality is very high. I think KYLIN-1122 faces the same problem, so
> > let's see what solution KYLIN-1122 arrives at; maybe we can borrow
> > something.
> >
> > The reason for casting Long to Int is that the bitmap is based on
> > RoaringBitmap, which is maintained by lemire (lem...@gmail.com) and
> > supports only Integer. Extending it to Long is rather complicated, so I
> > skipped that for now.
> >
> > Overall, this feature fits only the common use case and definitely has
> > room for improvement. Please let me know if you have any ideas; any
> > comments are welcome.
> >
> >
> >
> > > On Jan 28, 2016, at 22:33, ShaoFeng Shi <shaofeng...@apache.org> wrote:
> > >
> > > I removed the code for the long type in BitmapCounter, as the casting
> > > gives wrong results (while the goal is to provide an exact value).
> > > @Yerui, for your awareness: once we find a solution for long, we can
> > > add it back.
> > >
> > > 2016-01-28 22:13 GMT+08:00 ShaoFeng Shi <shaofeng...@apache.org>:
> > >
> > >> What's the cardinality of the dimension whose distinct values you want
> > >> to count? The Integer range is enough for most cases; if your case
> > >> falls within it, you can try the bitmap with integer, but you need to
> > >> map each value to a unique id and use that within the bitmap. For
> > >> example, if you want to count distinct users, use the numeric user_id
> > >> instead of the email address. As for supporting other data types, as
> > >> Hongbin mentioned, the storage cost is very high, so we don't have
> > >> that plan.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> 2016-01-28 20:54 GMT+08:00 hongbin ma <mahong...@apache.org>:
> > >>
> > >>> KYLIN-1186 <https://issues.apache.org/jira/browse/KYLIN-1186> is not
> > >>> a mature feature yet, and it only supports integer.
> > >>> We don't yet have plans to support any other forms of precise
> > >>> distinct count, as it is too expensive to pre-calculate.
> > >>>
> > >>> On Thu, Jan 28, 2016 at 6:56 PM, Abhilash L L <abhil...@infoworks.io>
> > >>> wrote:
> > >>>
> > >>>> Thanks ShaoFeng Shi,
> > >>>>
> > >>>> We might need this for other data types as well:
> > >>>>
> > >>>> date & string
> > >>>>
> > >>>> (e.g., distinct count of dates of a certain activity)
> > >>>>
> > >>>> So in the REST call, instead of hllc, the return type should be
> > >>>> bitmap for int, tinyint, etc.?
> > >>>>
> > >>>> And we still send it as hllc for other data types?
> > >>>>
> > >>>>
> > >>>> Also, one of the comments said we cast long to int. Won't we be
> > >>>> losing data due to truncation?
> > >>>>
> > >>>>
> > >>>> Regards,
> > >>>> Abhilash
> > >>>>
> > >>>> On Thu, Jan 28, 2016 at 3:43 PM, ShaoFeng Shi <shaofeng...@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>>> Does this match your case?
> > >>>>> https://issues.apache.org/jira/browse/KYLIN-1186
> > >>>>>
> > >>>>> 2016-01-28 17:42 GMT+08:00 Abhilash L L <abhil...@infoworks.io>:
> > >>>>>
> > >>>>>> +user ml
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Abhilash
> > >>>>>>
> > >>>>>> On Thu, Jan 28, 2016 at 11:32 AM, Abhilash L L <abhil...@infoworks.io>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>>   Is there a way to ask Kylin for an exact distinct count? From
> > >>>>>>> what we understand, we can choose between hllc(10) and hllc(16).
> > >>>>>>>
> > >>>>>>>   I understand that for every cuboid you would need to go through
> > >>>>>>> the whole data set again, but with the new cubing algorithm (2.x
> > >>>>>>> branch) it should be simpler to add?
> > >>>>>>>
> > >>>>>>>   If it is not currently present, are there any plans to introduce
> > >>>>>>> it?
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Abhilash
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> Best regards,
> > >>>>>
> > >>>>> Shaofeng Shi
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Regards,
> > >>>
> > >>> *Bin Mahone | 马洪宾*
> > >>> Apache Kylin: http://kylin.io
> > >>> Github: https://github.com/binmahone
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Best regards,
> > >>
> > >> Shaofeng Shi
> > >>
> > >>
> > >
> > >
> > > --
> > > Best regards,
> > >
> > > Shaofeng Shi
> >
> >
>
