Re: Group by + where clause

Li Yang Mon, 14 Dec 2015 00:21:07 -0800

The experiment in the blog didn't address the big data problem, which is
the key challenge. Kylin is designed for big data that cannot fit in
memory. That is why Kylin takes the pre-aggregation approach, and it is
also a big reason why HBase was selected.


Still, the exploration of alternative storage is inspiring. I had a hunch
similar to Julian's that a hybrid of indexed and aggregated storage might
work out very well. Kylin is always open to a better storage that features
1) big data; 2) fast range scan; 3) read-time coprocessor; 4) secondary
index. HBase's shortcoming is 4).
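
To illustrate why 2) and 4) are separate requirements, here is a minimal
Python sketch. It is not Kylin or HBase code: the data, the dimension names
(country, category) and all function names are made up for illustration.
Rows are kept sorted by a composite key, so a filter on the leading
dimension is a cheap range scan, while a filter on the second dimension
must examine every row unless a secondary index exists.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical pre-aggregated rows: (country, category) -> total sales.
# Sorting by the composite key loosely mimics a rowkey-ordered store.
rows = sorted([
    (("CN", "books"), 120),
    (("CN", "toys"), 80),
    (("IN", "books"), 60),
    (("IN", "toys"), 40),
    (("US", "books"), 200),
    (("US", "toys"), 150),
])
keys = [key for key, _ in rows]


def scan_by_country(country):
    """Range scan on the leading dimension; touches only matching rows."""
    lo = bisect_left(keys, (country,))
    hi = bisect_right(keys, (country, "\uffff"))
    return rows[lo:hi], hi - lo  # (matching rows, rows examined)


def scan_by_category(category):
    """Filter on the second dimension; every row has to be examined."""
    hits = [row for row in rows if row[0][1] == category]
    return hits, len(rows)


# A secondary index (assumed here, absent in plain HBase): category -> positions.
category_index = defaultdict(list)
for pos, (key, _) in enumerate(rows):
    category_index[key[1]].append(pos)


def lookup_by_category(category):
    """Same filter as scan_by_category, answered through the secondary index."""
    positions = category_index.get(category, [])
    return [rows[p] for p in positions], len(positions)
```

In this toy setup, scan_by_category("toys") and lookup_by_category("toys")
return the same rows, but the first examines all six rows and the second
only the three that match. Real stores differ in many ways (distribution,
compression, coprocessors), so treat this only as a picture of the access
pattern.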


On Sat, Dec 12, 2015 at 9:49 AM, Sarnath <[email protected]> wrote:

> Hi Luke,
>
> A few points:
>
> 1)
> As I mentioned above, the KV pairs corresponding to an aggregation are
> stored as one Elasticsearch document. ES indexes all fields and takes
> care of the REST API DSL.
> The KV pairs are the same as what Kylin stores in HBase. Kylin, as per my
> understanding, splits the KV pairs between rowkey and columns: the
> dimensions go to the rowkey and the metrics go to columns. I believe that
> is why Kylin does a full scan for the query posted by Seshu. ES does not
> differentiate between metrics and dimensions; it indexes everything. Hence
> the range queries mentioned by Seshu should also run pretty fast with ES.
> We will experiment with that and report here as well.
>
> 2)
> We don't do SQL to REST API conversion yet. The entire REST API DSL is
> provided by ES, so we spend no effort on the REST API ourselves.
>
> 3)
> In the blog, we only make claims about the fluctuation in performance
> when filtering group-by queries on different dimensions. We don't make
> claims about performance itself yet, but we will get there soon.
>
> 4)
> Druid, as I understood from Julian's email, does not build a cube. It
> stores raw data in sorted order so that OLAP queries (group by) can be
> answered fast without building a cube.
>
> Best,
> Sarnath
> On Dec 12, 2015 5:43 AM, "Luke Han" <[email protected]> wrote:
>
> > Would you mind sharing more detail about how you index these
> > aggregations and how your queries are converted to the ES API?
> >
> > BTW, is this similar to what Druid is doing?
> >
> >
> > >Multiple indexing is what we take advantage of. ES, by default indexes
> > >on all fields of a document. We store a multidimensional aggregation
> > >as an ES document whose fields are the various dimensions and metrics
> > >associated with the aggregation.
> >
> >
> > Best Regards!
> > ---------------------
> >
> > Luke Han
> >
> > On Sat, Dec 12, 2015 at 3:05 AM, Sarnath <[email protected]> wrote:
> >
> > > >>>> Sorted indexes are a viable approach to OLAP storage: Druid[1]
> > > does it, and so does SAP HANA. The idea is that if you sort and
> > > compress your data it becomes very compact, so you can do very fast
> > > scans. So fast that you don’t need to pre-aggregate it.
> > >
> > > Yes, the problem (which I think you have covered below) is that you
> > > can only sort on one column of interest. You can then sort on other
> > > columns among the rows where the first column has the same value, but
> > > if you filter by the second column you still need to scan the entire
> > > table. This is very similar to the analogy in our blog (search for all
> > > English words whose second letter is 'a').
> > > And as the filtering query becomes more complex, this becomes very
> > > difficult. I believe Druid is optimized for time series analytics (how
> > > much by minute, hour, day, etc.). I am not sure about multidimensional
> > > aggregations.
> > >
> > > >>>> Elasticsearch is an index but it is not an OLAP index - their use
> > > case does not call for compressing numeric data, and they optimize for
> > > point lookups rather than scans.
> > >
> > > We use ES only to serve pre-aggregated cube data and not to index the
> > > raw data to produce OLAP cubes.
> > >
> > > >>>> The best OLAP indexes are able to combine multiple indexes. E.g.
> > > take two not-very-selective conditions and make a selective
> > > condition. The poorer ones can only use one index, so to get coverage
> > > you need to build more indexes.
> > >
> > > Can you elaborate on the not-very-selective conditions? I am a bit
> > > lost on the context.
> > >
> > > Multiple indexing is what we take advantage of. ES, by default,
> > > indexes all fields of a document. We store a multidimensional
> > > aggregation as an ES document whose fields are the various dimensions
> > > and metrics associated with the aggregation. Thus the cube can be
> > > sliced and diced on any dimension and filtered on metrics as well.
> > > And again, this indexing is completely different from indexing raw
> > > data or table data. We are dealing with data cubes here.
> > >
> > > Best,
> > > Sarnath
> > >
> >
>
