>>>> Sorted indexes are a viable approach to OLAP storage — Druid[1] does
it, and so does SAP HANA. The idea is that if you sort and compress your
data it becomes very compact, so you can do very fast scans. So fast that
you don’t need to pre-aggregate it.

Yes, the problem (which I think you have covered below) is that you can
only sort on a column of interest... And you can sort again on other
columns among all rows where the first column has the same value.... But
then, if you were to filter by second column - you will still need to scan
entire table. Very similar to the analogy in our blog.(search for all
English words whose second letter is 'a')
And, as your filtering query becomes complex, it becomes very difficult. I
believe Druid is optimized for time series analytics (how much by minute,
hour, day etc..). Not sure about multidimensional aggregations...

>>>> Elasticsearch is an index but it is not an OLAP index - their use case
does not call for compressing numeric data, and they optimize for point
lookups rather than scans.

We use ES only to serve pre-aggregated cube data and not to index the raw
data to produce OLAP cubes.

>>>>> The best OLAP indexes are able to combine multiple indexes. E.g. take
two not-very-selective conditions and make a selective condition. The
poorer ones can only use one index, so to get coverage you need to build
more indexes.

Can you elaborate on Not-so-selective condition? I am a bit lost on the
context.

Multiple indexing is what we take advantage of. ES, by default indexes on
all fields of a document. We store a multidimensional aggregation as an ES
document whose fields are the various dimensions and metrics associated
with the aggregation. Thus the cube can be sliced and diced on any
dimension and filtered on metrics as well.. And again, this indexing is
completely different from indexing on raw data or table data. We are
dealing with data cubes here.

Best,
Sarnath

Reply via email to