>>>> Sorted indexes are a viable approach to OLAP storage — Druid[1] does it, and so does SAP HANA. The idea is that if you sort and compress your data it becomes very compact, so you can do very fast scans. So fast that you don’t need to pre-aggregate it.
Yes, the problem (which I think you have covered below) is that you can only sort on a column of interest... And you can sort again on other columns among all rows where the first column has the same value.... But then, if you were to filter by second column - you will still need to scan entire table. Very similar to the analogy in our blog.(search for all English words whose second letter is 'a') And, as your filtering query becomes complex, it becomes very difficult. I believe Druid is optimized for time series analytics (how much by minute, hour, day etc..). Not sure about multidimensional aggregations... >>>> Elasticsearch is an index but it is not an OLAP index - their use case does not call for compressing numeric data, and they optimize for point lookups rather than scans. We use ES only to serve pre-aggregated cube data and not to index the raw data to produce OLAP cubes. >>>>> The best OLAP indexes are able to combine multiple indexes. E.g. take two not-very-selective conditions and make a selective condition. The poorer ones can only use one index, so to get coverage you need to build more indexes. Can you elaborate on Not-so-selective condition? I am a bit lost on the context. Multiple indexing is what we take advantage of. ES, by default indexes on all fields of a document. We store a multidimensional aggregation as an ES document whose fields are the various dimensions and metrics associated with the aggregation. Thus the cube can be sliced and diced on any dimension and filtered on metrics as well.. And again, this indexing is completely different from indexing on raw data or table data. We are dealing with data cubes here. Best, Sarnath
