Druid multi-value columns + Calcite

Gian Merlino Tue, 14 Feb 2017 17:09:37 -0800

Hey Calcites,

I'm hoping for some feedback on how to best handle Druid multi-value
columns in SQL.


Background: Druid has a "multi-value column" feature that lets you have
string columns with more than one value per row. Details and examples of
how this works are here:
http://druid.io/docs/latest/querying/multi-value-dimensions.html. The short
version is that filtering on values works under the rule that "rows match a
filter if any value in a multi-value dimension matches your predicate".
Grouping works by sort of causing an explosion into multiple result rows,
similar to what Pig does when you flatten a bag. Selecting without grouping
doesn't do the exploding thing; instead it gives you the array of values.

These behaviors are intended to make multi-value columns work well to hold
data like "tags" or "keywords" where you might want to ask questions like:
"how many rows have the tag 't1'" or "count the number of distinct users
for each tag".

The current Calcite-based Druid SQL stuff doesn't handle this in any way
that really makes sense. The biggest issue is the expression simplifier,
which would incorrect simplify "tags = 't1' AND tags = 't2'" to "false".
But, it's possible for a row to match that if "tags" is multi-value.
Another issue is that the type is reported as a simple "varchar" and there
is no indication that multiple values are possible.

I'm wondering what _would_ make the most sense in the SQL framework. The
simplest thing is to keep reporting it as "varchar", adjust the expression
simplifying rules to be aware of the fact that some optimizations shouldn't
be applied to multi-value columns, and leave it at that. The behavior
wouldn't be quite what you would expect for a varchar type but it should
"work" in a sense.

Or we could report a different type than "varchar" and maybe do some other
things differently too?

Gian

Druid multi-value columns + Calcite

Reply via email to