Hey Calcites, I'm hoping for some feedback on how to best handle Druid multi-value columns in SQL.
Background: Druid has a "multi-value column" feature that lets you have string columns with more than one value per row. Details and examples of how this works are here: http://druid.io/docs/latest/querying/multi-value-dimensions.html. The short version is that filtering on values works under the rule that "rows match a filter if any value in a multi-value dimension matches your predicate". Grouping works by sort of causing an explosion into multiple result rows, similar to what Pig does when you flatten a bag. Selecting without grouping doesn't do the exploding thing; instead it gives you the array of values. These behaviors are intended to make multi-value columns work well to hold data like "tags" or "keywords" where you might want to ask questions like: "how many rows have the tag 't1'" or "count the number of distinct users for each tag". The current Calcite-based Druid SQL stuff doesn't handle this in any way that really makes sense. The biggest issue is the expression simplifier, which would incorrect simplify "tags = 't1' AND tags = 't2'" to "false". But, it's possible for a row to match that if "tags" is multi-value. Another issue is that the type is reported as a simple "varchar" and there is no indication that multiple values are possible. I'm wondering what _would_ make the most sense in the SQL framework. The simplest thing is to keep reporting it as "varchar", adjust the expression simplifying rules to be aware of the fact that some optimizations shouldn't be applied to multi-value columns, and leave it at that. The behavior wouldn't be quite what you would expect for a varchar type but it should "work" in a sense. Or we could report a different type than "varchar" and maybe do some other things differently too? Gian
