Hi Bohdan,

Thanks much for the explanations. First I'll explain how I plan to solve the 
projection problem given what you described. Then I'll point out three 
potential ambiguities.

We have a bit of code which parses the projection list, then checks whether a 
column from a reader is consistent with the projection. The idea is to ensure 
that the columns produced by a Scan will be valid when a Project later tries 
to use them with the given project list. And, if the Scan says it can support 
Project push-down, then the Scan is obligated to do the full check.

Doing that check is quite easy when the projection is simple: `a`. The column 
`a` can match a data column `a` of any type.

The task is a bit harder when the projection is an array reference such as 
`a[0]`. Since this can now mean either an array or a DICT with an INT key, the 
projected column can match any of the following (a rough sketch of the check 
follows this list):

* Any REPEATED type
* A LIST
* A non-REPEATED DICT with INT, BIGINT, SMALLINT or TINYINT keys (ignoring the 
UINTx types)
* A REPEATED DICT with any type of key
* A UNION (because a union might contain a repeated type)
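
Roughly, in Java, that check looks like the sketch below. The class and method 
names are mine, invented purely for illustration; only the MinorType and 
DataMode enums are real Drill types:

import java.util.EnumSet;

import org.apache.drill.common.types.TypeProtos.DataMode;
import org.apache.drill.common.types.TypeProtos.MinorType;

public class ArrayProjectionCheck {

  // Integer key types that let a[0] act as a DICT key lookup.
  private static final EnumSet<MinorType> INT_KEYS = EnumSet.of(
      MinorType.INT, MinorType.BIGINT, MinorType.SMALLINT, MinorType.TINYINT);

  /** Can a projection of the form a[0] be satisfied by a column of the given
      type and mode? (dictKeyType matters only when type is DICT.) */
  public static boolean arrayCompatible(MinorType type, DataMode mode,
      MinorType dictKeyType) {
    if (mode == DataMode.REPEATED) {
      return true;                  // any REPEATED type, incl. a REPEATED DICT
    }
    switch (type) {
      case LIST:
      case UNION:                   // a union might contain a repeated type
        return true;
      case DICT:                    // non-repeated DICT needs an integer key
        return INT_KEYS.contains(dictKeyType);
      default:
        return false;
    }
  }
}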

We can also handle a map projection, `a.b`, which matches (again, a sketch 
follows the list):

* A (possibly repeated) map
* A (possibly repeated) DICT with VARCHAR keys
* A UNION (because a union might contain a possibly-repeated map)
* A LIST (because the list can contain a union which might contain a 
possibly-repeated map)
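
A matching sketch for the map case, again with invented names:

import org.apache.drill.common.types.TypeProtos.MinorType;

public class MapProjectionCheck {

  /** Can a projection of the form a.b be satisfied by a column of the given
      type? Repetition does not matter here: the repeated and non-repeated
      forms of each type below are both acceptable. */
  public static boolean mapCompatible(MinorType type, MinorType dictKeyType) {
    switch (type) {
      case MAP:
        return true;                // possibly-repeated map
      case DICT:                    // DICT keyed by VARCHAR
        return dictKeyType == MinorType.VARCHAR;
      case UNION:                   // a union might contain a map
      case LIST:                    // a list might contain such a union
        return true;
      default:
        return false;
    }
  }
}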

Things get very complex indeed when we have multiple qualifiers such as 
`a[0][1].b`, which matches (a recursive sketch follows this list):

* A LIST that contains a repeated map
* A REPEATED LIST that contains a (possibly-repeated) map
* A DICT with an INT key that has a value of a repeated map
* A REPEATED DICT that contains an INT key that contains a MAP
* (If we had sufficient metadata) A LIST that contains a REPEATED DICT with a 
VARCHAR key.
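
In the general case the check has to walk the qualifiers in step with whatever 
type metadata we have. Here is a rough recursive sketch; TypeNode, Qual and 
everything else here is invented for illustration and glosses over plenty (MAP 
members, for instance, are not resolved at all):

import java.util.List;

import org.apache.drill.common.types.TypeProtos.DataMode;
import org.apache.drill.common.types.TypeProtos.MinorType;

public class PathProjectionCheck {

  /** Stand-in for whatever type metadata the reader can give us. */
  public static class TypeNode {
    MinorType type;
    DataMode mode;
    MinorType dictKeyType;   // meaningful only when type == DICT
    TypeNode valueType;      // DICT value type, if known; null otherwise
  }

  /** The two kinds of qualifier in a path such as a[0][1].b. */
  public enum Qual { ARRAY_INDEX, MEMBER_NAME }

  public static boolean pathCompatible(TypeNode col, List<Qual> quals) {
    if (quals.isEmpty()) {
      return true;                   // all qualifiers satisfied
    }
    // LIST and UNION can hold anything: nothing to validate until run time.
    if (col.type == MinorType.LIST || col.type == MinorType.UNION) {
      return true;
    }
    Qual q = quals.get(0);
    List<Qual> rest = quals.subList(1, quals.size());
    if (q == Qual.ARRAY_INDEX) {
      if (col.mode == DataMode.REPEATED) {
        // a[0] selects one element; the rest of the path applies to that
        // element, treated as non-repeated.
        TypeNode element = new TypeNode();
        element.type = col.type;
        element.mode = DataMode.OPTIONAL;
        element.dictKeyType = col.dictKeyType;
        element.valueType = col.valueType;
        return pathCompatible(element, rest);
      }
      if (col.type == MinorType.DICT && isIntKey(col.dictKeyType)) {
        // Key lookup yields the DICT's value type, if we know it.
        return col.valueType == null || pathCompatible(col.valueType, rest);
      }
      return false;
    } else {                         // MEMBER_NAME, i.e. a.b
      if (col.type == MinorType.MAP) {
        return true;                 // member type unknown here; defer to run time
      }
      if (col.type == MinorType.DICT && col.dictKeyType == MinorType.VARCHAR) {
        return col.valueType == null || pathCompatible(col.valueType, rest);
      }
      return false;
    }
  }

  private static boolean isIntKey(MinorType key) {
    return key == MinorType.INT || key == MinorType.BIGINT
        || key == MinorType.SMALLINT || key == MinorType.TINYINT;
  }
}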

This is one of those cases I mentioned yesterday: simple types are simple. 
Things like DICT become hugely complex as we have to look first at the key 
type, then the value type, to validate projection against a schema.


In fact, the DICT type introduces an ambiguity. Note above that `a.b` can 
refer to either a REPEATED or a non-REPEATED MAP. If non-repeated, `a.b` means 
get the one value for member `b` of map `a`. But, if the map is REPEATED, it 
means project an array of `b` values obtained from the array of maps.

For a DICT, there is an ambiguity with `a[0][1]` if the column is a REPEATED 
DICT with INT keys and REPEATED BIGINT values: that is, 
ARRAY<DICT<INT, ARRAY<BIGINT>>>. Does `a[0][1]` mean to pull out the 0th 
element of the REPEATED DICT, then look up the value where the key == 1? Or 
does it mean, for every DICT element, to look up the value where the key == 0 
and then take the 1st entry of each resulting BIGINT array? That is, because 
an array reference implies "apply to all members of the array", one can 
interpret this case as either:

repeatedDict[0].valueOf(1)  --> ARRAY<BIGINT>  -- all the values in the key=1 
array of element 0

or

repeatedDict.valueOf(0)[1]  --> ARRAY<BIGINT>  -- the value at position 1 of 
the key=0 array, taken across all DICT elements
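
To make the ambiguity concrete, suppose (values invented purely for 
illustration) that `a` holds two DICT entries:

a = [ {0: [10, 11], 1: [20, 21]},    -- element 0 of the repeated DICT
      {0: [30, 31], 1: [40, 41]} ]   -- element 1

Interpretation 1: take element 0, then look up key 1          --> [20, 21]
Interpretation 2: look up key 0 in every element, giving
                  [[10, 11], [30, 31]], then take position 1
                  of each                                     --> [11, 31]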

It would seem to make sense to prefer the first interpretation. Unfortunately, 
MAPs already use the second interpretation.

Then there are our loosey-goosey types like UNION and LIST, which become 
hyper-difficult: it is impossible to validate a projection against a type that 
can contain anything. We just have to wait until execution time to see what 
happens to appear inside the type. Even then, these cases are highly 
ambiguous:

Is `a[0].b` in a UNION a reference to:

* The 0th element of a LIST which contains a DICT with VARCHAR keys?
* The 0th element of a LIST which contains a MAP that contains column b?
* The 0th element of a REPEATED MAP that contains column b?
* The 0th element of a REPEATED DICT which contains a UNION which contains a 
MAP?
* The slice through a REPEATED LIST where a REPEATED DICT has a VARCHAR key 
of 'b'.
* And so on...

That is, since we have no restrictions on what can go into a UNION or a LIST, 
it is possible for paths to be ambiguous if the UNION contains two subtypes 
that both support the requested path. At present our only recourse is to 
observe that no one is crazy enough to set up this scenario -- in part because 
UNION barely works. But having a type system which is inherently ambiguous is 
not a Good Thing.

For now, I'm solving the UNION and LIST problem by doing nothing: these types 
don't fully work and no one uses them. So, I'll leave fixing this issue as an 
exercise for some future time.

In other engines, we have a known schema and we can just follow the projection 
path down through the types to see if the path is valid. Since the schema is 
known, the planner can do the projection planning. In Drill, things are 
backward; we generally don't know the shape of data until we've read it; then 
at some point, we have to match the projection against what we've read thus 
far. This leads to all kinds of special cases such as those we've been 
discussing.

As I think about how DICT might work, I wonder about another case: one in 
which I want to do a lateral join of a DICT with its surrounding record. Given 
that key lookup looks the same as member lookup, how do we differentiate the 
following two cases?

WHERE dict['key'] = 10 -- Filter

WHERE dict.key = parent.col -- Join between DICT keys and some other column

The answer is: we can't. We'd need different projection syntax to show that in 
the first case 'key' is a key value, while in the second case 'key' is the 
name of one of the two DICT fields.
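
As the JSON plan in the earlier exchange below shows, both column references 
already reduce to exactly the same schema path today:

dict['key']  -->  `dict`.`key`
dict.key     -->  `dict`.`key`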

And, if we change the syntax to differentiate map and DICT keys, then I've got 
to rethink and recode the projection checking logic I just discussed.

Didn't we pick up DICT from the Hive MAP type? How does Hive handle this case 
(recognizing that Hive has an up-front schema)? What path resolution rules do 
you think we should use in Drill to avoid ambiguities?

This is, by the way, a reason that I think we've gone overboard in trying to 
fit complex types into SQL: the cases get ambiguous, SQL is not expressive 
enough to handle the complex references, and as a result our code gets 
exponentially complex and costly to maintain as we try to force-fit concepts 
into SQL that it was not designed to handle.

This stuff is HARD; we're designing a complex data type and language system 
within a context that does not provide the required semantics. This is 
screaming at us: "you're doing it wrong!"

Thanks,
- Paul

 

    On Tuesday, January 21, 2020, 08:28:35 AM PST, Bohdan Kazydub 
<bohdan.kazy...@gmail.com> wrote:  
 
 Hi Paul,

the DICT projection acts as a placeholder now; the real implementation is
to be added when we completely switch to EVF. (Actually, at first I used
map projection but added dict projection to separate these two.) These two
projections are currently the same - the dict projection can be removed now
and added later if needed.

About key types other than VARCHAR and INT: they are supported, depending
on usage.

During planning, if the queried table supports conversion of its row structure
to `RelDataType` (`RelDataType
org.apache.calcite.schema.Table#getRowType(RelDataTypeFactory)`), then such a
query as `` SELECT `dict2`[123.4] FROM cp.`employee.json` `` (and likewise for
other primitive key types) will work as expected (see
`org.apache.drill.exec.hive.complex_types.TestHiveMaps`; there are many tests
projecting dict values with different key types). The `key` will be
represented as an `ArraySegment` in the case of INT, SMALLINT, TINYINT and as
a `NamedSegment` otherwise, but these `PathSegment`s contain the original key
value, which is then used when generating code for the projection.

The other 'implementation' is when we query, for example, a Parquet file:
there is currently no such conversion to `RelDataType`, and during planning
each column is considered `ANY`. If one uses the example you provided,
`` SELECT `dict2`[123.4] FROM cp.`employee.json` ``, it will fail with the
same error. But a DOUBLE key can be passed as a VARCHAR - '123.4' - and it
will be converted to the appropriate type based on `DictVector#getKeyType()`,
though this is not as efficient as the case above because the key is converted
for each row. Surely, some improvements can be made.

Hope this answers your questions; do ask again if any clarification is
needed.

On Tue, Jan 21, 2020 at 12:11 AM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi Bohdan,
>
> Thanks for your explanation.  My question comes from a little project I'm
> working on to handle projection in EVF. Queries go through two major steps:
> planning and execution. At the planning stage we use SQL syntax for the
> project list. For example:
>
> explain plan for SELECT a, e.`map`.`member`, `dict`['key'], `array`[10]
> FROM cp.`employee.json` e
>
> The planner sends an execution plan to operators. The project list appears
> in JSON. For the above:
>
>    "columns" : [ "`a`", "`map`.`member`", "`dict`.`key`", "`array`[10]" ],
>
> We see that the JSON works as you described:
>
> * The SQL map "map.member" syntax is converted to "`map`.`member`" in the
> JSON plan.
>
> * The SQL DICT "`dict`['key']" syntax is converted to a form identical to
> maps: "`dict`.`key`".
>
> * The SQL DICT/array "`array`[10]" syntax is converted to "`array`[10]" in
> JSON.
>
> That is, on the execution side, we can't tell the difference between a MAP
> and a DICT request. We also can't tell the difference between an Array and
> DICT request. Apparently, because of this, the Schema Path parser does not
> recognize DICT syntax.
>
> Given the way projection works, "a.b" and "a['b']" are identical: either
> works for both a map or a DICT with VARCHAR keys.
>
> I was confused because the "ProjectionType" and "RequestedColumn" classes
> were extended with a DICT projection type. But, as we just saw, it is
> impossible to ever use that projection type (the dict['key'] syntax) in the
> execution engine.
>
> Shall I just remove special support for DICT projection, and just say that
> map and array projection are both compatible with a DICT column?
>
> One other related question. As I recall, a DICT allows any scalar type as
> a key. We saw that VARCHAR keys are converted to map references, INT keys
> are converted to array references. But, what about DOUBLE keys (recognizing
> that such keys are a bad idea):
>
> explain plan for SELECT `dict2`[123.4] FROM cp.`employee.json`
>
> VALIDATION ERROR: From line 1, column 25 to line 1, column 38: Cannot
> apply 'ITEM' to arguments of type 'ITEM(<ANY>, <DECIMAL(4, 1)>)'
>
>
> So. We only support INT and VARCHAR keys in DICT when used with literals.
> Is this intentional?
>
> Obviously, to change this behavior, we'd have to change how columns are
> stored in JSON and we'd have to change the schema path parser. Doing so
> would impact all code that uses schema paths (including the projection
> stuff I'm working on.)
>
> Thanks,
> - Paul
>
>
>
>    On Monday, January 20, 2020, 12:02:29 AM PST, Bohdan Kazydub <
> bohdan.kazy...@gmail.com> wrote:
>
>  Hi Paul,
>
> `SELECT myMap.x ...` and `SELECT myMap['x'] ...` are treated the same in
> Drill - the schema path parser recognizes both as `myMap.x`.
> The same is true for DICT - both `myDict['key1']` and `myDict.key1` allow
> Python-like syntax for projecting DICT values, but in the schema path both
> are stored as in the MAP case -
> `myDict.key1` - as you can see, there is no distinction between MAP and DICT
> based on the schema path alone. (Note that one can't project a `key` in DICT -
> `SELECT myDict.key ...` will be treated as if the `value` identified by a key
> with value 'key' is projected, as in Java's `Map<String, Object> map = ...;
> Object value = map.get("key");`.) In the case when a key is an integer, the
> schema path is the same as in the case of an array.
>
> Is this what you meant by "schema path parser does not recognize the
> syntax" or do you get an error?
>
> On Mon, Jan 20, 2020 at 5:16 AM Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Hi All,
> >
> > What did we decide to do about projecting DICT values? Drill allows us to
> > project specific MAP members:
> >
> > SELECT myMap.x ...
> >
> > And, Drill allows projecting array members:
> >
> > SELECT myArray[3] ...
> >
> > I thought there was discussion of allowing Python-like syntax for
> > projecting DICT values:
> >
> > SELECT myDict['key1'] ...
> >
> > I tried this with no quotes, single-quotes and back-tick quotes. Seems
> > that the schema path parser does not recognize the syntax.
> >
> > Is there some trick that I'm missing?
> >
> > Thanks,
> > - Paul
> >
> >
>
  
