[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

Paul Rogers (JIRA) Wed, 19 Jul 2017 23:33:47 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094260#comment-16094260
 ]


Paul Rogers commented on DRILL-4264:
------------------------------------

On the planner side, we represent field references with the {{FieldReference}} 
class you mentioned. {{FieldReference}} extends {{SchemaPath}}. These classes 
break names down into one object per name part.

Assume we have {{SELECT a.b.c, a."d.e" ...}}

Within the {{FieldReference}} itself, we hold onto the name using a 
{{PathSegment}} which has two subclasses: {{ArraySegment}} and {{NameSegment}}. 
So, as you noted, in the planner, we can tell the difference between the two 
cases (using functional notation):

{code}
a.b.c: FieldReference(NameSegment("a", NameSegment("b", NameSegment("c"))))
a."d.e": FieldReference(NameSegment("a", NameSegment("d.e")))
{code}

So far so good. Bug, {{SchemaPath}} provides the {{getAsUnescapedPath()}} 
method which concatenates the parts of the name using dots. We end up with two 
{{FieldReference}} instances. Calling {{getAsUnescapedPath()}} on each produces 
{{a.b.c}} and {{a.d.e}}. So, if anything actually uses this unescaped path, we 
end up with an ambiguity: does "a.d.e" represent one field, two fields or three 
fields? We cannot tell.

Now, if this method was only used for debugging (line {{toString()}}), it would 
be fine. But, in fact, many operators refer to this method, especially when 
creating the run-time representation of a field schema: {{MaterializedField}}:

>From {{StreamingAggBatch}}:

{code}
      final MaterializedField outputField = MaterializedField.create(
                    ne.getRef().getAsUnescapedPath(), expr.getMajorType());
{code}

In our examples, we end up with two materialized fields: one called "a.b.c", 
the other "a.d.e", so the ambiguity persists.

As it turns out, each {{MaterializedField}} represents one field or structured 
column. So, our map "a" is represented by a {{MaterializedField}}, "b" by 
another, "c" by yet another and "d.e" by another. So, each should correspond to 
a single name part.

But, the code doesn't work that way, it actually builds up the full unescaped 
name.

Now, I suspect that the code here is old and inconsistent. It should be that 
creating a materialized field pulls out only one name part. But, the code 
actually concatenates. My suspicion increases when I see methods like these in 
{{MaterializedField}}:

{code}
  public String getPath() { return getName(); }
  public String getLastName() { return getName(); }
  public String getName() { return name; }
{code}

That first one really worries me: it is asking for the "path", which means 
dotted name. There are many references to this name. Does this mean the code 
expects to get a string (not a {{NameSegment}}) that holds the composite name. 
If so, we are in trouble.

Now, as it turns out, it seems that the "modern" form of {{MaterializedSchema}} 
is that each hold just one name part. So:

{code}
MaterializedField(name="a", children = (
  MaterializedField(name="b", children = (
    MaterializedField(name = c))),
  MaterializedField(name="d.e")))
{code}

I wonder, because the code appears to be written assuming that a 
{{MaterializedField}} had a path name, does any code still rely on this fact, 
then split the name at dots to get fields?

If not, can we remove the {{getPath()}}, and {{getLastPath()}} methods to make 
it clear that each {{MaterializedField}} corresponds to a single 
{{NameSegment}}?

And, if we do that, should we remove all calls to 
{{NameSegment.getAsUnescapedPath()}} to make clear that we never (except for 
display) want dotted, combined path name?

By carefully looking at the above issues, we can be sure that no old code in 
Drill tries to concatenate "a" and "d.e" to get the name "a.d.e" which it then 
splits into "a", "d" and "e".

A quick search for ".split(" found a number of places where we split names on a 
dot, including in the Parquet Metadata file:

{code}
        public Object deserializeKey(String key, 
com.fasterxml.jackson.databind.DeserializationContext ctxt)
            throws IOException, 
com.fasterxml.jackson.core.JsonProcessingException {
          return new Key(key.split("\\."));
        }
{code}

Are there others? Do these need to be fixed?

> Dots in identifier are not escaped correctly
> --------------------------------------------
>
>                 Key: DRILL-4264
>                 URL: https://issues.apache.org/jira/browse/DRILL-4264
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen
>            Reporter: Alex
>            Assignee: Volodymyr Vysotskyi
>
> If you have some json data like this...
> {code:javascript}
>     {
>       "0.0.1":{
>         "version":"0.0.1",
>         "date_created":"2014-03-15"
>       },
>       "0.1.2":{
>         "version":"0.1.2",
>         "date_created":"2014-05-21"
>       }
>     }
> {code}
> ... there is no way to select any of the rows since their identifiers contain 
> dots and when trying to select them, Drill throws the following error:
> Error: SYSTEM ERROR: UnsupportedOperationException: Unhandled field reference 
> "0.0.1"; a field reference identifier must not have the form of a qualified 
> name
> This must be fixed since there are many json data files containing dots in 
> some of the keys (e.g. when specifying version numbers etc)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly

Reply via email to