[
https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099918#comment-16099918
]
Volodymyr Vysotskyi commented on DRILL-4264:
--------------------------------------------
Thanks for such a detailed analysis.
I agree that deserializing {{ColumnTypeMetadata_v3.Key}} objects this way
will cause problems for fields whose names contain dots. To
solve this issue, I propose changing the structure of the
{{ColumnTypeMetadata_v3.Key}} class: instead of an array with the
components of the field name, it should use {{SchemaPath}} and serialize it as
the string returned by {{SchemaPath.toExpr()}}. With this change, we should
also update the Parquet metadata version.
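To illustrate why an escaped string is the safer encoding, here is a minimal, self-contained sketch. It does not use Drill's classes: {{KeyEncodingSketch}} and all of its methods are hypothetical names made up for this example, and the backtick-quoted form only imitates the kind of string {{SchemaPath.toExpr()}} is expected to produce. It shows one way a dotted name becomes ambiguous once its components are joined with dots, while a quoted form survives the round trip. Names that themselves contain backticks are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Drill: compares a lossy and an escaped encoding.
final class KeyEncodingSketch {
    private KeyEncodingSketch() {}

    // Lossy encoding: segment boundaries and dots inside names look the same.
    static String joinWithDots(List<String> segments) {
        return String.join(".", segments);
    }

    static List<String> splitOnDots(String joined) {
        return Arrays.asList(joined.split("\\."));
    }

    // Escaped encoding: every segment is wrapped in backticks.
    static String toEscaped(List<String> segments) {
        StringBuilder sb = new StringBuilder();
        for (String segment : segments) {
            if (segment.indexOf('`') >= 0) {
                // Simplification of this sketch: backticks inside names are rejected.
                throw new IllegalArgumentException("backticks in names are not handled");
            }
            if (sb.length() > 0) {
                sb.append('.');
            }
            sb.append('`').append(segment).append('`');
        }
        return sb.toString();
    }

    // Splits only on dots that are outside a backtick-quoted segment.
    static List<String> fromEscaped(String expr) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (char c : expr.toCharArray()) {
            if (c == '`') {
                quoted = !quoted;
            } else if (c == '.' && !quoted) {
                segments.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        segments.add(current.toString());
        return segments;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("0.0.1", "version");
        System.out.println(splitOnDots(joinWithDots(original))); // [0, 0, 1, version]
        System.out.println(toEscaped(original));                 // `0.0.1`.`version`
        System.out.println(fromEscaped(toEscaped(original)));    // [0.0.1, version]
    }
}
```

The two-segment path comes back as four segments from the dot-joined form, but unchanged from the quoted form.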
A more complex problem is connected with the {{MaterializedField}} class.
{{SchemaPath}} was removed from the {{MaterializedField}} class in
[PR-373|https://github.com/apache/drill/pull/373]. One of the reasons for this
refactoring was the assumption that {{MaterializedField}} should have no
knowledge of its parents. However, some code in Drill still assumes that
{{MaterializedField.getPath()}} returns the field path including its parents.
For example, in [this
line|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet2/DrillParquetReader.java#L225]
a {{MaterializedField}} instance is created with the name
{{col.getAsUnescapedPath()}}, and in [this
line|https://github.com/apache/drill/blob/874bf6296dcd1a42c7cf7f097c1a6b5458010cbb/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L362]
the name including the parent field names is used. Using only the field name in
{{MaterializedField}} will cause problems, since a field at the root level may
have the same name as a field nested in a map.
So the full field path should be used in the {{MaterializedField}} class in this
case.
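The collision between a root-level field and a nested field of the same name can be shown with a small standalone sketch. {{NameCollisionSketch}} and its methods are hypothetical and do not reflect how Drill actually indexes fields; the example only demonstrates that keying by the leaf name merges two distinct fields, while keying by the full path keeps them apart.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Drill: contrasts leaf-name keys with full-path keys.
final class NameCollisionSketch {
    private NameCollisionSketch() {}

    // Keeps only the last segment of each path, as a name-only field would.
    static Set<String> leafNames(List<List<String>> paths) {
        Set<String> names = new HashSet<>();
        for (List<String> path : paths) {
            names.add(path.get(path.size() - 1));
        }
        return names;
    }

    // Keeps every segment of each path, parents included.
    static Set<List<String>> fullPaths(List<List<String>> paths) {
        return new HashSet<>(paths);
    }

    public static void main(String[] args) {
        List<List<String>> paths = Arrays.asList(
            Arrays.asList("id"),           // root-level field
            Arrays.asList("user", "id"));  // field nested in the map "user"
        System.out.println(leafNames(paths).size()); // 1: the two fields collide
        System.out.println(fullPaths(paths).size()); // 2: the two fields stay distinct
    }
}
```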
The {{SchemaPath.getSimplePath(field.getPath())}} code is used in many places,
but it does not return the same {{SchemaPath}} that was used to create the
{{MaterializedField}} instance.
We should change the implementation of {{MaterializedField}} so that
this code returns the same {{SchemaPath}} that was used to create the
{{MaterializedField}} instance.
I think we should store a separate {{String path}} field in the
{{MaterializedField}} class, holding the value of {{SchemaPath.toExpr()}}, and
replace all {{SchemaPath.getAsUnescapedPath()}} calls with
{{SchemaPath.toExpr()}}:
* when a {{MaterializedField}} instance is created from the path
{{SchemaPath.toExpr()}}, the name is set to the last name segment of the
{{SchemaPath}}.
* when a {{MaterializedField}} instance is created from the name, the path is
the same as the name wrapped in backticks.
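The two construction rules above could look roughly like the following sketch. {{FieldSketch}}, {{fromPath}} and {{fromName}} are hypothetical names chosen for illustration and are not the actual {{MaterializedField}} API; the path is assumed to be a backtick-quoted string of the kind {{SchemaPath.toExpr()}} is expected to return, and names containing backticks are not handled.

```java
// Hypothetical stand-in for the proposed MaterializedField change, not Drill code.
final class FieldSketch {
    final String name; // last segment only, unescaped
    final String path; // full escaped path, e.g. `a`.`b.c`

    private FieldSketch(String name, String path) {
        this.name = name;
        this.path = path;
    }

    // Rule 1: created from an escaped path; the name is the last segment.
    static FieldSketch fromPath(String pathExpr) {
        boolean quoted = false;
        int lastSeparator = -1;
        for (int i = 0; i < pathExpr.length(); i++) {
            char c = pathExpr.charAt(i);
            if (c == '`') {
                quoted = !quoted;
            } else if (c == '.' && !quoted) {
                lastSeparator = i; // a dot outside backticks separates segments
            }
        }
        String lastSegment = pathExpr.substring(lastSeparator + 1).replace("`", "");
        return new FieldSketch(lastSegment, pathExpr);
    }

    // Rule 2: created from a bare name; the path is the name in backticks.
    static FieldSketch fromName(String name) {
        return new FieldSketch(name, "`" + name + "`");
    }

    public static void main(String[] args) {
        System.out.println(fromPath("`a`.`b.c`").name); // b.c
        System.out.println(fromName("0.0.1").path);     // `0.0.1`
    }
}
```

With both rules in place, the stored path can be turned back into the same name it was built from, which is the property the {{SchemaPath.getSimplePath(field.getPath())}} call sites currently lack.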
A less preferred solution is to revert
[PR-373|https://github.com/apache/drill/pull/373]. In that case dots in
field names will be handled correctly, but such a solution will make the
transition to Apache Arrow more complex (though {{MaterializedField}} was
replaced by {{Flatbuffer Field}}, so the transition is already too complex).
> Dots in identifier are not escaped correctly
> --------------------------------------------
>
> Key: DRILL-4264
> URL: https://issues.apache.org/jira/browse/DRILL-4264
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Codegen
> Reporter: Alex
> Assignee: Volodymyr Vysotskyi
>
> If you have some json data like this...
> {code:javascript}
> {
> "0.0.1":{
> "version":"0.0.1",
> "date_created":"2014-03-15"
> },
> "0.1.2":{
> "version":"0.1.2",
> "date_created":"2014-05-21"
> }
> }
> {code}
> ... there is no way to select any of the rows since their identifiers contain
> dots and when trying to select them, Drill throws the following error:
> Error: SYSTEM ERROR: UnsupportedOperationException: Unhandled field reference
> "0.0.1"; a field reference identifier must not have the form of a qualified
> name
> This must be fixed since there are many json data files containing dots in
> some of the keys (e.g. when specifying version numbers etc)