[
https://issues.apache.org/jira/browse/DRILL-7954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370696#comment-17370696
]
Charles Givre edited comment on DRILL-7954 at 6/28/21, 4:19 PM:
----------------------------------------------------------------
[~benj641] Thanks for the JIRA. As the author of the XML plugin, let me explain
a bit, as this was an issue I encountered when I was developing the plugin.
If you take a look at the docs [1], you'll see a section on known limitations
at the bottom, with a bullet called "List Support". This issue is actually
describing that limitation.
Why is it a limitation?
The issue is that Drill doesn't know the schema before we start reading the
data. The secondary issue is that XML is ambiguous by nature.
Consider the data below:
{{<row>}}
{{ <field1>}}
{{ <foo>value1</foo>}}
{{ </field1>}}
{{</row>}}
{{<row>}}
{{ <field1>}}
{{ <foo>value2</foo>}}
{{ <foo>value3</foo>}}
{{ </field1>}}
{{</row>}}
In this case, Drill first sees the field foo in the first row, interprets it as
a string, creates a memory vector, and all is well. By the second row, Drill has
already established a memory vector for column foo that holds single strings.
What we should have there is a list, but Drill writes the data into the string
vector anyway. The issue is that when Drill first sees a column called foo, it
has no way of knowing that later entries should be lists because, to quote
[~paul-rogers], "Drill cannot predict the future".
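To make the failure mode concrete, here is a minimal Python sketch of a reader that commits to a scalar string column the first time it sees a field. This is a toy model and not Drill's reader code; the function name `read_scalar_only` and the `ROWS` sample are made up for illustration, and the concatenation behaviour is assumed from the output shown in the issue description.

```python
import xml.etree.ElementTree as ET

# The two rows from the example above, one XML fragment per row.
ROWS = [
    "<row><field1><foo>value1</foo></field1></row>",
    "<row><field1><foo>value2</foo><foo>value3</foo></field1></row>",
]


def read_scalar_only(rows):
    """Toy reader (hypothetical): every leaf element is a scalar string column.

    With no schema up front, a repeated element in one row has nowhere to go
    except the same string slot, so its values end up concatenated.
    """
    records = []
    for xml_text in rows:
        record = {}
        for elem in ET.fromstring(xml_text).iter():
            if len(elem) == 0:  # leaf element: it carries the text value
                record[elem.tag] = record.get(elem.tag, "") + (elem.text or "")
        records.append(record)
    return records


result = read_scalar_only(ROWS)
print(result)
```

The second record comes back with `foo` holding the two values run together, which is the same kind of concatenation reported in this issue.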
There are a few possible solutions:
# Use an XSD as the schema: This is the best way of handling this case.
Since XML documents frequently reference a schema in the form of an XSD link at
the top, one option would be to have Drill automatically fetch the XSD
document (and ideally cache it), use it to build the schema, and then parse
the data accordingly.
# Provide a schema file: The next-best approach would be to create a schema
file and use this as a provided schema file for the data. This functionality
**should** be available in Drill, although I'm not sure that the XML plugin can
read the provided schema.
# Add the ability to interpret lists on the fly: This is arguably the most
complicated option, and there are a lot of edge cases. The fundamental
problem is that XML is ambiguous.
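As a rough illustration of why a schema known up front (options 1 and 2) removes the ambiguity, here is a small Python sketch. It is not Drill code: the `SCHEMA` dictionary is a made-up stand-in for whatever an XSD or a provided schema file would supply, and `read_with_schema` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

ROWS = [
    "<row><field1><foo>value1</foo></field1></row>",
    "<row><field1><foo>value2</foo><foo>value3</foo></field1></row>",
]

# Hypothetical schema: column name -> "scalar" or "list".
# This is an illustration only, not a Drill schema format.
SCHEMA = {"foo": "list"}


def read_with_schema(rows, schema):
    """Toy reader that knows before reading which columns are repeated."""
    records = []
    for xml_text in rows:
        # List columns exist from the start, even if a row has only one value.
        record = {name: [] for name, kind in schema.items() if kind == "list"}
        for elem in ET.fromstring(xml_text).iter():
            if len(elem) > 0:  # skip container elements
                continue
            text = elem.text or ""
            if schema.get(elem.tag) == "list":
                record[elem.tag].append(text)
            else:
                record[elem.tag] = record.get(elem.tag, "") + text
        records.append(record)
    return records


result = read_with_schema(ROWS, SCHEMA)
print(result)
```

Because the reader is told that `foo` is a list before it sees the first row, the single value in row one is stored as a one-element list and row two needs no type change.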
[1]: [https://github.com/apache/drill/tree/master/contrib/format-xml]
> XML ability to not concatenate fields and attribute - change presentation of
> data
> ---------------------------------------------------------------------------------
>
> Key: DRILL-7954
> URL: https://issues.apache.org/jira/browse/DRILL-7954
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.19.0
> Reporter: benj
> Priority: Major
>
> With an XML file containing this data:
> {noformat}
> <a>
>   <attr>
>     <set num="0" val="1">x</set>
>     <set num="1" val="2">y</set>
>   </attr>
>   <attr>
>     <set num="2" val="a">z</set>
>     <set num="3" val="b">a</set>
>   </attr>
> </a>
> {noformat}
> {noformat}
> apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml',
> dataLevel=>1)) as x;
> +-----------------------------------------------+----------------+
> | attributes                                    | attr           |
> +-----------------------------------------------+----------------+
> | {"attr_set_num":"0123","attr_set_val":"12ab"} | {"set":"xyza"} |
> +-----------------------------------------------+----------------+
> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>2))
> as x;
> +---------------------------------+-----+
> | attributes                      | set |
> +---------------------------------+-----+
> | {"set_num":"01","set_val":"12"} | xy  |
> | {"set_num":"23","set_val":"ab"} | za  |
> +---------------------------------+-----+
> apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml',
> dataLevel=>3)) as x;
> +------------+
> | attributes |
> +------------+
> | {}         |
> | {}         |
> | {}         |
> | {}         |
> +------------+
> {noformat}
> Attributes and fields with the same name are concatenated and remain
> unusable _(maybe the possibility of adding a separator would help, but that's
> not the point here)_.
> In fact, what we really need is the ability to obtain something like this
> _(depending on the defined level)_:
> {noformat}
> +---------------------------------------------------------------------------------------------------+
> | attr                                                                                              |
> +---------------------------------------------------------------------------------------------------+
> | [{"set":"x","_attributes":{"num":"0","val":"1"}},{"set":"y","_attributes":{"num":"1","val":"2"}}] |
> | [{"set":"z","_attributes":{"num":"2","val":"a"}},{"set":"a","_attributes":{"num":"3","val":"b"}}] |
> +---------------------------------------------------------------------------------------------------+
> +------------------------------------------------+
> | set |
> +------------------------------------------------+
> | {"set":"x","_attributes":{"num":"0","val":"1"}} |
> | {"set":"y","_attributes":{"num":"1","val":"2"}} |
> | {"set":"z","_attributes":{"num":"2","val":"a"}} |
> | {"set":"a","_attributes":{"num":"3","val":"b"}} |
> +------------------------------------------------+
> {noformat}
> _attributes fields could be generated at each level instead of being
> generated with the path from the top level. That would make it possible to
> work with the data at each level without losing information.
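The output shape requested above can be sketched in a few lines of Python, purely to pin down the mapping from elements to records. This is an illustration of the proposed presentation, not a description of how the Drill XML plugin works or would implement it; `element_to_record` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

DOC = """<a>
  <attr>
    <set num="0" val="1">x</set>
    <set num="1" val="2">y</set>
  </attr>
  <attr>
    <set num="2" val="a">z</set>
    <set num="3" val="b">a</set>
  </attr>
</a>"""


def element_to_record(elem):
    """Keep an element's text and its own attributes together (hypothetical)."""
    record = {elem.tag: (elem.text or "").strip()}
    if elem.attrib:
        # Attributes stay attached to the element they belong to, instead of
        # being flattened into a path-named field at the top level.
        record["_attributes"] = dict(elem.attrib)
    return record


root = ET.fromstring(DOC)

# One row per <attr>, each holding a list of <set> records.
per_attr = [
    [element_to_record(s) for s in attr.findall("set")]
    for attr in root.findall("attr")
]

# One row per <set>.
per_set = [element_to_record(s) for s in root.iter("set")]

print(per_attr)
print(per_set)
```

The two results correspond to the two tables in the issue description: a list of records per `attr` row, and one record per `set` row.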
--
This message was sent by Atlassian Jira
(v8.3.4#803005)