[
https://issues.apache.org/jira/browse/DRILL-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391657#comment-17391657
]
ASF GitHub Bot commented on DRILL-7979:
---------------------------------------
cgivre commented on pull request #2283:
URL: https://github.com/apache/drill/pull/2283#issuecomment-891119762
> I started out adding specific, implementation-level comments but I've
paused that to back off and ask: is this really a _self-closing tag_ thing, or
is the situation the same for _any empty element_ that also occurs as a parent
element? In my tests on `master`, the problem is the same for either of the
following, which I believe are also equivalent in the XML spec.
>
> ```
> <!-- self-closing -->
> <foo/>
>
> <!-- just empty -->
> <foo></foo>
> ```
>
> If I've got the right end of the stick here then I suggest that we adjust all
the naming to refer to the "empty element" case, rather than the "self-closing"
case.
>
> Next, following on from our comments on Jira and the idea of using maps
for this case, what do you think of the following approach?
>
> 1. When our first encounter with an element `foo` is empty, and therefore
ambiguous in terms of type, we default to the non-leaf case and make it a map.
> 2. For subsequent parent `foo` elements we return populated maps. For
subsequent empty `foo` elements we return empty maps.
> 3. For subsequent leaf elements `<foo>bar</foo>`, which we would normally
map to varchar but where we find that we've already got a map from step 1, we
put the element value into the map under a hardcoded special key, e.g. `{
'__value__': 'bar' }`.
>
> The above will also work when the first element encountered is empty but
has attributes, e.g. `<foo a='b' />`, a case in which the element-discarding
logic in the present patch does not discard the element. If you're not crazy
about this it's no problem and I've probably got a couple more specific
remarks to add on the implementation.
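To make the three quoted steps concrete, here is a minimal sketch in Python (not Drill code). The function name `read_field`, the flat `<rows><row>` layout, and reading child values as plain strings are assumptions made for illustration only; the `__value__` key is the one named in the proposal. Attributes are left out for brevity.

```python
import xml.etree.ElementTree as ET

VALUE_KEY = "__value__"  # special key name taken from the quoted proposal


def read_field(rows_xml, field):
    """Read `field` from each row, fixing its type on first sight.

    Hypothetical helper: an empty first occurrence defaults to a map (step 1),
    and later leaf values are wrapped under VALUE_KEY (step 3).
    """
    kind = None  # "map" or "scalar", decided by the first occurrence
    out = []
    for row in ET.fromstring(rows_xml):
        elem = row.find(field)
        if elem is None:
            out.append(None)
            continue
        has_children = len(elem) > 0
        text = (elem.text or "").strip()
        if kind is None:
            # Step 1: an empty (ambiguous) first element becomes a map.
            kind = "scalar" if (text and not has_children) else "map"
        if kind == "scalar":
            # A later map under a scalar-typed field is out of scope here.
            out.append(text)
            continue
        # Step 2: parents give populated maps, empty elements give {}.
        value = {child.tag: (child.text or "").strip() for child in elem}
        # Step 3: a leaf value goes into the map under the special key.
        if text and not has_children:
            value[VALUE_KEY] = text
        out.append(value)
    return out


ROWS = (
    "<rows>"
    "<row><foo/></row>"
    "<row><foo><a>1</a></foo></row>"
    "<row><foo>bar</foo></row>"
    "<row><foo></foo></row>"
    "</rows>"
)
```

With this input, `read_field(ROWS, "foo")` yields a map for every row, so the type of `foo` never changes after the first, empty occurrence.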
@dzamo Thanks for the response. The real issue is that we don't know the
schema as we're scanning the file, so we have to do the best we can. With
empty fields (self-closing or otherwise), we don't really know what they are
until we see real data. For instance, if we decide to make them an empty map,
we'll get an error if the next record shows up as a scalar. The current
approach treats empty fields as scalars, which then causes issues if we
encounter a map in the next row.
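The trade-off can be shown with a small Python sketch (an illustration only, not how Drill's reader is written). The names `infer_kind` and `first_conflict` and the sample documents are hypothetical. Whichever default is picked for an empty element, one of the two inputs hits a type conflict on the following row.

```python
import xml.etree.ElementTree as ET


def infer_kind(elem):
    """Classify an element using only what is visible in this one row."""
    if len(elem) > 0:
        return "map"      # has child elements
    if elem.text and elem.text.strip():
        return "scalar"   # has text content only
    return "unknown"      # <foo/> or <foo></foo>: no evidence either way


def first_conflict(xml_text, field, empty_default):
    """Return the index of the first row whose kind for `field` contradicts
    the kind fixed by an earlier row, or None if there is no conflict."""
    fixed = None
    for i, row in enumerate(ET.fromstring(xml_text)):
        elem = row.find(field)
        if elem is None:
            continue
        kind = infer_kind(elem)
        if kind == "unknown":
            kind = empty_default  # the guess a schema-less reader must make
        if fixed is None:
            fixed = kind
        elif kind != fixed:
            return i
    return None


MAP_LATER = "<rows><row><foo/></row><row><foo><a>1</a></foo></row></rows>"
SCALAR_LATER = "<rows><row><foo></foo></row><row><foo>bar</foo></row></rows>"
```

Here `first_conflict(MAP_LATER, "foo", "scalar")` and `first_conflict(SCALAR_LATER, "foo", "map")` both report a conflict at row index 1, while the opposite defaults pass cleanly.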
You asked in another comment about perhaps treating all empty elements in
the same manner. There was a specific challenge with how self-closing tags
are handled, which is why I made this PR. I'm also working on another project
to get the XML reader to download a provided schema (the XSD link), which
would solve a lot of issues with reading XML.
> Self-Closing XML Tags Cause Schema Change Exceptions
> ----------------------------------------------------
>
> Key: DRILL-7979
> URL: https://issues.apache.org/jira/browse/DRILL-7979
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Other
> Affects Versions: 1.19.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.20.0
>
>
> Self-closing XML tags are dealt with strangely by Java's streaming parser.
> If you have data where one row contains a self-closing XML tag `foo`
> (`<foo/>`) but in the next row `foo` contains a map or other nested field,
> Drill will throw a schema change exception.
> This proposed fix causes Drill to ignore self-closing tags unless they have
> attributes, which allows data like this to be successfully queried.
>