Consider this XML:
<test1>
  <int1 x="2">A</int1>
  <int1 x="7">B</int1>
  <int1 y="3">Y</int1>
  <char1 y="4">C</char1>
</test1>
And this Drill query:
SELECT * FROM cp.`xml/foo.xml`
I am using dataLevel = 1.
The results I get (from calling results.print() on the RowSet in my JUnit test) are:
#: `attributes` STRUCT<`int1_x` VARCHAR, `int1_y` VARCHAR, `char1_y` VARCHAR>, `int1` VARCHAR, `char1` VARCHAR
0: {"27", "3", "4"}, "ABY", "C"
So, my questions:
First, why is it constructing one row rather than several?
I would expect a single row only if I had done a GROUP BY over a row set
in which every row has the same key value.
Second, why is it concatenating the value strings?
I'd expect to have to write something like "SELECT '1' AS key, * FROM
...theTable... GROUP BY key", and only then would I expect concatenation,
and only if everything is a string and concat is somehow the default
grouping operation. Even then it's a stretch.
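To make the observed behavior concrete, here is a minimal Python sketch of one model that would produce exactly that single row: every child of the root writes into the same row, and a repeated column name appends to the existing string. This is only a guess that happens to match the output, not the XML reader's actual code; the function name merge_into_one_row is made up for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<test1>
<int1 x="2">A</int1>
<int1 x="7">B</int1>
<int1 y="3">Y</int1>
<char1 y="4">C</char1>
</test1>"""


def merge_into_one_row(xml_text):
    """Hypothetical model: all children of the root share one output row."""
    root = ET.fromstring(xml_text)
    attributes = {}
    row = {"attributes": attributes}
    for child in root:
        # A repeated element name appends its text to the same column.
        row[child.tag] = row.get(child.tag, "") + (child.text or "")
        for name, value in child.attrib.items():
            # Attribute columns are named <element>_<attribute>.
            key = f"{child.tag}_{name}"
            attributes[key] = attributes.get(key, "") + value
    return row


if __name__ == "__main__":
    print(merge_into_one_row(XML))
```

Under this model int1_x comes out as "27" and int1 as "ABY", which is the row shown above.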
Here's what I expected to get out after inspecting the schema that was
inferred from the data:
0: {"2", null, null}, "A", null
1: {"7", null, null}, "B", null
2: {null, "3", null}, "Y", null
3: {null, null, "4"}, null, "C"
Those correspond to the three columns "attributes", "int1", and "char1",
where attributes is itself {int1_x, int1_y, char1_y}.
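The expectation above can be written down as a short Python sketch: one output row per child element, using the same inferred columns, with null (None) for every column a given child does not populate. The function name one_row_per_child is hypothetical, and this describes only the shape I expected, not anything Drill does.

```python
import xml.etree.ElementTree as ET

XML = """<test1>
<int1 x="2">A</int1>
<int1 x="7">B</int1>
<int1 y="3">Y</int1>
<char1 y="4">C</char1>
</test1>"""


def one_row_per_child(xml_text):
    """Emit one row per child of the root, with None for unused columns."""
    children = list(ET.fromstring(xml_text))

    # First pass: collect column names in order of first appearance.
    tag_cols, attr_cols = [], []
    for child in children:
        if child.tag not in tag_cols:
            tag_cols.append(child.tag)
        for name in child.attrib:
            key = f"{child.tag}_{name}"
            if key not in attr_cols:
                attr_cols.append(key)

    # Second pass: each child fills only its own columns.
    rows = []
    for child in children:
        attributes = {col: None for col in attr_cols}
        for name, value in child.attrib.items():
            attributes[f"{child.tag}_{name}"] = value
        row = {"attributes": attributes}
        for tag in tag_cols:
            row[tag] = child.text if child.tag == tag else None
        rows.append(row)
    return rows


if __name__ == "__main__":
    for i, row in enumerate(one_row_per_child(XML)):
        print(i, row)
```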
Third, how would I change my query to get out what I expect?
Lastly, what is the rationale for the names "int1_x", "int1_y", and
"char1_y"?
I expected to see two separate attributes columns, "attributes_int1" and
"attributes_char1", as maps with non-prefixed children named x and y for
int1, and y for char1.
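As an illustration of that alternative layout, the following Python sketch builds one row per child with a separate attributes map per element name and unprefixed keys inside each map. The column names attributes_int1 and attributes_char1 are the ones I expected, not names Drill is known to produce, and nested_attribute_rows is a made-up helper.

```python
import xml.etree.ElementTree as ET

XML = """<test1>
<int1 x="2">A</int1>
<int1 x="7">B</int1>
<int1 y="3">Y</int1>
<char1 y="4">C</char1>
</test1>"""


def nested_attribute_rows(xml_text):
    """One row per child; attributes grouped in a map per element name."""
    root = ET.fromstring(xml_text)

    # Collect, per element name, the attribute names seen on it.
    attr_names = {}
    for child in root:
        names = attr_names.setdefault(child.tag, [])
        for name in child.attrib:
            if name not in names:
                names.append(name)

    rows = []
    for child in root:
        row = {}
        for tag, names in attr_names.items():
            match = child.tag == tag
            # Keys inside the map are the bare attribute names (x, y).
            row[f"attributes_{tag}"] = (
                {n: child.attrib.get(n) for n in names} if match else None
            )
            row[tag] = child.text if match else None
        rows.append(row)
    return rows


if __name__ == "__main__":
    for i, row in enumerate(nested_attribute_rows(XML)):
        print(i, row)
```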
I guess I just don't grok the rationale for how queries work against XML.
The natural XML schema for this XML document is:
<xs:element name="test1">
  <xs:complexType>
    <xs:choice maxOccurs="unbounded">
      <xs:element name="int1">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:string">
              <xs:attribute name="x" type="xs:int"/>
              <xs:attribute name="y" type="xs:int"/>
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
      <xs:element name="char1">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:string">
              <xs:attribute name="y" type="xs:int"/>
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
    </xs:choice>
  </xs:complexType>
</xs:element>
I need to synthesize, from this schema, the same TupleMetadata that the
current XML reader infers incrementally, so I really need to understand the
rationale. I wouldn't expect this choice to be flattened entirely,
attributes included.
Thanks for any help.
Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com