Drill representation of XML Complex Type with Simple Content

Mike Beckerle Fri, 04 Aug 2023 13:50:49 -0700

Consider this XML:

<test1>
  <int1 x="2">A</int1>
  <int1 x="7">B</int1>
  <int1 y="3">Y</int1>
  <char1 y="4">C</char1>
</test1>


And this Drill query:

SELECT * FROM cp.`xml/foo.xml`

I am using dataLevel = 1.

The results I get (calling RowSet results.print() in my JUnit test) are:

#: `attributes` STRUCT<`int1_x` VARCHAR, `int1_y` VARCHAR, `char1_y` VARCHAR>, `int1` VARCHAR, `char1` VARCHAR
0: {"27", "3", "4"}, "ABY", "C"

So questions:

First, why is it constructing 1 row instead of several?

I would expect to get a single row only if I had done a GROUP BY in which
the whole row set shares one key value.

Second, why is it concatenating the value strings?

I'd expect to have to write something like "SELECT '1' AS key, * FROM
...theTable... GROUP BY key", and only then would I expect concatenation,
assuming everything is a string and concat is somehow the default grouping
operation. Even then it's a stretch.

Here's what I expected to get out after inspecting the schema that was
inferred from the data:

0: {"2", null, null}, "A", null
1: {"7", null, null}, "B", null
2: {null, "3", null}, "Y", null
3: {null, null, "4"}, null, "C"

Those correspond to the 3 columns "attributes", "int1", "char1", where
attributes is itself { int1_x, int1_y, char1_y}.
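
To make the expected result concrete, here is a minimal Python sketch
(standard library only, not Drill code) that produces the four rows listed
above from the same XML. The helper name expected_rows is hypothetical, and
the column names are simply copied from the schema Drill printed; the sketch
assumes one output row per child element of test1.

```python
import xml.etree.ElementTree as ET

XML = """<test1>
  <int1 x="2">A</int1>
  <int1 x="7">B</int1>
  <int1 y="3">Y</int1>
  <char1 y="4">C</char1>
</test1>"""

# Column layout copied from the schema that Drill reported for this file.
ATTR_COLS = ["int1_x", "int1_y", "char1_y"]
VALUE_COLS = ["int1", "char1"]


def expected_rows(xml_text):
    """Hypothetical helper: build one row per child element of the root."""
    rows = []
    for child in ET.fromstring(xml_text):
        # Start with every column null, then fill in what this child carries.
        attrs = {col: None for col in ATTR_COLS}
        for name, value in child.attrib.items():
            attrs[f"{child.tag}_{name}"] = value
        values = {col: None for col in VALUE_COLS}
        values[child.tag] = child.text
        rows.append({"attributes": attrs, **values})
    return rows


for i, row in enumerate(expected_rows(XML)):
    print(i, row)
```

Each child element becomes its own row, and nothing is concatenated because
no two children ever write into the same row.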

Third, how would I change my query to get out what I expect?

Lastly, what is the rationale for the name "int1_x" (also "int1_y" and
"char1_y")?
I expected to see two separate attribute columns, "attributes_int1" and
"attributes_char1", as maps with non-prefixed children named x and y for
int1, and y for char1.
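
As an illustration of that alternative layout, here is a minimal Python
sketch (standard library only, not Drill code). It assumes one row per
child element of test1 and one attribute map per element name; the function
name rows_with_per_element_maps is hypothetical, and the column names
"attributes_int1" and "attributes_char1" are the ones proposed above, not
anything Drill produces today.

```python
import xml.etree.ElementTree as ET

XML = """<test1>
  <int1 x="2">A</int1>
  <int1 x="7">B</int1>
  <int1 y="3">Y</int1>
  <char1 y="4">C</char1>
</test1>"""


def rows_with_per_element_maps(xml_text):
    """Hypothetical helper: one attribute map per element name."""
    root = ET.fromstring(xml_text)

    # First pass: collect the attribute names seen for each element name,
    # preserving first-seen order.
    attr_names = {}
    for child in root:
        names = attr_names.setdefault(child.tag, [])
        for name in child.attrib:
            if name not in names:
                names.append(name)

    # Second pass: emit one row per child, with nulls for everything the
    # child does not carry.
    rows = []
    for child in root:
        row = {}
        for tag, names in attr_names.items():
            row[f"attributes_{tag}"] = {name: None for name in names}
            row[tag] = None
        row[f"attributes_{child.tag}"].update(child.attrib)
        row[child.tag] = child.text
        rows.append(row)
    return rows


for i, row in enumerate(rows_with_per_element_maps(XML)):
    print(i, row)
```

With this layout the attribute names stay unprefixed, because the map name
already says which element they belong to.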

I guess I just don't grok the rationale for how queries work against XML.

The natural XML schema for this XML document is:

<xs:element name="test1">
  <xs:complexType>
    <xs:choice maxOccurs="unbounded">
      <xs:element name="int1">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:string">
              <xs:attribute name="x" type="xs:int"/>
              <xs:attribute name="y" type="xs:int"/>
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
      <xs:element name="char1">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:string">
              <xs:attribute name="y" type="xs:int"/>
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
    </xs:choice>
  </xs:complexType>
</xs:element>

I need to synthesize the same TupleMetadata from this schema that the
current XML reader infers incrementally, so I really need to understand the
rationale, because I wouldn't expect this choice to be entirely flattened
including the attributes.

Thanks for any help

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com
