Hi Mike,
I'm traveling to the BlackHat conference, so please forgive the short responses.
In its current implementation, Drill can't identify arrays in XML.  Since we
don't know the schema in advance, when the reader hits the field int1, it has no
way of knowing whether it is a repeated field.  This is one of the reasons I
started working on the XSD reader, because that will allow Drill to read XML
arrays.  When I did the original implementation, I thought about having a
lookahead function which would attempt to see whether the next field has the
same name and, if it does, write an array.  However, there are a lot of edge
cases with that as well.  It's something that could be added, but I wanted to
get the v1 of the XML reader merged.
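
To make the lookahead idea concrete, here is a rough sketch. It is written in Python rather than Drill's Java, it is not the reader's actual code, and the function name read_children_with_lookahead is made up for illustration. It assumes that "same name as the next field" means a run of consecutive siblings with the same tag.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def read_children_with_lookahead(parent):
    """Group consecutive same-named children of `parent`.

    A run of two or more siblings with the same tag is written as a list;
    a single element stays a scalar. Hypothetical helper, not Drill code.
    """
    record = {}
    for name, run in groupby(parent, key=lambda child: child.tag):
        values = [child.text for child in run]
        record[name] = values if len(values) > 1 else values[0]
    return record


# The document from the original question, without inter-element whitespace.
doc = ET.fromstring(
    "<test1>"
    '<int1 x="2">A</int1><int1 x="7">B</int1><int1 y="3">Y</int1>'
    '<char1 y="4">C</char1>'
    "</test1>"
)
result = read_children_with_lookahead(doc)
print(result)
```

The edge cases show up quickly: a field that happens to occur once in one record and three times in the next would switch between scalar and array, and a repeat that is not adjacent (int1, char1, int1) would overwrite the earlier value in this sketch.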

Another option would have been to have the reader put some separator between
the values and use a UDF to break that field up into a list.  In any event,
that's why all those repeated values get concatenated into one field.
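
As a rough illustration of the separator idea (again Python, not Drill code; the "|" separator and both function names are assumptions made up for this sketch):

```python
SEPARATOR = "|"  # hypothetical choice; the reader uses no separator today


def concat_repeated(values, separator=SEPARATOR):
    """Join the text of repeated elements into one field, as a reader might."""
    return separator.join(values)


def split_field(field, separator=SEPARATOR):
    """UDF-style helper that breaks the joined field back into a list."""
    return field.split(separator) if field else []


current = concat_repeated(["A", "B", "Y"], separator="")  # no separator
joined = concat_repeated(["A", "B", "Y"])                 # with separator
print(current, joined, split_field(joined))
```

Without a separator the joined value cannot be split back reliably, and with one, any value that itself contains the separator would need escaping.
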
Best,
-- C



> On Aug 4, 2023, at 4:50 PM, Mike Beckerle <mbecke...@apache.org> wrote:
> 
> Consider this XML:
> 
> <test1>
>   <int1 x="2">A</int1>
>   <int1 x="7">B</int1>
>   <int1 y="3">Y</int1>
>   <char1 y="4">C</char1>
> </test1>
> 
> And this drill query:
> 
> SELECT * FROM cp.`xml/foo.xml`
> 
> I am using datalevel = 1.
> 
> The results I get (calling RowSet results.print() in my junit test) are:
> 
> #: `attributes` STRUCT<`int1_x` VARCHAR, `int1_y` VARCHAR, `char1_y` 
> VARCHAR>, `int1` VARCHAR, `char1` VARCHAR
> 0: {"27", "3", "4"}, "ABY", "C"
> 
> So questions:
> 
> First, why is it constructing 1 row, not multiple?
> 
> The only way I expect to get only 1 row out is if I did a group-by with the 
> whole row-set having only 1 key value.
> 
> Second, why is it concatenating the value strings?
> 
> I'd expect to write like: "SELECT '1' AS key, * FROM ...theTable... GROUP BY 
> key", and only then would I expect concatenation if everything is a string 
> and concat is somehow the default grouping operation. Even then it's a 
> stretch.
> 
> Here's what I expected to get out after inspecting the schema that was 
> inferred from the data:
> 
> 0: {"2", null, null}, "A", null
> 1: {"7", null, null}, "B", null
> 2: {null, "3", null}, "Y", null
> 3: {null, null, "4"}, null, "C"
> 
> Those correspond to the 3 columns "attributes", "int1", "char1", where 
> attributes is itself { int1_x, int1_y, char1_y}.
> 
> Third, how would I change my query to get out what I expect?
> 
> Lastly, what is the rationale for the name "int1_x" (also int1_y, and 
> char1_y) ?
> I expected to see two separate attributes columns: "attributes_int1" and 
> "attributes_char1" as maps with non-prefixed children named  x, y and y 
> respectively.
> 
> I guess I just don't grok the rationale for how queries work against XML.
> 
> The natural XML schema for this XML document is:
> 
> <xs:element name="test1">
>   <xs:complexType>
>     <xs:choice>
>       <xs:element name="int1">
>         <xs:complexType>
>           <xs:simpleContent>
>             <xs:extension base="xs:string">
>               <xs:attribute name="x" type="xs:int"/>
>               <xs:attribute name="y" type="xs:int"/>
>             </xs:extension>
>           </xs:simpleContent>
>         </xs:complexType>
>       </xs:element>
>       <xs:element name="char1">
>         <xs:complexType>
>           <xs:simpleContent>
>             <xs:extension base="xs:string">
>               <xs:attribute name="y" type="xs:int"/>
>             </xs:extension>
>           </xs:simpleContent>
>         </xs:complexType>
>       </xs:element>
>     </xs:choice>
>   </xs:complexType>
> </xs:element>
> 
> I need to synthesize the same TupleMetadata from this schema that the current 
> XML reader infers incrementally, so I really need to understand the 
> rationale, because I wouldn't expect this choice to be entirely flattened 
> including the attributes.
> 
> Thanks for any help
> 
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org <http://daffodil.apache.org/>
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl 
> <http://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl>
> Owl Cyber Defense | www.owlcyberdefense.com <http://www.owlcyberdefense.com/>
> 
> 
> 
