I'll add to this thread as the developer of the XML plugin for Drill.  I think 
it would be a very good idea to add XSD schema support.  I've not had time to 
really dig into that, but it would seem like writing a converter from XSD to 
Drill's TupleMetadata would be relatively straightforward.  Then we'd have to 
make sure the schema provisioning works, which actually isn't that hard.  I 
started looking into the first part, but got sidetracked.  In any event, having 
the schema from an XSD would eliminate the ambiguity in XML files.
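
To make the first part a bit more concrete, here is a rough Python sketch (not 
Drill code, and not what the plugin does today) of walking an XSD and deriving 
column names, types and modes.  The type mapping, the helper names and the 
sample XSD are all assumptions made up for illustration; a real converter would 
target TupleMetadata and handle far more of the XSD spec.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Assumed mapping from XSD built-in types (local names) to Drill-style type names.
XSD_TO_DRILL = {
    "string": "VARCHAR",
    "int": "INT",
    "integer": "BIGINT",
    "long": "BIGINT",
    "double": "FLOAT8",
    "boolean": "BIT",
    "date": "DATE",
}

# Hypothetical sample schema used only for this sketch.
SAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="owner">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="age" type="xs:int" minOccurs="0"/>
        <xs:element name="pet" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def mode_for(element):
    """Derive a column mode from minOccurs / maxOccurs (both default to 1)."""
    max_occurs = element.get("maxOccurs", "1")
    min_occurs = element.get("minOccurs", "1")
    if max_occurs == "unbounded" or int(max_occurs) > 1:
        return "REPEATED"
    return "OPTIONAL" if min_occurs == "0" else "REQUIRED"


def columns_from(parent):
    """Describe the xs:element children of parent's inline complexType/sequence."""
    columns = []
    for el in parent.findall(f"{XS}complexType/{XS}sequence/{XS}element"):
        col = {"name": el.get("name"), "mode": mode_for(el)}
        if el.find(f"{XS}complexType") is not None:
            # Inline complex type: treat it as a nested map and recurse.
            col["type"] = "MAP"
            col["children"] = columns_from(el)
        else:
            local = el.get("type", "xs:string").split(":")[-1]
            col["type"] = XSD_TO_DRILL.get(local, "VARCHAR")
        columns.append(col)
    return columns


def columns_from_xsd(xsd_text):
    """Return column descriptors for the first top-level element of the XSD."""
    root_element = ET.fromstring(xsd_text).find(f"{XS}element")
    return columns_from(root_element)


for column in columns_from_xsd(SAMPLE_XSD):
    print(column)
```

The point is that maxOccurs alone is enough to mark pet as REPEATED before any 
data is read.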

As a side benefit, this would allow for easy conversion from XML to JSON.  You 
could simply do a CTAS query on an XML file and output JSON.
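
As a rough illustration of why a schema makes that conversion unambiguous, here 
is a small Python sketch (again not Drill code).  The schema dictionary is 
hard-coded and hypothetical, standing in for whatever an XSD would provide.  
With pet declared as repeated, it comes out as an array whether a record has 
zero, one or many <pet> elements; mapping zero instances to an empty array is a 
choice made in this sketch.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical schema, as it might be derived from an XSD: column -> (type, mode).
SCHEMA = {
    "name": ("VARCHAR", "REQUIRED"),
    "age": ("INT", "OPTIONAL"),
    "pet": ("VARCHAR", "REPEATED"),
}

# Casts applied to element text for each assumed type name.
CASTS = {"VARCHAR": str, "INT": int}


def record_to_json(xml_text, schema=SCHEMA):
    """Convert one flat XML record to a JSON string, driven by the schema."""
    root = ET.fromstring(xml_text)
    record = {}
    for name, (type_name, mode) in schema.items():
        cast = CASTS[type_name]
        # Assumes every matching element has non-empty text of the right type.
        values = [cast(child.text) for child in root.findall(name)]
        if mode == "REPEATED":
            record[name] = values  # always a list, possibly empty
        else:
            record[name] = values[0] if values else None
    return json.dumps(record)


print(record_to_json("<owner><name>Ann</name><pet>dog</pet></owner>"))
print(record_to_json(
    "<owner><name>Bob</name><age>41</age><pet>dog</pet><pet>cat</pet></owner>"))
print(record_to_json("<owner><name>Cy</name></owner>"))
```

Without the schema, the first record would most naturally give a scalar pet and 
the second an array, which is exactly the conflict described further down the 
thread.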

Best,
-- C


> On Apr 6, 2022, at 5:41 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
> And if there are zero instances what happens (curiosity here)?
> 
> 
> 
> On Wed, Apr 6, 2022 at 12:28 PM Lee, David <david....@blackrock.com.invalid>
> wrote:
> 
>> Which is why using an XSD is more or less foolproof.
>> 
>> If the pet element is tagged with maxOccurs="unbounded" it implies it
>> should be saved as an array even if there is just one occurrence of <pet>
>> in your data.
>> 
>> -----Original Message-----
>> From: Ted Dunning <ted.dunn...@gmail.com>
>> Sent: Wednesday, April 6, 2022 11:48 AM
>> To: dev <dev@drill.apache.org>
>> Cc: u...@drill.apache.org
>> Subject: Re: [DISCUSS] Add schema support for the XML format
>> 
>> External Email: Use caution with links and attachments
>> 
>> 
>> That example:
>> 
>> <pet>dog</pet>
>> <pet>cat</pet>
>> 
>> 
>> can also convert to ["pet":"dog", "pet":"cat"]
>> 
>> XML is rife with problems like this.
>> 
>> As you say.
>> 
>> But worse than can be imagined unless you have been hit by these problems.
>> 
>> On Wed, Apr 6, 2022 at 11:39 AM Lee, David <david....@blackrock.com.invalid>
>> wrote:
>> 
>>> TO_JSON won't work in cases where..
>>> 
>>> One file contains: <pet>dog</pet> which converts to {"pet":"dog"}
>>> 
>>> But another file contains:
>>> <pet>dog</pet>
>>> <pet>cat</pet>
>>> which converts to: {"pet": ["dog", "cat"]}
>>> 
>>> pet as a column in Drill can't be both a varchar and an array of
>>> varchar
>>> 
>>> There are a ton of gotcha(s) when dealing with XML..
>>> numeric vs string
>>> scalar vs array
>>> 
>>> -----Original Message-----
>>> From: Lee, David
>>> Sent: Wednesday, April 6, 2022 10:54 AM
>>> To: u...@drill.apache.org; dev@drill.apache.org
>>> Subject: RE: [DISCUSS] Add schema support for the XML format
>>> 
>>> I wrote something to convert XML to JSON using an XSD schema file to
>>> solve fields, types, nested structures, etc.. It's the only real way
>>> to ensure column level data integrity.
>>> 
>>> https://urldefense.com/v3/__https://github.com/davlee1972/xml_to_json_
>>> _;!!KSjYCgUGsB4!JXBZmU6Z9rag7GO9okdk22y102IZz1gw3IThP06jk-0bTwJiGLlbm8
>>> HnWC64OWFHods$
>>> 
>>> Converts XML to valid JSON or JSONL. Requires only two files to get
>>> started: your XML file and the XSD schema file for that XML file.
>>> 
>>> -----Original Message-----
>>> From: luoc <l...@apache.org>
>>> Sent: Wednesday, April 6, 2022 5:01 AM
>>> To: u...@drill.apache.org; dev@drill.apache.org
>>> Subject: [DISCUSS] Add schema support for the XML format
>>> 
>>> Hello dear driller,
>>> 
>>> Before starting the topic, I would like to do a simple survey :
>>> 
>>> 1. Did you know that Drill already supports XML format?
>>> 
>>> 2. If yes, what is the maximum size for the XML files you normally read?
>>> 1MB, 10MB or 100MB
>>> 
>>> 3. Do you expect that reading XML will be as easy as JSON (Schema
>>> Discovery)?
>>> 
>>> Thank you for responding to those questions.
>>> 
>>> XML is different from JSON, and if we rely solely on Drill to deduce
>>> the structure of the data (its schema), the code will get very
>>> complex and fragile.
>>> 
>>> Inferring array structure and numeric ranges are two examples. So a
>>> "provided schema" or "TO_JSON" may be good medicine:
>>> 
>>> Provided Schema
>>> 
>>> We can add DTD or XML Schema (XSD) support for XML. It can build
>>> all value vectors (writers) before reading data, resolving the
>>> fields, types, and complex nested structures.
>>> 
>>> However, a definition file is actually a rule validator that allows
>>> elements to appear 0 or more times. As a result, it is not possible to
>>> know if all elements exist until the data is read.
>>> 
>>> Therefore, we should avoid creating, before the data is read, a large
>>> number of value vectors for elements that never actually appear.
>>> 
>>> We can build the top schema at the initial stage and add new value
>>> vectors as needed during the reading phase.
>>> 
>>> TO_JSON
>>> 
>>> Read and convert XML directly to JSON, using the JSON Reader for data
>>> resolution.
>>> 
>>> It makes it easier for us to query the XML data as if it were JSON,
>>> but requires reading the whole XML file into memory.
>>> 
>>> I think both can be done, so I look forward to your spirited
>>> discussion.
>>> 
>>> Thanks.
>>> 
>>> - luoc
>>> 
>>> 
>>> 
>> 
