Which is why using an XSD is more or less foolproof.

If the pet element is tagged with maxOccurs="unbounded" it implies it should be 
saved as an array even if there is just one occurrence of <pet> in your data.
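
To illustrate the maxOccurs point, here is a minimal Python sketch, not the converter discussed in this thread. The <owner> and <name> elements, the inline schema, and the helper names repeated_names and to_record are all hypothetical; only <pet> comes from the example. It reads maxOccurs from the XSD and always emits a list for elements that are allowed to repeat.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema: an <owner> holds one <name> and any number of <pet>.
XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="owner">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="pet" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def repeated_names(xsd_text):
    """Return the names of elements the schema allows to occur more than once."""
    names = set()
    for el in ET.fromstring(xsd_text).iter(XS + "element"):
        max_occurs = el.get("maxOccurs", "1")  # the XSD default is 1
        if max_occurs == "unbounded" or int(max_occurs) > 1:
            names.add(el.get("name"))
    return names

def to_record(xml_text, array_names):
    """Flatten the root's children into a dict; schema-repeatable tags are always lists."""
    record = {}
    for child in ET.fromstring(xml_text):
        if child.tag in array_names:
            record.setdefault(child.tag, []).append(child.text)
        else:
            record[child.tag] = child.text
    return record

arrays = repeated_names(XSD)
one = to_record("<owner><name>Ann</name><pet>dog</pet></owner>", arrays)
two = to_record("<owner><name>Bob</name><pet>dog</pet><pet>cat</pet></owner>", arrays)
print(one)  # pet is a one-element list
print(two)  # pet is a two-element list
```

Because the decision comes from the schema rather than from the data, pet has the same shape in both records, whether a file holds one occurrence or several.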

-----Original Message-----
From: Ted Dunning <ted.dunn...@gmail.com> 
Sent: Wednesday, April 6, 2022 11:48 AM
To: dev <dev@drill.apache.org>
Cc: u...@drill.apache.org
Subject: Re: [DISCUSS] Add schema support for the XML format

That example:

<pet>dog</pet>
<pet>cat</pet>

can also convert to ["pet":"dog", "pet":"cat"]

XML is rife with problems like this.

As you say.

But worse than can be imagined unless you have been hit by these problems.
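
The ambiguity in the example above can be reproduced with a short Python sketch. It is only an illustration, not how Drill or TO_JSON behaves. The <root> wrapper and the helper names naive_convert and as_rows are hypothetical, and reading the alternative form as a list of single-field records is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def naive_convert(xml_text):
    """Schema-less conversion: a tag becomes a list only when it repeats."""
    grouped = defaultdict(list)
    for child in ET.fromstring(xml_text):
        grouped[child.tag].append(child.text)
    return {tag: vals[0] if len(vals) == 1 else vals
            for tag, vals in grouped.items()}

def as_rows(xml_text):
    """The other reading (an assumption): one single-field record per element."""
    return [{child.tag: child.text} for child in ET.fromstring(xml_text)]

# <root> is a hypothetical wrapper so that each fragment is well-formed XML.
file_a = "<root><pet>dog</pet></root>"
file_b = "<root><pet>dog</pet><pet>cat</pet></root>"

print(naive_convert(file_a))  # pet is a plain string
print(naive_convert(file_b))  # pet is a list of strings
print(as_rows(file_b))        # a list of records instead
```

Without a schema, the same element name ends up with three different shapes depending on the file and on the conversion convention chosen.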

On Wed, Apr 6, 2022 at 11:39 AM Lee, David <david....@blackrock.com.invalid>
wrote:

> TO_JSON won't work in cases where:
>
> One file contains: <pet>dog</pet> which converts to {"pet":"dog"}
>
> But another file contains:
> <pet>dog</pet>
> <pet>cat</pet>
> which converts to: {"pet": ["dog", "cat"]}
>
> pet as a column in Drill can't be both a varchar and an array of 
> varchar
>
> There are a ton of gotchas when dealing with XML:
> numeric vs string
> scalar vs array
>
> -----Original Message-----
> From: Lee, David
> Sent: Wednesday, April 6, 2022 10:54 AM
> To: u...@drill.apache.org; dev@drill.apache.org
> Subject: RE: [DISCUSS] Add schema support for the XML format
>
> I wrote something to convert XML to JSON using an XSD schema file to 
> resolve fields, types, nested structures, etc. It's the only real way 
> to ensure column-level data integrity.
>
> https://github.com/davlee1972/xml_to_json
>
> Converts XML to valid JSON or JSONL. Requires only two files to get 
> started: your XML file and the XSD schema file for that XML file.
>
> -----Original Message-----
> From: luoc <l...@apache.org>
> Sent: Wednesday, April 6, 2022 5:01 AM
> To: u...@drill.apache.org; dev@drill.apache.org
> Subject: [DISCUSS] Add schema support for the XML format
>
> Hello dear driller,
>
> Before starting the topic, I would like to do a simple survey:
>
> 1. Did you know that Drill already supports XML format?
>
> 2. If yes, what is the maximum size of the XML files you normally read:
> 1MB, 10MB or 100MB?
>
> 3. Do you expect that reading XML will be as easy as JSON (Schema 
> Discovery)?
>
> Thank you for responding to those questions.
>
> XML is different from JSON, and if we rely solely on Drill itself to 
> deduce the structure of the data (that is, the schema), the code will 
> get very complex and fragile.
>
> For example, it would have to infer array structure and numeric range. 
> So a "provided schema" or "TO_JSON" may be a good remedy:
>
> Provided Schema
>
> We can add DTD or XML Schema (XSD) support for the XML format. It can 
> build all value vectors (Writer) before reading data, resolving the 
> fields, types, and complex nested structures.
>
> However, a definition file is really a rule validator, and it can allow 
> elements to appear zero or more times. As a result, it is not possible 
> to know whether every element exists until the data is read.
>
> Therefore, we should avoid creating, before reading the data, a large 
> number of value vectors for elements that may never appear.
>
> We can build the top schema at the initial stage and add new value 
> vectors as needed during the reading phase.
>
> TO_JSON
>
> Read and convert XML directly to JSON, using the JSON Reader for data 
> resolution.
>
> This makes it as easy to query XML data as JSON, but requires reading 
> the whole XML file into memory.
>
> I think both can be done, so I look forward to your spirited discussion.
>
> Thanks.
>
> - luoc
