And if there are zero instances what happens (curiosity here)?


On Wed, Apr 6, 2022 at 12:28 PM Lee, David <david....@blackrock.com.invalid>
wrote:

> Which is why using an XSD is more or less foolproof.
>
> If the pet element is tagged with maxOccurs="unbounded" it implies it
> should be saved as an array even if there is just one occurrence of <pet>
> in your data.
>
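
A minimal sketch of the maxOccurs idea above, assuming a made-up schema and a flat document (the owner/name/pet names and both helper functions are hypothetical, not taken from any tool mentioned in this thread). It reads the XSD first, then always emits an array for any element the schema allows to repeat. Emitting an empty array when there are zero occurrences is just one possible convention chosen here; null would be another.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema, made up for illustration.
XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="owner">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="pet" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def repeated_names(xsd_text):
    """Names of elements the schema allows to occur more than once."""
    names = set()
    for el in ET.fromstring(xsd_text.strip()).iter(XS + "element"):
        name = el.get("name")
        max_occurs = el.get("maxOccurs", "1")  # XSD default is 1
        if name and (max_occurs == "unbounded" or int(max_occurs) > 1):
            names.add(name)
    return names


def to_record(xml_text, arrays):
    """Flat conversion of the root's children, shaped by the schema."""
    # Assumption: repeatable elements get an array even with zero occurrences.
    record = {name: [] for name in arrays}
    for child in ET.fromstring(xml_text):
        if child.tag in arrays:
            record[child.tag].append(child.text)
        else:
            record[child.tag] = child.text
    return record


arrays = repeated_names(XSD)
one_pet = to_record("<owner><name>Ann</name><pet>dog</pet></owner>", arrays)
no_pets = to_record("<owner><name>Bob</name></owner>", arrays)
print(one_pet)  # {'pet': ['dog'], 'name': 'Ann'}
print(no_pets)  # {'pet': [], 'name': 'Bob'}
```

The sketch ignores nesting and does not scope element names by parent, so it only illustrates the scalar-vs-array decision, not a full converter.
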
> -----Original Message-----
> From: Ted Dunning <ted.dunn...@gmail.com>
> Sent: Wednesday, April 6, 2022 11:48 AM
> To: dev <dev@drill.apache.org>
> Cc: u...@drill.apache.org
> Subject: Re: [DISCUSS] Add schema support for the XML format
>
>
> That example:
>
> <pet>dog</pet>
> <pet>cat</pet>
>
>
> can also convert to ["pet":"dog", "pet":"cat"]
>
> XML is rife with problems like this.
>
> As you say.
>
> But worse than can be imagined unless you have been hit by these problems.
>
> On Wed, Apr 6, 2022 at 11:39 AM Lee, David <david....@blackrock.com.invalid>
> wrote:
>
> > TO_JSON won't work in cases where:
> >
> > One file contains: <pet>dog</pet> which converts to {"pet":"dog"}
> >
> > But another file contains:
> > <pet>dog</pet>
> > <pet>cat</pet>
> > which converts to: {"pet": ["dog", "cat"]}
> >
> > pet as a column in Drill can't be both a varchar and an array of
> > varchar.
> >
> > There are a ton of gotchas when dealing with XML, for example:
> > numeric vs string
> > scalar vs array
> >
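
To make the two gotchas concrete, here is a rough sketch of a schema-less conversion that infers shape and type from the data alone. The function, the owner wrapper element, and the zip field are hypothetical, made up for illustration; this is not how Drill or any tool in this thread does it.

```python
import xml.etree.ElementTree as ET


def naive_convert(xml_text):
    """Schema-less conversion: guess shape and type from the data alone."""
    record = {}
    for child in ET.fromstring(xml_text):
        value = child.text
        # Guess numeric vs string from the text itself.
        if value is not None and value.isdigit():
            value = int(value)
        if child.tag not in record:
            record[child.tag] = value                       # first: scalar
        elif isinstance(record[child.tag], list):
            record[child.tag].append(value)                 # third or later
        else:
            record[child.tag] = [record[child.tag], value]  # second: promote
    return record


file_a = naive_convert("<owner><pet>dog</pet><zip>10001</zip></owner>")
file_b = naive_convert(
    "<owner><pet>dog</pet><pet>cat</pet><zip>SW1A</zip></owner>"
)
print(file_a)  # {'pet': 'dog', 'zip': 10001}
print(file_b)  # {'pet': ['dog', 'cat'], 'zip': 'SW1A'}
```

The same two fields come out with different types in the two files (pet is a string in one and a list in the other, zip is an int in one and a string in the other), which is exactly what a single column cannot hold.
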
> > -----Original Message-----
> > From: Lee, David
> > Sent: Wednesday, April 6, 2022 10:54 AM
> > To: u...@drill.apache.org; dev@drill.apache.org
> > Subject: RE: [DISCUSS] Add schema support for the XML format
> >
> > I wrote something to convert XML to JSON using an XSD schema file to
> > resolve fields, types, nested structures, etc. It's the only real way
> > to ensure column-level data integrity.
> >
> > https://github.com/davlee1972/xml_to_json
> >
> > Converts XML to valid JSON or JSONL. Requires only two files to get
> > started: your XML file and the XSD schema file for that XML file.
> >
> > -----Original Message-----
> > From: luoc <l...@apache.org>
> > Sent: Wednesday, April 6, 2022 5:01 AM
> > To: u...@drill.apache.org; dev@drill.apache.org
> > Subject: [DISCUSS] Add schema support for the XML format
> >
> >
> > Hello dear driller,
> >
> > Before starting the topic, I would like to do a simple survey:
> >
> > 1. Did you know that Drill already supports XML format?
> >
> > 2. If yes, what is the maximum size for the XML files you normally read?
> > 1MB, 10MB or 100MB
> >
> > 3. Do you expect that reading XML will be as easy as JSON (Schema
> > Discovery)?
> >
> > Thank you for responding to those questions.
> >
> > XML is different from JSON, and if we rely solely on Drill to deduce
> > the structure of the data (the schema), the code will get very complex
> > and fragile.
> >
> > For example, it would have to infer array structure and numeric
> > ranges. So a "provided schema" or "TO_JSON" may be a good remedy:
> >
> > Provided Schema
> >
> > We can add DTD or XML Schema (XSD) support for XML. It can build all
> > value vectors (writers) before reading data, resolving the fields,
> > types, and complex nested structures.
> >
> > However, a definition file is actually a rule validator that allows
> > elements to appear 0 or more times. As a result, it is not possible to
> > know if all elements exist until the data is read.
> >
> > Therefore, we should avoid creating, before reading the data, a large
> > number of value vectors for elements that may never appear.
> >
> > We can build the top schema at the initial stage and add new value
> > vectors as needed during the reading phase.
> >
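
A toy sketch of the "top schema first, new vectors as needed" idea above. LazyColumns is a hypothetical stand-in using plain Python lists; it is not Drill's value vector or writer API, and the field names are made up.

```python
class LazyColumns:
    """Toy column store; a stand-in for value vectors, not Drill's API."""

    def __init__(self, top_level):
        self.rows = 0
        # Only the top-level columns exist before any data is read.
        self.columns = {name: [] for name in top_level}

    def write_row(self, record):
        for name, value in record.items():
            if name not in self.columns:
                # First sighting of this field: create the column now and
                # backfill the rows already written with None.
                self.columns[name] = [None] * self.rows
            self.columns[name].append(value)
        self.rows += 1
        # Pad columns that this row did not mention.
        for values in self.columns.values():
            if len(values) < self.rows:
                values.append(None)


store = LazyColumns(top_level=["name"])
store.write_row({"name": "Ann"})
store.write_row({"name": "Bob", "pet": ["dog"]})
store.write_row({"name": "Cy"})
print(store.columns)
# {'name': ['Ann', 'Bob', 'Cy'], 'pet': [None, ['dog'], None]}
```

The pet column is only allocated when the second row first mentions it, so optional elements that the schema permits but the data never contains cost nothing.
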
> > TO_JSON
> >
> > Read and convert XML directly to JSON, using the JSON Reader for data
> > resolution.
> >
> > It makes it easier for us to query the XML data as if it were JSON,
> > but requires reading the whole XML file into memory.
> >
> > I think both can be done, so I look forward to your spirited
> > discussion.
> >
> > Thanks.
> >
> > - luoc
> >
>
