And what happens if there are zero instances (curiosity here)?
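
To make the question concrete, here is a minimal Python sketch of the scalar-vs-array problem from the quoted thread, plus one possible answer for the zero-instances case. Everything in it is an assumption for illustration: the REPEATED set stands in for what an XSD would declare with maxOccurs="unbounded", the <owner> wrapper element and the function names are made up, and returning an empty list for zero occurrences is just one option (null or omitting the field are others, and minOccurs would matter). It is not how Drill or xml_to_json actually behaves.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XSD: element names declared with
# maxOccurs="unbounded". A real converter would read this from the schema.
REPEATED = {"pet"}


def naive_convert(xml_text):
    """Schema-less conversion: the shape depends on how often a tag appears."""
    root = ET.fromstring(xml_text)
    out = {}
    for child in root:
        if child.tag not in out:
            out[child.tag] = child.text                    # first occurrence: scalar
        elif isinstance(out[child.tag], list):
            out[child.tag].append(child.text)              # third or later occurrence
        else:
            out[child.tag] = [out[child.tag], child.text]  # second: promote to list
    return out


def schema_convert(xml_text, repeated=REPEATED):
    """Schema-driven conversion: repeated elements are always lists."""
    root = ET.fromstring(xml_text)
    # Assumption: zero occurrences of a repeated element yields an empty list.
    out = {name: [] for name in repeated}
    for child in root:
        if child.tag in repeated:
            out[child.tag].append(child.text)
        else:
            out[child.tag] = child.text
    return out


one_pet = "<owner><pet>dog</pet></owner>"
two_pets = "<owner><pet>dog</pet><pet>cat</pet></owner>"
no_pets = "<owner><name>Ann</name></owner>"

print(naive_convert(one_pet))    # pet is a string
print(naive_convert(two_pets))   # pet is a list
print(schema_convert(one_pet))   # pet is a one-element list
print(schema_convert(no_pets))   # pet is an empty list under the assumption above
```

With the naive version the pet column changes type between files; with the schema-driven version it is always an array, and the zero-instances case comes down to a policy choice like the one marked in the comment.
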

On Wed, Apr 6, 2022 at 12:28 PM Lee, David <david....@blackrock.com.invalid> wrote:

> Which is why using an XSD is more or less foolproof.
>
> If the pet element is tagged with maxOccurs="unbounded", it implies it
> should be saved as an array even if there is just one occurrence of <pet>
> in your data.
>
> -----Original Message-----
> From: Ted Dunning <ted.dunn...@gmail.com>
> Sent: Wednesday, April 6, 2022 11:48 AM
> To: dev <dev@drill.apache.org>
> Cc: u...@drill.apache.org
> Subject: Re: [DISCUSS] Add schema support for the XML format
>
> That example:
>
> <pet>dog</pet>
> <pet>cat</pet>
>
> can also convert to ["pet":"dog", "pet":"dog"]
>
> XML is rife with problems like this.
>
> As you say.
>
> But worse than can be imagined unless you have been hit by these problems.
>
> On Wed, Apr 6, 2022 at 11:39 AM Lee, David <david....@blackrock.com.invalid>
> wrote:
>
> > TO_JSON won't work in cases where:
> >
> > One file contains: <pet>dog</pet> which converts to {"pet":"dog"}
> >
> > But another file contains:
> > <pet>dog</pet>
> > <pet>cat</pet>
> > which converts to: {"pet": ["dog", "cat"]}
> >
> > pet as a column in Drill can't be both a varchar and an array of
> > varchar.
> >
> > There are a ton of gotchas when dealing with XML:
> > numeric vs string
> > scalar vs array
> >
> > -----Original Message-----
> > From: Lee, David
> > Sent: Wednesday, April 6, 2022 10:54 AM
> > To: u...@drill.apache.org; dev@drill.apache.org
> > Subject: RE: [DISCUSS] Add schema support for the XML format
> >
> > I wrote something to convert XML to JSON using an XSD schema file to
> > solve fields, types, nested structures, etc. It's the only real way
> > to ensure column level data integrity.
> >
> > https://github.com/davlee1972/xml_to_json
> >
> > Converts XML to valid JSON or JSONL. Requires only two files to get
> > started: your XML file and the XSD schema file for that XML file.
> >
> > -----Original Message-----
> > From: luoc <l...@apache.org>
> > Sent: Wednesday, April 6, 2022 5:01 AM
> > To: u...@drill.apache.org; dev@drill.apache.org
> > Subject: [DISCUSS] Add schema support for the XML format
> >
> > Hello dear Drillers,
> >
> > Before starting the topic, I would like to do a simple survey:
> >
> > 1. Did you know that Drill already supports the XML format?
> >
> > 2. If yes, what is the maximum size of the XML files you normally read?
> > 1MB, 10MB or 100MB?
> >
> > 3. Do you expect that reading XML will be as easy as JSON (Schema
> > Discovery)?
> >
> > Thank you for responding to those questions.
> >
> > XML is different from JSON, and if we rely solely on Drill itself to
> > deduce the structure of the data (the schema), the code will get very
> > complex and delicate.
> >
> > For example, inferring array structure and numeric range. So,
> > "provided schema" or "TO_JSON" may be good medicine:
> >
> > Provided Schema
> >
> > We can add DTD or XML Schema (XSD) support for XML. It can
> > build all value vectors (Writer) before reading data, solving the
> > fields, types, and complex nesting.
> >
> > However, a definition file is actually a rule validator that allows
> > elements to appear 0 or more times. As a result, it is not possible to
> > know if all elements exist until the data is read.
> >
> > Therefore, avoid creating a large number of value vectors that do not
> > actually exist before reading the data.
> >
> > We can build the top schema at the initial stage and add new value
> > vectors as needed during the reading phase.
> >
> > TO_JSON
> >
> > Read and convert XML directly to JSON, using the JSON Reader for data
> > resolution.
> >
> > It makes it easier for us to query XML data like JSON, but
> > requires reading the whole XML file into memory.
> >
> > I think both can be done, so I look forward to your spirited
> > discussion.
> >
> > Thanks.
> >
> > - luoc