I'll add to this thread as the developer of the XML plugin for Drill. I think it would be a very good idea to add XSD schema support. I've not had time to really dig into that, but writing a converter from XSD to Drill's TupleMetadata seems relatively straightforward. Then we'd have to make sure the schema provisioning works, which actually isn't that hard. I started looking into the first part, but got sidetracked. In any event, having the schema from an XSD would eliminate the ambiguity in XML files.
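
To make the idea concrete, here is a rough sketch of what the XSD-to-schema step might look like, written in plain Python rather than against Drill's Java API. Everything in it is an assumption for illustration: the sample XSD, the function names `columns_from_xsd` and `read_record`, and the output format (a plain dict, not a real TupleMetadata). The REQUIRED/OPTIONAL/REPEATED labels are only borrowed from Drill's data modes. It handles a single flat `xs:sequence` and nothing else.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical XSD, invented for this sketch (not taken from any real file).
XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="owner">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="age" type="xs:int" minOccurs="0"/>
        <xs:element name="pet" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def columns_from_xsd(xsd_text, root_name):
    """Return {column: (xsd_type, mode)} for the children of one root element."""
    schema = ET.fromstring(xsd_text)
    root = next(e for e in schema.findall(XS + "element")
                if e.get("name") == root_name)
    columns = {}
    for child in root.findall(f"{XS}complexType/{XS}sequence/{XS}element"):
        # Both attributes default to 1 when absent.
        max_occurs = child.get("maxOccurs", "1")
        min_occurs = child.get("minOccurs", "1")
        if max_occurs == "unbounded" or int(max_occurs) > 1:
            mode = "REPEATED"   # always an array, even for 0 or 1 occurrences
        elif min_occurs == "0":
            mode = "OPTIONAL"   # nullable scalar
        else:
            mode = "REQUIRED"
        columns[child.get("name")] = (child.get("type"), mode)
    return columns


def read_record(xml_text, columns):
    """Read one XML record, shaping each column as the schema dictates."""
    elem = ET.fromstring(xml_text)
    record = {}
    for name, (xsd_type, mode) in columns.items():
        cast = int if xsd_type == "xs:int" else str
        values = [cast(e.text) for e in elem.findall(name)]
        if mode == "REPEATED":
            record[name] = values
        else:
            record[name] = values[0] if values else None
    return record


columns = columns_from_xsd(XSD, "owner")
one_pet = read_record("<owner><name>A</name><pet>dog</pet></owner>", columns)
no_pets = read_record("<owner><name>B</name></owner>", columns)
```

With the column modes fixed up front, `pet` is a list in both records (one element in `one_pet`, empty in `no_pets`), so the column type no longer depends on how many `<pet>` elements a given file happens to contain.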
As a side benefit, this would allow for easy conversion from XML to JSON. You could simply do a CTAS query on an XML file and output JSON.

Best,
-- C

> On Apr 6, 2022, at 5:41 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> And if there are zero instances what happens (curiosity here)?
>
> On Wed, Apr 6, 2022 at 12:28 PM Lee, David <david....@blackrock.com.invalid> wrote:
>
>> Which is why using a XSD is more or less full proof..
>>
>> If the pet element is tagged with maxOccurs="unbounded" it implies it
>> should be saved as an array even if there is just one occurrence of <pet>
>> in your data.
>>
>> -----Original Message-----
>> From: Ted Dunning <ted.dunn...@gmail.com>
>> Sent: Wednesday, April 6, 2022 11:48 AM
>> To: dev <dev@drill.apache.org>
>> Cc: u...@drill.apache.org
>> Subject: Re: [DISCUSS] Add schema support for the XML format
>>
>> External Email: Use caution with links and attachments
>>
>> That example:
>>
>> <pet>dog</pet>
>> <pet>cat</pet>
>>
>> can also convert to ["pet":"dog", "pet":"dog']
>>
>> XML is rife with problems like this.
>>
>> As you say.
>>
>> But worse than can be imagined unless you have been hit by these problems.
>>
>> On Wed, Apr 6, 2022 at 11:39 AM Lee, David <david....@blackrock.com.invalid> wrote:
>>
>>> TO_JSON won't work in cases where..
>>>
>>> One file contains: <pet>dog</pet> which converts to {"pet":"dog"}
>>>
>>> But another file contains:
>>> <pet>dog</pet>
>>> <pet>cat</pet>
>>> which converts to: {"pet": ["dog", "cat"]}
>>>
>>> pet as a column in Drill can't be both a varchar and an array of
>>> varchar
>>>
>>> There are a ton of gotcha(s) when dealing with XML..
>>> numeric vs string
>>> scalar vs array
>>>
>>> -----Original Message-----
>>> From: Lee, David
>>> Sent: Wednesday, April 6, 2022 10:54 AM
>>> To: u...@drill.apache.org; dev@drill.apache.org
>>> Subject: RE: [DISCUSS] Add schema support for the XML format
>>>
>>> I wrote something to convert XML to JSON using an XSD schema file to
>>> solve fields, types, nested structures, etc.. It's the only real way
>>> to ensure column level data integrity.
>>>
>>> https://urldefense.com/v3/__https://github.com/davlee1972/xml_to_json__;!!KSjYCgUGsB4!JXBZmU6Z9rag7GO9okdk22y102IZz1gw3IThP06jk-0bTwJiGLlbm8HnWC64OWFHods$
>>>
>>> Converts XML to valid JSON or JSONL Requires only two files to get
>>> started. Your XML file and the XSD schema file for that XML file.
>>>
>>> -----Original Message-----
>>> From: luoc <l...@apache.org>
>>> Sent: Wednesday, April 6, 2022 5:01 AM
>>> To: u...@drill.apache.org; dev@drill.apache.org
>>> Subject: [DISCUSS] Add schema support for the XML format
>>>
>>> External Email: Use caution with links and attachments
>>>
>>> Hello dear driller,
>>>
>>> Before starting the topic, I would like to do a simple survey :
>>>
>>> 1. Did you know that Drill already supports XML format?
>>>
>>> 2. If yes, what is the maximum size for the XML files you normally read?
>>> 1MB, 10MB or 100MB
>>>
>>> 3. Do you expect that reading XML will be as easy as JSON (Schema
>>> Discovery)?
>>>
>>> Thank you for responding to those questions.
>>>
>>> XML is different from the JSON file, and if we rely solely on the
>>> Drill drive to deduce the structure of the data. (or called SCHEMA),
>>> the code will get very complex and delicate.
>>>
>>> For example, inferring array structure and numeric range. So,
>>> "provided schema" or "TO_JSON" may be good medicine :
>>>
>>> Provided Schema
>>>
>>> We can add the DTD or XML Schema (XSD) support for the XML. It can
>>> build all value vectors (Writer) before reading data, solving the
>>> fields, types, and complex nested.
>>>
>>> However, a definition file is actually a rule validator that allows
>>> elements to appear 0 or more times. As a result, it is not possible to
>>> know if all elements exist until the data is read.
>>>
>>> Therefore, avoid creating a large number of value vectors that do not
>>> actually exist before reading the data.
>>>
>>> We can build the top schema at the initial stage and add new value
>>> vectors as needed during the reading phase.
>>>
>>> TO_JSON
>>>
>>> Read and convert XML directly to JSON, using the JSON Reader for data
>>> resolution.
>>>
>>> It makes it easier for us to query the XML data such as JSON, but
>>> requires reading the whole XML file in memory.
>>>
>>> I think the two can be done, so I look forward to your spirited
>>> discussion.
>>>
>>> Thanks.
>>>
>>> - luoc
>>>
>>> This message may contain information that is confidential or privileged.
>>> If you are not the intended recipient, please advise the sender
>>> immediately and delete this message. See
>>> http://www.blackrock.com/corporate/compliance/email-disclaimers for
>>> further information. Please refer to
>>> http://www.blackrock.com/corporate/compliance/privacy-policy for more
>>> information about BlackRock’s Privacy Policy.
>>>
>>> For a list of BlackRock's office addresses worldwide, see
>>> http://www.blackrock.com/corporate/about-us/contacts-locations.
>>>
>>> © 2022 BlackRock, Inc. All rights reserved.