[DISCUSS] Add schema support for the XML format

luoc Wed, 06 Apr 2022 05:01:08 -0700


Hello dear Drillers,

Before starting the topic, I would like to run a short survey:


1. Did you know that Drill already supports XML format?

2. If yes, what is the maximum size of the XML files you normally read: 1 MB, 
10 MB, or 100 MB?

3. Do you expect reading XML to be as easy as reading JSON (schema discovery)?

Thank you for responding to those questions.

XML is different from JSON. If we rely solely on Drill to deduce the structure 
of the data (the schema), the code will become very complex and fragile.

Inferring array structure and numeric ranges are two examples. So a "provided 
schema" or "TO_JSON" may be a good remedy.
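
To make the inference problem concrete, here is a minimal Python sketch (not 
Drill code). The helper names infer_shape and infer_int_type and the sample 
records are hypothetical. It shows how a guess made from the first record can 
be contradicted by a later one, both for array structure and for integer width.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def infer_shape(record_xml: str) -> dict:
    """Guess from a single record whether each child is a scalar or an array."""
    record = ET.fromstring(record_xml)
    counts = Counter(child.tag for child in record)
    return {tag: ("array" if n > 1 else "scalar") for tag, n in counts.items()}


def infer_int_type(text: str) -> str:
    """Pick the narrowest integer type that holds the value (illustrative rule)."""
    value = int(text)
    return "INT" if -2**31 <= value < 2**31 else "BIGINT"


# Hypothetical records: the second one contradicts what the first one suggests.
first = "<book><author>A</author><pages>120</pages></book>"
second = ("<book><author>A</author><author>B</author>"
          "<pages>3000000000</pages></book>")

for xml_text in (first, second):
    pages = ET.fromstring(xml_text).findtext("pages")
    print(infer_shape(xml_text), infer_int_type(pages))
```

After the first record a reader would treat author as a scalar and pages as a 
32-bit integer. The second record invalidates both guesses, so the reader would 
have to change the schema in the middle of the scan.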

Provided Schema

We can add DTD or XML Schema (XSD) support for XML. With it, Drill can build 
all value vectors (writers) before reading data, which resolves field names, 
types, and complex nested structures.

However, a definition file is really a set of validation rules, and those rules 
can allow an element to appear zero or more times. As a result, it is not 
possible to know whether every declared element exists until the data is read.

Therefore, we should avoid creating, before reading the data, a large number of 
value vectors for elements that never actually occur.

We can build the top-level schema at the initial stage and add new value 
vectors as needed during the reading phase.
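
The idea can be sketched in a few lines of Python. This is only an 
illustration under assumptions: the DECLARED dictionary stands in for a parsed 
XSD, plain lists stand in for value vectors, and read_records is a hypothetical 
name. Columns for top-level fields are created up front, and columns for nested 
fields are created only when the data actually contains them.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a parsed XSD: element path -> declared type.
DECLARED = {
    "title": "VARCHAR",
    "pages": "INT",
    "reviews/review": "VARCHAR",   # optional nested element
    "reviews/rating": "INT",       # optional nested element
}


def read_records(xml_text: str, declared: dict) -> dict:
    # Top-level columns (paths without "/") exist before any data is read.
    columns = {path: [] for path in declared if "/" not in path}
    root = ET.fromstring(xml_text)

    def walk(elem, path):
        if len(elem) == 0:  # leaf element: store its text
            if path in declared:
                # Nested columns are created lazily, on first occurrence.
                columns.setdefault(path, []).append((elem.text or "").strip())
            return
        for child in elem:
            walk(child, f"{path}/{child.tag}" if path else child.tag)

    for record in root:
        walk(record, "")
    return columns


doc = (
    "<books>"
    "<book><title>T1</title><pages>10</pages></book>"
    "<book><title>T2</title><pages>20</pages>"
    "<reviews><review>ok</review></reviews></book>"
    "</books>"
)
columns = read_records(doc, DECLARED)
print(sorted(columns))
```

Here reviews/rating is declared but never appears, so no column is allocated 
for it. The sketch does not pad missing values with nulls, which a real reader 
would need to do to keep rows aligned.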

TO_JSON

Read the XML and convert it directly to JSON, then use the JSON reader to 
resolve the data.

This makes querying XML data as easy as querying JSON, but it requires reading 
the whole XML file into memory.
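
As a rough illustration of the conversion (a Python sketch, not the proposed 
Drill implementation), the hypothetical functions element_to_value and 
xml_to_json below turn an XML document into JSON text. Repeated sibling tags 
become arrays, attributes are ignored for brevity, and the whole document is 
parsed in memory, which is the cost mentioned above.

```python
import json
import xml.etree.ElementTree as ET


def element_to_value(elem):
    """Convert an element to a string (leaf) or a dict (has children)."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    obj = {}
    for child in children:
        value = element_to_value(child)
        if child.tag in obj:
            # A repeated tag is promoted to a list on its second occurrence.
            if not isinstance(obj[child.tag], list):
                obj[child.tag] = [obj[child.tag]]
            obj[child.tag].append(value)
        else:
            obj[child.tag] = value
    return obj


def xml_to_json(xml_text: str) -> str:
    # ET.fromstring builds the complete tree in memory before conversion.
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_value(root)})


sample = "<book><author>A</author><author>B</author><pages>120</pages></book>"
print(xml_to_json(sample))
```

All leaf values come out as strings, so numeric typing would still be left to 
the JSON reader or to a provided schema.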

I think both can be done, and I look forward to a spirited discussion.

Thanks.

- luoc
