Hi Luoc,

First, what poor soul is asked to deal with large amounts of XML in this
day and age? I thought we were past the XML madness, except in Maven and
Hadoop config files.

XML is much like JSON, only worse. JSON at least has well-defined types
that can be gleaned from JSON syntax. With XML...? Anything goes because
XML is a document mark-up language, not a data structure description
language.

The classic problem with XML is that if XML is used to describe a
reasonable data structure (rows and columns), then it can reasonably be
parsed into rows and columns. If XML represents a document (or a
relationship graph), then there is no good mapping to rows and columns.
This was true 20 years ago and it is true today.

So, suppose your XML represents row-like data. Then an XML parser could
hope for the best and make a good guess at the types and structure. The XML
parser could work like the new & improved JSON parser (based on EVF2) which
Vitalii is working on. (I did the original work and Vitalii has the
thankless task of updating that work to match the current code.) That JSON
parser is VERY complex as it infers types on the fly. Quick: what type is
"a" in [{"a": null}, {"a": null}, {"a": []}]? We don't know. Only when
{"a": [10]} appears can we say, "Oh! All those "a" values were REPEATED
INTs!"
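
To make that concrete, here is a toy sketch of deferred inference (Python
pseudocode of the idea, not Drill's actual Java code): the column type stays
UNKNOWN until some value finally pins it down.

```python
import json

def infer_column_type(rows, col):
    """Scan row objects and resolve the column's type lazily."""
    inferred = "UNKNOWN"
    for row in rows:
        value = row.get(col)
        if value is None:
            continue                       # null tells us nothing
        if isinstance(value, list):
            if not value:
                # An empty array says "repeated", but not repeated *what*.
                if inferred == "UNKNOWN":
                    inferred = "REPEATED UNKNOWN"
            elif isinstance(value[0], int):
                inferred = "REPEATED INT"  # finally, a concrete type
        elif isinstance(value, int):
            inferred = "INT"
    return inferred

rows = json.loads('[{"a": null}, {"a": null}, {"a": []}, {"a": [10]}]')
print(infer_column_type(rows, "a"))        # -> REPEATED INT
```

Drop the final {"a": [10]} and the best this sketch can say is
"REPEATED UNKNOWN", which is exactly the problem.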

An XML parser could use the same tricks. In fact, it can probably use the
same code. In JSON, the parser sends events, and the Drill code does its
type inference magic based on those events. An XML parser can emit similar
events, and make similar decisions.
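
A rough illustration of the event-stream idea, using Python's stdlib pull
parser rather than anything Drill-specific: XML start/end/text events map
naturally onto the object/value events a JSON parser emits.

```python
import io
import xml.etree.ElementTree as ET

xml_doc = b"<rows><row><a>10</a></row><row><a>20</a></row></rows>"

# Translate XML parse events into JSON-reader-style events that a
# type-inference listener could consume unchanged.
events = []
for event, elem in ET.iterparse(io.BytesIO(xml_doc), events=("start", "end")):
    if event == "start":
        events.append(("START_OBJECT", elem.tag))
    else:  # "end": element complete, its text (if any) is now available
        if elem.text and elem.text.strip():
            events.append(("VALUE", elem.tag, elem.text.strip()))
        events.append(("END_OBJECT", elem.tag))

for e in events:
    print(e)
```

The open question, of course, is whether ("VALUE", "a", "10") is an INT or
VARCHAR, which is the same inference problem as in JSON, only worse because
XML text carries no type syntax at all.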

As you noted, if we have a DTD, we don't have to do schema inference. But
we do have to do DTD-to-rows-and-columns inference. Once we do that, we use
the provided schema as you suggested. (The JSON reader I mentioned already
supports a provided schema to add sanity to the otherwise crazy JSON type
inference process when data is sparse and changing.)

In fact, if you convert XML to JSON, then the XML-to-JSON parser has to
make those same decisions. Hopefully someone has already done that and
users would be willing to use that fancy tool to convert their XML to JSON
before using Drill. (Of course, if they want good performance, they should
have converted XML to Parquet instead.)
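
A toy stdlib-only XML-to-JSON converter shows the decisions any such tool
must make (the numeric sniffing and array promotion here are my own
simplistic choices, not any real tool's behavior):

```python
import json
import xml.etree.ElementTree as ET

def elem_to_json(elem):
    """Convert an XML element to a JSON-style value, guessing types."""
    children = list(elem)
    if not children:
        text = (elem.text or "").strip()
        # XML gives us only text, so we must sniff numbers ourselves.
        return int(text) if text.lstrip("-").isdigit() else text
    out = {}
    for child in children:
        value = elem_to_json(child)
        if child.tag in out:
            # Second occurrence of a tag: promote the field to an array.
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    return out

root = ET.fromstring("<row><a>10</a><a>20</a><b>hi</b></row>")
print(json.dumps(elem_to_json(root)))  # -> {"a": [10, 20], "b": "hi"}
```

Note the trap: a file whose rows happen to contain only one <a> each would
come out as a scalar, not an array, which is the same sparse-data ambiguity
the JSON reader fights.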

So, rather than have a super-fancy Drill XML reader, maybe find a
super-fancy XML-to-Parquet converter, use that once, and then let Drill
quickly query Parquet. The results will be much better than trying to parse
XML over and over on each query. Just because we *can* do it doesn't mean
we *should*.

Thanks,

- Paul



On Wed, Apr 6, 2022 at 5:01 AM luoc <l...@apache.org> wrote:

> 
> Hello dear driller,
>
> Before starting the topic, I would like to do a simple survey :
>
> 1. Did you know that Drill already supports XML format?
>
> 2. If yes, what is the maximum size for the XML files you normally read? 1MB,
> 10MB, or 100MB?
>
> 3. Do you expect that reading XML will be as easy as JSON (Schema
> Discovery)?
>
> Thank you for responding to those questions.
>
> XML is different from JSON, and if we rely solely on Drill to infer the
> structure of the data (also called the *schema*), the code will get very
> complex and delicate.
>
> For example, inferring array structure and numeric ranges. So, a "provided
> schema" or "TO_JSON" may be good medicine:
>
> *Provided Schema*
>
> We can add DTD or XML Schema (XSD) support for XML. It can build all
> value vectors (writers) before reading data, resolving the fields, types,
> and complex nesting.
>
> However, a definition file is really a set of validation rules that allows
> elements to appear zero or more times. As a result, we cannot know whether
> all elements exist until the data is read.
>
> Therefore, we should avoid creating, before reading the data, a large
> number of value vectors for elements that may not actually exist.
>
> We can build the top schema at the initial stage and add new value vectors
> as needed during the reading phase.
>
> *TO_JSON*
>
> Read and convert XML directly to JSON, using the JSON Reader for data
> resolution.
>
> It makes it easier for us to query XML data just like JSON, but it
> requires reading the whole XML file into memory.
>
> I think both approaches are feasible, so I look forward to your spirited
> discussion.
>
> Thanks.
>
> - luoc
>