XML will never die. The Cobol programmers were reincarnated and built
similarly long-lasting generators of XML.

If you have a schema, then it is a reasonable format for Drill to parse,
if only to turn around and write to another format.
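For what it's worth, that "turn around and write" step can be a one-shot
conversion outside Drill. Here is a rough sketch using Jackson's XmlMapper
(assuming the jackson-dataformat-xml module is on the classpath; this is
illustrative, not an endorsement of any particular tool):

    import java.io.File;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.xml.XmlMapper;

    public class XmlToJson {
        public static void main(String[] args) throws Exception {
            // XmlMapper parses XML into the same JsonNode tree model the
            // JSON mapper uses, so conversion is just read-then-write.
            // Caveat: the root element's name is dropped, and depending on
            // the Jackson version/configuration, repeated elements may not
            // come out as arrays -- spot-check the output.
            JsonNode tree = new XmlMapper().readTree(new File(args[0]));
            System.out.println(new ObjectMapper()
                    .writerWithDefaultPrettyPrinter()
                    .writeValueAsString(tree));
        }
    }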
On Wed, Apr 6, 2022 at 7:31 PM Paul Rogers <par0...@gmail.com> wrote:

> Hi Luoc,
>
> First, what poor soul is asked to deal with large amounts of XML in
> this day and age? I thought we were past the XML madness, except in
> Maven and Hadoop config files.
>
> XML is much like JSON, only worse. JSON at least has well-defined types
> that can be gleaned from JSON syntax. With XML...? Anything goes,
> because XML is a document mark-up language, not a data structure
> description language.
>
> The classic problem with XML is that if XML is used to describe a
> reasonable data structure (rows and columns), then it can reasonably be
> parsed into rows and columns. If XML represents a document (or a
> relationship graph), then there is no good mapping to rows and columns.
> This was true 20 years ago and it is true today.
>
> So, suppose your XML represents row-like data. Then an XML parser could
> hope for the best and make a good guess at the types and structure. The
> XML parser could work like the new & improved JSON parser (based on
> EVF2) which Vitalii is working on. (I did the original work and Vitalii
> has the thankless task of updating that work to match the current
> code.) That JSON parser is VERY complex as it infers types on the fly.
> Quick: what type is "a" in [{"a": null}, {"a": null}, {"a": []}]? We
> don't know. Only when {"a": [10]} appears can we say, "Oh! All those
> "a" were REPEATED INTs!"
>
> An XML parser could use the same tricks. In fact, it can probably use
> the same code. In JSON, the parser sends events, and the Drill code
> does its type inference magic based on those events. An XML parser can
> emit similar events, and make similar decisions.
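> As a sketch of what that event plumbing might look like on the XML
> side, StAX already gives us a pull-based event stream much like the one
> the JSON reader's type inference consumes. Only the StAX calls below
> are real; the class name and sample data are invented:
>
>     import java.io.StringReader;
>
>     import javax.xml.stream.XMLInputFactory;
>     import javax.xml.stream.XMLStreamConstants;
>     import javax.xml.stream.XMLStreamReader;
>
>     public class XmlEventSketch {
>         public static void main(String[] args) throws Exception {
>             String xml =
>                 "<rows><row><a>10</a></row><row><a/></row></rows>";
>             XMLStreamReader r = XMLInputFactory.newInstance()
>                     .createXMLStreamReader(new StringReader(xml));
>             while (r.hasNext()) {
>                 switch (r.next()) {
>                 case XMLStreamConstants.START_ELEMENT:
>                     // Analogous to JSON "start object/field"; the type
>                     // of <a> is still unknown at this point.
>                     System.out.println("start: " + r.getLocalName());
>                     break;
>                 case XMLStreamConstants.CHARACTERS:
>                     if (!r.isWhiteSpace()) {
>                         // Only here can inference guess INT vs. VARCHAR.
>                         // The empty <a/> never reaches this case at all:
>                         // the same deferred-type problem as a JSON null.
>                         System.out.println("value: " + r.getText().trim());
>                     }
>                     break;
>                 case XMLStreamConstants.END_ELEMENT:
>                     System.out.println("end: " + r.getLocalName());
>                     break;
>                 }
>             }
>         }
>     }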
> As you noted, if we have a DTD, we don't have to do schema inference.
> But we do have to do DTD-to-rows-and-columns inference. Once we do
> that, we can use the provided schema as you suggested. (The JSON reader
> I mentioned already supports a provided schema to add sanity to the
> otherwise crazy JSON type inference process when data is sparse and
> changing.)
>
> In fact, if you convert XML to JSON, then the XML-to-JSON converter has
> to make those same decisions. Hopefully someone has already done that,
> and users would be willing to use that fancy tool to convert their XML
> to JSON before using Drill. (Of course, if they want good performance,
> they should have converted XML to Parquet instead.)
>
> So, rather than have a super-fancy Drill XML reader, maybe find a
> super-fancy XML-to-Parquet converter, use that once, and then let Drill
> quickly query Parquet. The results will be much better than trying to
> parse XML over and over on each query. Just because we *can* do it
> doesn't mean we *should*.
>
> Thanks,
>
> - Paul
>
> On Wed, Apr 6, 2022 at 5:01 AM luoc <l...@apache.org> wrote:
>
> > Hello dear driller,
> >
> > Before starting the topic, I would like to do a simple survey:
> >
> > 1. Did you know that Drill already supports the XML format?
> >
> > 2. If yes, what is the maximum size of the XML files you normally
> > read? 1MB, 10MB, or 100MB?
> >
> > 3. Do you expect that reading XML will be as easy as JSON (schema
> > discovery)?
> >
> > Thank you for responding to those questions.
> >
> > XML is different from JSON, and if we rely solely on Drill to deduce
> > the structure of the data (the *schema*), the code will get very
> > complex and delicate. For example, it must infer array structure and
> > numeric ranges. So a "provided schema" or "TO_JSON" may be good
> > medicine:
> >
> > *Provided Schema*
> >
> > We can add DTD or XML Schema (XSD) support for XML. It can build all
> > the value vectors (writers) before reading the data, resolving the
> > fields, types, and complex nesting.
> >
> > However, a definition file is really a rule validator that allows
> > elements to appear zero or more times. As a result, it is not
> > possible to know whether all the elements exist until the data is
> > read.
> >
> > Therefore, we should avoid creating, before reading the data, a
> > large number of value vectors for elements that may not actually
> > exist.
> >
> > We can build the top-level schema at the initial stage and add new
> > value vectors as needed during the reading phase.
> >
> > *TO_JSON*
> >
> > Read and convert the XML directly to JSON, then use the JSON reader
> > for data resolution.
> >
> > This makes it as easy to query XML data as JSON, but it requires
> > reading the whole XML file into memory.
> >
> > I think both can be done, so I look forward to your spirited
> > discussion.
> >
> > Thanks.
> >
> > - luoc
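To make luoc's "build the top-level schema first, add vectors as they
appear" idea above concrete, here is a rough sketch. All of these names
are invented for illustration; they are not Drill's actual value-vector
or EVF API:

    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical writer: records declared types from the XSD up front,
    // but materializes a column writer only when an element actually
    // appears in the data, since the XSD allows 0..n occurrences.
    class LazyRowWriter {
        private final Map<String, String> declaredTypes = new HashMap<>();
        private final Map<String, ColumnWriter> active = new LinkedHashMap<>();

        // Called once per element declared in the XSD; no vector yet.
        void declare(String name, String type) {
            declaredTypes.put(name, type);
        }

        // Called from the parse loop on first sight of an element.
        // Undeclared elements fall back to a default type.
        ColumnWriter column(String name) {
            return active.computeIfAbsent(name, n -> new ColumnWriter(
                    n, declaredTypes.getOrDefault(n, "VARCHAR")));
        }
    }

    class ColumnWriter {
        final String name;
        final String type;

        ColumnWriter(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

The trade-off is that downstream operators can then see the schema change
mid-read, which is the usual schema-change headache in Drill.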