I wrote an XML-to-Parquet converter as well.

It basically extends xml_to_json: it writes the JSON data into a memory-mapped
file, reads the memory-mapped file into an Apache Arrow columnar table, and
saves the Arrow table as a Parquet file.

https://github.com/davlee1972/xml_to_json
https://arrow.apache.org/docs/python/json.html
https://arrow.apache.org/docs/python/parquet.html
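
A minimal sketch of the last two steps of that pipeline, using the pyarrow JSON
and Parquet modules linked above (the file names are illustrative, and the XML
is assumed to have already been converted to line-delimited JSON):

    import pyarrow.json as pajson
    import pyarrow.parquet as pq

    # Read the JSON produced from the XML into an Arrow columnar table.
    table = pajson.read_json("feed.jsonl")

    # Persist the Arrow table as Parquet for fast columnar queries.
    pq.write_table(table, "feed.parquet")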

I deal with data feeds that run to 2 TB of XML a month, and these XML files
have 5,000 different elements with nesting levels that go 10 levels deep.

On Apr 6, 2022, at 7:30 PM, Paul Rogers <par0...@gmail.com> wrote:

Hi Luoc,

First, what poor soul is asked to deal with large amounts of XML in this
day and age? I thought we were past the XML madness, except in Maven and
Hadoop config files.

XML is much like JSON, only worse. JSON at least has well-defined types
that can be gleaned from JSON syntax. With XML...? Anything goes because
XML is a document mark-up language, not a data structure description
language.

The classic problem with XML is that if XML is used to describe a
reasonable data structure (rows and columns), then it can reasonably be
parsed into rows and columns. If XML represents a document (or a
relationship graph), then there is no good mapping to rows and columns.
This was true 20 years ago and it is true today.

So, suppose your XML represents row-like data. Then an XML parser could
hope for the best and make a good guess at the types and structure. The XML
parser could work like the new & improved JSON parser (based on EVF2) which
Vitalii is working on. (I did the original work and Vitalii has the
thankless task of updating that work to match the current code.) That JSON
parser is VERY complex as it infers types on the fly. Quick, what type is
"a" in [{"a": null}, {"a": null}, {"a": []}]? We don't know. Only when
{"a": [10]} appears can we say, "Oh! All those "a" values were REPEATED INTs!"

An XML parser could use the same tricks. In fact, it can probably use the
same code. In JSON, the parser sends events, and the Drill code does its
type inference magic based on those events. An XML parser can emit similar
events, and make similar decisions.
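
As a rough sketch, an XML front end could emit JSON-reader-style events from a
streaming parse; here Python's iterparse stands in for the Java event pipeline,
and the event names are purely illustrative, not Drill APIs:

    import xml.etree.ElementTree as ET

    def xml_events(path):
        for event, elem in ET.iterparse(path, events=("start", "end")):
            if event == "start":
                yield ("startObject", elem.tag)
            else:  # "end": the element's text content is complete here
                text = (elem.text or "").strip()
                if text:
                    yield ("scalar", elem.tag, text)
                yield ("endObject", elem.tag)

The same type-inference layer that consumes JSON parser events could consume
these events and defer its type decisions in exactly the same way.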

As you noted, if we have a DTD, we don't have to do schema inference. But
we do have to do DTD-to-rows-and-columns inference. Once we do that, we use
the provided schema as you suggested. (The JSON reader I mentioned already
supports a provided schema to add sanity to the otherwise crazy JSON type
inference process when data is sparse and changing.)

In fact, if you convert XML to JSON, then the XML-to-JSON parser has to
make those same decisions. Hopefully someone has already done that and
users would be willing to use that fancy tool to convert their XML to JSON
before using Drill. (Of course, if they want good performance, they should
have converted XML to Parquet instead.)

So, rather than have a super-fancy Drill XML reader, maybe find a
super-fancy XML-to-Parquet converter, use that once, and then let Drill
quickly query Parquet. The results will be much better than trying to parse
XML over and over on each query. Just because we *can* do it doesn't mean
we *should*.

Thanks,

- Paul



On Wed, Apr 6, 2022 at 5:01 AM luoc <l...@apache.org> wrote:


Hello dear driller,

Before starting the topic, I would like to do a simple survey:

1. Did you know that Drill already supports XML format?

2. If yes, what is the maximum size for the XML files you normally read? 1 MB,
10 MB, or 100 MB?

3. Do you expect that reading XML will be as easy as JSON (Schema
Discovery)?

Thank you for responding to those questions.

XML is different from JSON, and if we rely solely on Drill to deduce the
structure of the data (also called the *SCHEMA*), the code will get very
complex and delicate.

For example, inferring array structure and numeric ranges is hard. So, a
"provided schema" or "TO_JSON" may be good medicine:

*Provided Schema*

We can add DTD or XML Schema (XSD) support for XML. That would let us build
all the value vectors (writers) before reading the data, resolving the fields,
types, and complex nesting up front.

However, a definition file is really a rule validator that allows elements to
appear zero or more times. As a result, it is not possible to know whether all
of the elements actually exist until the data is read.

Therefore, we should avoid creating a large number of value vectors for
elements that do not actually exist before reading the data.

Instead, we can build the top-level schema at the initial stage and add new
value vectors as needed during the reading phase, as sketched below.
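
The following is an illustrative sketch (plain Python lists standing in for
Drill's value vectors and writers) of that lazy strategy: declare only the
top-level columns known from the XSD/DTD up front, and create a column the
first time an element actually appears in the data.

    class LazyRowWriter:
        """Toy stand-in for Drill's value-vector writers."""

        def __init__(self, top_level_columns):
            self.columns = {name: [] for name in top_level_columns}
            self.rows = 0

        def write(self, record):
            for name, value in record.items():
                if name not in self.columns:
                    # First sighting of this element: back-fill earlier rows.
                    self.columns[name] = [None] * self.rows
                self.columns[name].append(value)
            self.rows += 1
            # Pad columns that were absent from this record.
            for values in self.columns.values():
                if len(values) < self.rows:
                    values.append(None)

    writer = LazyRowWriter(["id", "name"])
    writer.write({"id": 1, "name": "a"})
    writer.write({"id": 2, "name": "b", "city": "x"})  # "city" added lazily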

*TO_JSON*

Read and convert XML directly to JSON, using the JSON Reader for data
resolution.

This makes querying the XML data as easy as querying JSON, but it requires
reading the whole XML file into memory, as sketched below.
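
A minimal sketch of that conversion, assuming the third-party xmltodict
package (the file name is illustrative); note that the whole document is held
in memory during the conversion:

    import json
    import xmltodict

    with open("orders.xml", "rb") as f:
        doc = xmltodict.parse(f)   # XML -> nested dicts/lists, all in memory

    json_text = json.dumps(doc)    # hand this text to the JSON reader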

I think both approaches are feasible, so I look forward to your spirited discussion.

Thanks.

- luoc

