XML will never die. The Cobol programmers were reincarnated and built
similarly long-lasting generators of XML.

If you have a schema, then it is a reasonable format for Drill to parse,
if only to turn around and write to another format.
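For what it's worth, that "turn around and write" step can be a one-shot
conversion outside Drill. Here is a rough sketch using Jackson's XmlMapper
(assuming the jackson-dataformat-xml module is on the classpath; this is
illustrative, not an endorsement of any particular tool):

    import java.io.File;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.xml.XmlMapper;

    public class XmlToJson {
        public static void main(String[] args) throws Exception {
            // XmlMapper parses XML into the same JsonNode tree model the
            // JSON mapper uses, so conversion is just read-then-write.
            // Caveat: the root element's name is dropped, and depending on
            // the Jackson version/configuration, repeated elements may not
            // come out as arrays -- spot-check the output.
            JsonNode tree = new XmlMapper().readTree(new File(args[0]));
            System.out.println(new ObjectMapper()
                    .writerWithDefaultPrettyPrinter()
                    .writeValueAsString(tree));
        }
    }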
On Wed, Apr 6, 2022 at 7:31 PM Paul Rogers <par0...@gmail.com> wrote:

> Hi Luoc,
>
> First, what poor soul is asked to deal with large amounts of XML in
> this day and age? I thought we were past the XML madness, except in
> Maven and Hadoop config files.
>
> XML is much like JSON, only worse. JSON at least has well-defined types
> that can be gleaned from JSON syntax. With XML...? Anything goes,
> because XML is a document mark-up language, not a data structure
> description language.
>
> The classic problem with XML is that if XML is used to describe a
> reasonable data structure (rows and columns), then it can reasonably be
> parsed into rows and columns. If XML represents a document (or a
> relationship graph), then there is no good mapping to rows and columns.
> This was true 20 years ago and it is true today.
>
> So, suppose your XML represents row-like data. Then an XML parser could
> hope for the best and make a good guess at the types and structure. The
> XML parser could work like the new & improved JSON parser (based on
> EVF2) which Vitalii is working on. (I did the original work and Vitalii
> has the thankless task of updating that work to match the current
> code.) That JSON parser is VERY complex as it infers types on the fly.
> Quick: what type is "a" in [{"a": null}, {"a": null}, {"a": []}]? We
> don't know. Only when {"a": [10]} appears can we say, "Oh! All those
> "a" were REPEATED INTs!"
>
> An XML parser could use the same tricks. In fact, it can probably use
> the same code. In JSON, the parser sends events, and the Drill code
> does its type inference magic based on those events. An XML parser can
> emit similar events, and make similar decisions.
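> As a sketch of what that event plumbing might look like on the XML
> side, StAX already gives us a pull-based event stream much like the one
> the JSON reader's type inference consumes. Only the StAX calls below
> are real; the class name and sample data are invented:
>
>     import java.io.StringReader;
>
>     import javax.xml.stream.XMLInputFactory;
>     import javax.xml.stream.XMLStreamConstants;
>     import javax.xml.stream.XMLStreamReader;
>
>     public class XmlEventSketch {
>         public static void main(String[] args) throws Exception {
>             String xml =
>                 "<rows><row><a>10</a></row><row><a/></row></rows>";
>             XMLStreamReader r = XMLInputFactory.newInstance()
>                     .createXMLStreamReader(new StringReader(xml));
>             while (r.hasNext()) {
>                 switch (r.next()) {
>                 case XMLStreamConstants.START_ELEMENT:
>                     // Analogous to JSON "start object/field"; the type
>                     // of <a> is still unknown at this point.
>                     System.out.println("start: " + r.getLocalName());
>                     break;
>                 case XMLStreamConstants.CHARACTERS:
>                     if (!r.isWhiteSpace()) {
>                         // Only here can inference guess INT vs. VARCHAR.
>                         // The empty <a/> never reaches this case at all:
>                         // the same deferred-type problem as a JSON null.
>                         System.out.println("value: " + r.getText().trim());
>                     }
>                     break;
>                 case XMLStreamConstants.END_ELEMENT:
>                     System.out.println("end: " + r.getLocalName());
>                     break;
>                 }
>             }
>         }
>     }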
> As you noted, if we have a DTD, we don't have to do schema inference.
> But we do have to do DTD-to-rows-and-columns inference. Once we do
> that, we can use the provided schema as you suggested. (The JSON reader
> I mentioned already supports a provided schema to add sanity to the
> otherwise crazy JSON type inference process when data is sparse and
> changing.)
>
> In fact, if you convert XML to JSON, then the XML-to-JSON converter has
> to make those same decisions. Hopefully someone has already done that,
> and users would be willing to use that fancy tool to convert their XML
> to JSON before using Drill. (Of course, if they want good performance,
> they should have converted XML to Parquet instead.)
>
> So, rather than have a super-fancy Drill XML reader, maybe find a
> super-fancy XML-to-Parquet converter, use that once, and then let Drill
> quickly query Parquet. The results will be much better than trying to
> parse XML over and over on each query. Just because we *can* do it
> doesn't mean we *should*.
>
> Thanks,
>
> - Paul
>
> On Wed, Apr 6, 2022 at 5:01 AM luoc <l...@apache.org> wrote:
>
> > Hello dear driller,
> >
> > Before starting the topic, I would like to do a simple survey:
> >
> > 1. Did you know that Drill already supports the XML format?
> >
> > 2. If yes, what is the maximum size of the XML files you normally
> > read? 1MB, 10MB, or 100MB?
> >
> > 3. Do you expect that reading XML will be as easy as JSON (schema
> > discovery)?
> >
> > Thank you for responding to those questions.
> >
> > XML is different from JSON, and if we rely solely on Drill to deduce
> > the structure of the data (the *schema*), the code will get very
> > complex and delicate. For example, it must infer array structure and
> > numeric ranges. So a "provided schema" or "TO_JSON" may be good
> > medicine:
> >
> > *Provided Schema*
> >
> > We can add DTD or XML Schema (XSD) support for XML. It can build all
> > the value vectors (writers) before reading the data, resolving the
> > fields, types, and complex nesting.
> >
> > However, a definition file is really a rule validator that allows
> > elements to appear zero or more times. As a result, it is not
> > possible to know whether all the elements exist until the data is
> > read.
> >
> > Therefore, we should avoid creating, before reading the data, a
> > large number of value vectors for elements that may not actually
> > exist.
> >
> > We can build the top-level schema at the initial stage and add new
> > value vectors as needed during the reading phase.
> >
> > *TO_JSON*
> >
> > Read and convert the XML directly to JSON, then use the JSON reader
> > for data resolution.
> >
> > This makes it as easy to query XML data as JSON, but it requires
> > reading the whole XML file into memory.
> >
> > I think both can be done, so I look forward to your spirited
> > discussion.
> >
> > Thanks.
> >
> > - luoc
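To make luoc's "build the top-level schema first, add vectors as they
appear" idea above concrete, here is a rough sketch. All of these names
are invented for illustration; they are not Drill's actual value-vector
or EVF API:

    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical writer: records declared types from the XSD up front,
    // but materializes a column writer only when an element actually
    // appears in the data, since the XSD allows 0..n occurrences.
    class LazyRowWriter {
        private final Map<String, String> declaredTypes = new HashMap<>();
        private final Map<String, ColumnWriter> active = new LinkedHashMap<>();

        // Called once per element declared in the XSD; no vector yet.
        void declare(String name, String type) {
            declaredTypes.put(name, type);
        }

        // Called from the parse loop on first sight of an element.
        // Undeclared elements fall back to a default type.
        ColumnWriter column(String name) {
            return active.computeIfAbsent(name, n -> new ColumnWriter(
                    n, declaredTypes.getOrDefault(n, "VARCHAR")));
        }
    }

    class ColumnWriter {
        final String name;
        final String type;

        ColumnWriter(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

The trade-off is that downstream operators can then see the schema change
mid-read, which is the usual schema-change headache in Drill.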