Hi Mike,

Looks like you are wrestling with two separate issues. The first is how to
read the encoded data that you showed. In Drill, each data format generally
needs its own reader. Drill's reader operator provides all the plumbing
needed to handle multiple format readers, pack data into vectors, handle
projection and all the rest. But, to actually parse a stream of bytes into
data values, Drill needs a reader (AKA format plugin).

If you were to write such a reader for your encoded data format, then each
parsed data value would be written to a vector, with the vector created on
the fly as needed (and null values back-filled). All this is handled
automagically by the enhanced vector framework (EVF).

Now, you *could* do something that reads the encoded records, emits JSON,
and lets the JSON parser parse it again. I suspect you'd find that doing so
is a) much slower, and b) more work than just creating the required reader.
Either way, you'll need code for each format that Daffodil supports but
which Drill does not yet support.

One quick answer: avoid the UNION type if you can. It works, but barely. It
is slow, inefficient, not supported in many operators, and unknown to
client libraries. Since Daffodil is all about schemas, use the schema to
figure out the correct type. Only if Daffodil itself allows UNION types
would it make sense to map them to Drill's UNION type and deal with the
many limitations.

Finally, on to your query. Let's work from the JSON form, and let's assume
that if you had a COBOL parser, it would end up with the same vector
structure as the JSON parser produces: different ways to parse data, but
the same internal data structures. To do the query you want, you'd have to:

* Flatten the content of `record` in each record. That is, the JSON reader
would read the above as a set of records, each of which has one field
called `record`, which is an array of maps (i.e., a repeated map).
Flattening produces a stream of maps.
* Then, you'd project `a` and `b` to the top level, giving you two
top-level fields called `a` and `b`. Alternatively, project `a.a1`, `a.a2`,
`b.b1`, and `b.b2` to the top level. The values for the "missing" map will
be SQL NULL.
* Finally, state your query as usual: SELECT b1, b2 FROM ...nested unpack
queries here... WHERE b1 > 10. (See the sketch just after this list.)
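
Putting those steps together, here is a rough sketch. The file path is a
placeholder, and the `t` alias and `rec` column name are just labels I
made up:

SELECT t.rec.`b`.`b1` AS b1, t.rec.`b`.`b2` AS b2
FROM (SELECT FLATTEN(`record`) AS rec
      FROM dfs.`/path/to/data.json`) t
WHERE t.rec.`b`.`b1` > 10

Rows that came from an `a` record have a NULL `b1`, so the WHERE clause
drops them.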

As noted in a previous response, SQL (and Drill) doesn't have the
expressiveness to do complex queries while leaving data in its original
structured form. That said, this query might actually work:

SELECT `record` FROM <your file> WHERE `record`.`b`.`b1` > 10

I *think* this will return a set of records, each with a `record` array, a
`b` map, and `b1` and `b2` members that satisfy the query. This works
because the '`record`.`b`.`b1` > 10' expression evaluates to NULL (hence
not true) when b1 (or b) is NULL, so those rows are filtered out. Still,
for the sake of clients, you'd want to flatten the results to the top
level.

Charles is really the query expert; he might have other tricks that he's
found that work better.

Drill is smart enough to push projection down into the reader: that's one
of the fancy bits that EVF handles. EVF will notice that we only want
`record`.`b`.`b1` and `b2`, and won't project the map `a` or any of its
contents. If the reader provides those values anyway, they will simply go
into the bit bucket. (Caveat: there are some limitations on this feature;
I have some long-delayed fixes that you might need.)
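
If you want to see what actually reaches the reader, EXPLAIN should show
it: the scan operator's column list in the plan output is what got pushed
down (again, the file path is a placeholder):

EXPLAIN PLAN FOR
SELECT t.`record`.`b`.`b1` AS b1
FROM dfs.`/path/to/data.json` t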

I hope this helps.

Thanks,

- Paul

On Wed, Sep 13, 2023 at 8:09 AM Mike Beckerle <mbecke...@apache.org> wrote:

> I'm thinking whether a first prototype of DFDL integration to Drill should
> just use JSON.
>
> But please consider this JSON:
>
> { "record": [
>     { "a": { "a1":5, "a2":6 } },
>     { "b": { "b1":55, "b2":66, "b3":77 } }
>     { "a": { "a1":7, "a2":8 } },
>     { "b": { "b1":77, "b2":88, "b3":99 } }
>   ] }
>
> It corresponds to this text data file, parsed using Daffodil:
>
>     105062556677107082778899
>
> The file is a stream of records. The first byte is a tag value 1 for type
> 'a' records, and 2 for type 'b' records.
> The 'a' records have 2 fixed-length fields, each 2 bytes long, named a1
> and a2. They are integers.
> The 'b' records have 3 fixed-length fields, each 2 bytes long, named b1,
> b2, and b3. They are integers.
> This kind of format is very common, even textualized like this (from
> COBOL programs, for example).
>
> Can Drill query the JSON above to get (b1, b2) where b1 > 10 ?
> (and ... does this require the experimental Union feature?)
>
> b1, b2
> ---------
> (55, 66)
> (77, 88)
>
> I ask because in an XML Schema or DFDL schema choices with dozens of
> 'branches' are very common.
> Ex: schema for the above data:
>
> <element name="record" maxOccurs="unbounded">
>    <complexType>
>       <choice><!-- there are sub-record types, a, b,... there could be many
> dozens of these -->
>           <element name="a">
>                <complexType>
>                     <sequence>
>                         ... many child elements let's say named a1, a2, ...
>                      </sequence>
>                </complexType>
>           </element>
>           <element name="b">
>                <complexType>
>                     <sequence>
>                         ... many child elements let's say named b1, b2, b3
> ...
>                      </sequence>
>                </complexType>
>           </element>
>     </choice>
>   </complexType>
> </element>
>
> To me XSD choice naturally requires a Union feature of some sort.
> If that's still experimental in Drill ... what to do?
>
> On Sun, Aug 6, 2023 at 10:19 AM Charles S. Givre
> <notificati...@github.com> wrote:
>
> > @mbeckerle <https://github.com/mbeckerle>
> > You've encountered another challenge that exists in Drill reading data
> > without a schema.
> > Let me explain a bit about this, and I'm going to use the JSON reader
> > as an example. First, Drill requires data to be homogeneous. Drill does
> > have a Union vector type which allows heterogeneous data; however, it
> > is a bit experimental and I wouldn't recommend using it. Also, it
> > really just shifts schema inconsistencies to the user.
> >
> > For instance, let's say you have a column consisting of strings and
> > floats. What happens if you try to do something like this:
> >
> > SELECT sum(mixed_col)
> > -- or
> > SELECT .... ORDER BY mixed_col
> >
> > Remember that Drill is distributed: if you have a column with the same
> > name but mixed types and you try these operations, they will fail.
> >
> > Let's say we have data like this:
> >
> > [
> >   {
> >      "col1": "Hi there",
> >      "col2": 5.0
> >   },
> >   {
> >      "col1": true,
> >      "col2": 4,
> >      "col3": "foo"
> >   }
> > ]
> >
> > In older versions of Drill, this kind of data would throw all kinds of
> > SchemaChangeExceptions. However, in recent versions of Drill, @jnturton
> > <https://github.com/jnturton> submitted apache#2638
> > <https://github.com/apache/drill/pull/2638>, which overhauled implicit
> > casting. What this meant for users is that col2 in the above would be
> > automatically cast to a FLOAT and col1 would be automatically cast to a
> > VARCHAR.
> >
> > However, when reading data the story is a little different. What we did
> > for the JSON reader was to provide several read modes. The least
> > tolerant attempts to infer all data types. This seems like a great idea
> > in theory; however, when you start actually using Drill with real data,
> > you start seeing the issues with this approach. The JSON reader has a
> > few configuration options that increase its tolerance for bad data. The
> > next level is readAllNumbersAsDouble which... as the name implies,
> > reads all numeric data as Doubles and does not attempt to infer ints vs
> > floats. The next option is allTextMode, which reads all fields as
> > VARCHAR. This should be used when the data is so inconsistent that it
> > cannot be read in either of the other modes. These modes can be set
> > globally, at the plugin level, or at query time.
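> >
> > For example, at query time these become session options; something like
> > this should do it:
> >
> > ALTER SESSION SET `store.json.read_numbers_as_double` = true;
> > -- or
> > ALTER SESSION SET `store.json.all_text_mode` = true;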
> >
> > For the XML reader, I didn't add type inference because I figured the
> > data would be quite messy; however, it wouldn't be that hard to add
> > basically the same levels as the JSON reader.
> >
> > This fundamental issue exists in all the readers that read data without
> > a schema. My rationale for working on the XSD reader is that this will
> > enable us to accurately read XML data with all the correct data types.
> >
>
