Question on Representing DFDL/XSD choice data for Drill (Unions required?)

Mike Beckerle Wed, 13 Sep 2023 08:09:31 -0700

I'm thinking whether a first prototype of DFDL integration to Drill should
just use JSON.


But please consider this JSON:

{ "record": [
    { "a": { "a1":5, "a2":6 } },
    { "b": { "b1":55, "b2":66, "b3":77 } }
    { "a": { "a1":7, "a2":8 } },
    { "b": { "b1":77, "b2":88, "b3":99 } }
  ] }

It corresponds to this text data file, parsed using Daffodil:

    105062556677107082778899

The file is a stream of records. The first byte is a tag value 1 for type
'a' records, and 2 for type 'b' records.
The 'a' records are 2 fixed length fields, each 2 bytes long, named a1 and
a2. They are integers.
The 'b' records are 3 fixed length fields, each 2 bytes long, named b1, b2,
and b3. They are integers.
This kind of format is very common, even textualized like this (from COBOL
programs for example)

Can Drill query the JSON above to get (b1, b2) where b1 > 10 ?
(and ... does this require the experimental Union feature?)

b1, b2
---------
(55, 66)
(77, 88)

I ask because in an XML Schema or DFDL schema choices with dozens of
'branches' are very common.
Ex: schema for the above data:

<element name="record" maxOccurs="unbounded">
   <complexType>
      <choice><!-- there are sub-record types, a, b,... there could be many
dozens of these -->
          <element name="a">
               <complexType>
                    <sequence>
                        ... many child elements let's say named a1, a2, ...
                     </sequence>
               </complexType>
          </element>
          <element name="b">
               <complexType>
                    <sequence>
                        ... many child elements let's say named b1, b2, b3
...
                     </sequence>
               </complexType>
          </element>
    </choice>
  </complexType>
</element>

To me XSD choice naturally requires a Union feature of some sort.
If that's expermental still in Drill ... what to do?

On Sun, Aug 6, 2023 at 10:19 AM Charles S. Givre <notificati...@github.com>
wrote:

> @mbeckerle <https://github.com/mbeckerle>
> You've encountered another challenge that exists in Drill reading data
> without a schema.
> Let me explain a bit about this and I'm going to use the JSON reader as an
> example. First Drill requires data to be homogeneous. Drill does have a
> Union vector type which allows heterogeneous data however this is a bit
> experimental and I wouldn't recommend using it. Also, it really just shifts
> schema inconsistencies to the user.
>
> For instance, let's say you have a column consisting of strings and
> floats. What happens if you try to do something like this:
>
> SELECT sum(mixed_col)-- orSELECT.... ORDER BY mixed_col
>
> Remembering that Drill is distributed and if you have a column with the
> same name and you try to do these operations, they will fail.
>
> Let's say we have data like this:
>
> [
>   {
>      'col1': 'Hi there',
>      'col2': 5.0
>   },
>   {
>      'col1':True,
>      'col2': 4,
>      'col3': 'foo'
>   }
> ]
>
> In older versions of Drill, this kind of data, this would throw all kinds
> of SchemaChangeExceptions. However, in recent versions of Drill, @jnturton
> <https://github.com/jnturton> submitted apache#2638
> <https://github.com/apache/drill/pull/2638> which overhauled implicit
> casting. What this meant for users is that col2 in the above would be
> automatically cast to a FLOAT and col1 would be automatically cast to a
> VARCHAR.
>
> However, when reading data the story is a little different. What we did
> for the JSON reader was have several read modes. The least tolerant
> attempts to infer all data types. This seems like a great idea in practice,
> however when you start actually using Drill with real data, you start
> seeing the issues with this approach. The JSON reader has a few
> configuration options that increase its tolerance for bad data. The next
> level is readAllNumbersAsDouble which... as the name implies, reads all
> numeric data as Doubles and does not attempt to infer ints vs floats. The
> next options is allTextMode which reads all fields as VARCHAR. This
> should be used when the data is so inconsistent that it cannot be read with
> either mode. These modes can be set globally, at the plugin level or at
> query time.
>
> For the XML reader, I didn't add type inference because I figured the data
> would be quite messy, however it wouldn't be that hard to add basically the
> same levels as the JSON reader.
>
> This fundamental issue exists in all the readers that read data without a
> schema. My rationale for working on the XSD reader is that this will enable
> us to accurately read XML data with all the correct data types.
>
> —
> Reply to this email directly, view it on GitHub
> <https://github.com/cgivre/drill/pull/6#issuecomment-1666875922>, or
> unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AALUDAZZ6T6Z44AW44IKD2LXT6RVNANCNFSM6AAAAAA26ZZVQ4>
> .
> You are receiving this because you were mentioned.Message ID:
> <cgivre/drill/pull/6/c1666875...@github.com>
>

Question on Representing DFDL/XSD choice data for Drill (Unions required?)

Reply via email to