Hi Mike,
Let me answer your question with some queries:
>>> select * from dfs.test.`record.json`;
+----------------------------------------------------------------------------------+
| record
|
+----------------------------------------------------------------------------------+
|
[{"a":{"a1":5.0,"a2":6.0},"b":{}},{"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}},{"a":{"a1":7.0,"a2":8.0},"b":{}},{"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}}]
|
+----------------------------------------------------------------------------------+
Now... I can flatten that like this:
>>> select flatten(record) AS data from dfs.test.`record.json`;
+----------------------------------------------+
| data |
+----------------------------------------------+
| {"a":{"a1":5.0,"a2":6.0},"b":{}} |
| {"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}} |
| {"a":{"a1":7.0,"a2":8.0},"b":{}} |
| {"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}} |
+----------------------------------------------+
4 rows selected (0.298 seconds)
You asked about filtering. For this, I broke it up into a subquery, but
here's how I did that:
>>> SELECT data['a'], data['b']
2..semicolon> FROM (select flatten(record) AS data from dfs.test.`record.json`)
3..semicolon> WHERE data['b']['b1'] > 60.0;
+--------+---------------------------------+
| EXPR$0 | EXPR$1 |
+--------+---------------------------------+
| {} | {"b1":77.0,"b2":88.0,"b3":99.0} |
+--------+---------------------------------+
1 row selected (0.379 seconds)
I did all this without the union data type.
Does this make sense?
Best,
-- C
> On Sep 13, 2023, at 11:08 AM, Mike Beckerle <[email protected]> wrote:
>
> I'm thinking whether a first prototype of DFDL integration to Drill should
> just use JSON.
>
> But please consider this JSON:
>
> { "record": [
> { "a": { "a1":5, "a2":6 } },
> { "b": { "b1":55, "b2":66, "b3":77 } }
> { "a": { "a1":7, "a2":8 } },
> { "b": { "b1":77, "b2":88, "b3":99 } }
> ] }
>
> It corresponds to this text data file, parsed using Daffodil:
>
> 105062556677107082778899
>
> The file is a stream of records. The first byte is a tag value 1 for type
> 'a' records, and 2 for type 'b' records.
> The 'a' records are 2 fixed length fields, each 2 bytes long, named a1 and
> a2. They are integers.
> The 'b' records are 3 fixed length fields, each 2 bytes long, named b1, b2,
> and b3. They are integers.
> This kind of format is very common, even textualized like this (from COBOL
> programs for example)
>
> Can Drill query the JSON above to get (b1, b2) where b1 > 10 ?
> (and ... does this require the experimental Union feature?)
>
> b1, b2
> ---------
> (55, 66)
> (77, 88)
>
> I ask because in an XML Schema or DFDL schema choices with dozens of
> 'branches' are very common.
> Ex: schema for the above data:
>
> <element name="record" maxOccurs="unbounded">
> <complexType>
> <choice><!-- there are sub-record types, a, b,... there could be many
> dozens of these -->
> <element name="a">
> <complexType>
> <sequence>
> ... many child elements let's say named a1, a2, ...
> </sequence>
> </complexType>
> </element>
> <element name="b">
> <complexType>
> <sequence>
> ... many child elements let's say named b1, b2, b3
> ...
> </sequence>
> </complexType>
> </element>
> </choice>
> </complexType>
> </element>
>
> To me XSD choice naturally requires a Union feature of some sort.
> If that's expermental still in Drill ... what to do?
>
> On Sun, Aug 6, 2023 at 10:19 AM Charles S. Givre <[email protected]>
> wrote:
>
>> @mbeckerle <https://github.com/mbeckerle>
>> You've encountered another challenge that exists in Drill reading data
>> without a schema.
>> Let me explain a bit about this and I'm going to use the JSON reader as an
>> example. First Drill requires data to be homogeneous. Drill does have a
>> Union vector type which allows heterogeneous data however this is a bit
>> experimental and I wouldn't recommend using it. Also, it really just shifts
>> schema inconsistencies to the user.
>>
>> For instance, let's say you have a column consisting of strings and
>> floats. What happens if you try to do something like this:
>>
>> SELECT sum(mixed_col)-- orSELECT.... ORDER BY mixed_col
>>
>> Remembering that Drill is distributed and if you have a column with the
>> same name and you try to do these operations, they will fail.
>>
>> Let's say we have data like this:
>>
>> [
>> {
>> 'col1': 'Hi there',
>> 'col2': 5.0
>> },
>> {
>> 'col1':True,
>> 'col2': 4,
>> 'col3': 'foo'
>> }
>> ]
>>
>> In older versions of Drill, this kind of data, this would throw all kinds
>> of SchemaChangeExceptions. However, in recent versions of Drill, @jnturton
>> <https://github.com/jnturton> submitted apache#2638
>> <https://github.com/apache/drill/pull/2638> which overhauled implicit
>> casting. What this meant for users is that col2 in the above would be
>> automatically cast to a FLOAT and col1 would be automatically cast to a
>> VARCHAR.
>>
>> However, when reading data the story is a little different. What we did
>> for the JSON reader was have several read modes. The least tolerant
>> attempts to infer all data types. This seems like a great idea in practice,
>> however when you start actually using Drill with real data, you start
>> seeing the issues with this approach. The JSON reader has a few
>> configuration options that increase its tolerance for bad data. The next
>> level is readAllNumbersAsDouble which... as the name implies, reads all
>> numeric data as Doubles and does not attempt to infer ints vs floats. The
>> next options is allTextMode which reads all fields as VARCHAR. This
>> should be used when the data is so inconsistent that it cannot be read with
>> either mode. These modes can be set globally, at the plugin level or at
>> query time.
>>
>> For the XML reader, I didn't add type inference because I figured the data
>> would be quite messy, however it wouldn't be that hard to add basically the
>> same levels as the JSON reader.
>>
>> This fundamental issue exists in all the readers that read data without a
>> schema. My rationale for working on the XSD reader is that this will enable
>> us to accurately read XML data with all the correct data types.
>>
>> —
>> Reply to this email directly, view it on GitHub
>> <https://github.com/cgivre/drill/pull/6#issuecomment-1666875922>, or
>> unsubscribe
>> <https://github.com/notifications/unsubscribe-auth/AALUDAZZ6T6Z44AW44IKD2LXT6RVNANCNFSM6AAAAAA26ZZVQ4>
>> .
>> You are receiving this because you were mentioned.Message ID:
>> <cgivre/drill/pull/6/[email protected]>
>>