Nevermind. I figured this out. Was due to 'properties' being a reserved keyword. I created a PR to fix the JSON doc on the drill site.
On Sat, Oct 7, 2023 at 1:46 PM Mike Beckerle <mbecke...@apache.org> wrote: > Ok, after weeks of delay.... > > That helps a great deal. You flatten the array of maps into a table of > maps. > > I am confused still about when I must do square brackets versus dot > notation: data['a'] vs. data.a > The JSON documentation for Drill uses dot notation to reach into fields of > a map. > > Ex: from the JSON doc: > > { > "type": "FeatureCollection", > "features": [ > { > "type": "Feature", > "properties": > { > "MAPBLKLOT": "0001001", > "BLKLOT": "0001001", > "BLOCK_NUM": "0001", > "LOT_NUM": "001", > .... > > The query uses SELECT features[0].properties.MAPBLKLOT, FROM ... > Which is using dot notation where in your queries on my JSON you did not > use dot notation. > > I tried revising the queries you wrote using the dot notation, and it was > rejected. "no table named 'data'", but I'm not sure why. > > Ex: > > This works: (your original working query) > > SELECT data['a'], data['b'] FROM (select flatten(record) AS data from > dfs.`/tmp/record.json`) WHERE data['b']['b1'] > 60.0; > > But this fails: > > SELECT data.a AS a, data.b AS b FROM (select flatten(record) AS data from > dfs.`/tmp/record.json`) WHERE data.b.b1 > 60.0; > Error: VALIDATION ERROR: From line 1, column 105 to line 1, column 108: > Table 'data' not found > > But your sub-select defines 'data' as, I would assume, a table. > > Can you help me clarify this? > > [Error Id: 90c03b40-4f00-43b5-9de9-598102797b2f ] (state=,code=0) > apache drill> > > > On Mon, Sep 18, 2023 at 11:17 PM Charles Givre <cgi...@gmail.com> wrote: > >> Hi Mike, >> Let me answer your question with some queries: >> >> >>> select * from dfs.test.`record.json`; >> >> +----------------------------------------------------------------------------------+ >> | record >> | >> >> +----------------------------------------------------------------------------------+ >> | >> [{"a":{"a1":5.0,"a2":6.0},"b":{}},{"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}},{"a":{"a1":7.0,"a2":8.0},"b":{}},{"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}}] >> | >> >> +----------------------------------------------------------------------------------+ >> >> Now... I can flatten that like this: >> >> >>> select flatten(record) AS data from dfs.test.`record.json`; >> +----------------------------------------------+ >> | data | >> +----------------------------------------------+ >> | {"a":{"a1":5.0,"a2":6.0},"b":{}} | >> | {"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}} | >> | {"a":{"a1":7.0,"a2":8.0},"b":{}} | >> | {"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}} | >> +----------------------------------------------+ >> 4 rows selected (0.298 seconds) >> >> You asked about filtering. For this, I broke it up into a subquery, but >> here's how I did that: >> >> >>> SELECT data['a'], data['b'] >> 2..semicolon> FROM (select flatten(record) AS data from >> dfs.test.`record.json`) >> 3..semicolon> WHERE data['b']['b1'] > 60.0; >> +--------+---------------------------------+ >> | EXPR$0 | EXPR$1 | >> +--------+---------------------------------+ >> | {} | {"b1":77.0,"b2":88.0,"b3":99.0} | >> +--------+---------------------------------+ >> 1 row selected (0.379 seconds) >> >> I did all this without the union data type. >> >> Does this make sense? >> Best, >> -- C >> >> >> On Sep 13, 2023, at 11:08 AM, Mike Beckerle <mbecke...@apache.org> wrote: >> >> I'm thinking whether a first prototype of DFDL integration to Drill should >> just use JSON. >> >> But please consider this JSON: >> >> { "record": [ >> { "a": { "a1":5, "a2":6 } }, >> { "b": { "b1":55, "b2":66, "b3":77 } } >> { "a": { "a1":7, "a2":8 } }, >> { "b": { "b1":77, "b2":88, "b3":99 } } >> ] } >> >> It corresponds to this text data file, parsed using Daffodil: >> >> 105062556677107082778899 >> >> The file is a stream of records. The first byte is a tag value 1 for type >> 'a' records, and 2 for type 'b' records. >> The 'a' records are 2 fixed length fields, each 2 bytes long, named a1 and >> a2. They are integers. >> The 'b' records are 3 fixed length fields, each 2 bytes long, named b1, >> b2, >> and b3. They are integers. >> This kind of format is very common, even textualized like this (from COBOL >> programs for example) >> >> Can Drill query the JSON above to get (b1, b2) where b1 > 10 ? >> (and ... does this require the experimental Union feature?) >> >> b1, b2 >> --------- >> (55, 66) >> (77, 88) >> >> I ask because in an XML Schema or DFDL schema choices with dozens of >> 'branches' are very common. >> Ex: schema for the above data: >> >> <element name="record" maxOccurs="unbounded"> >> <complexType> >> <choice><!-- there are sub-record types, a, b,... there could be many >> dozens of these --> >> <element name="a"> >> <complexType> >> <sequence> >> ... many child elements let's say named a1, a2, ... >> </sequence> >> </complexType> >> </element> >> <element name="b"> >> <complexType> >> <sequence> >> ... many child elements let's say named b1, b2, b3 >> ... >> </sequence> >> </complexType> >> </element> >> </choice> >> </complexType> >> </element> >> >> To me XSD choice naturally requires a Union feature of some sort. >> If that's expermental still in Drill ... what to do? >> >> On Sun, Aug 6, 2023 at 10:19 AM Charles S. Givre < >> notificati...@github.com> >> wrote: >> >> @mbeckerle <https://github.com/mbeckerle> >> You've encountered another challenge that exists in Drill reading data >> without a schema. >> Let me explain a bit about this and I'm going to use the JSON reader as an >> example. First Drill requires data to be homogeneous. Drill does have a >> Union vector type which allows heterogeneous data however this is a bit >> experimental and I wouldn't recommend using it. Also, it really just >> shifts >> schema inconsistencies to the user. >> >> For instance, let's say you have a column consisting of strings and >> floats. What happens if you try to do something like this: >> >> SELECT sum(mixed_col)-- orSELECT.... ORDER BY mixed_col >> >> Remembering that Drill is distributed and if you have a column with the >> same name and you try to do these operations, they will fail. >> >> Let's say we have data like this: >> >> [ >> { >> 'col1': 'Hi there', >> 'col2': 5.0 >> }, >> { >> 'col1':True, >> 'col2': 4, >> 'col3': 'foo' >> } >> ] >> >> In older versions of Drill, this kind of data, this would throw all kinds >> of SchemaChangeExceptions. However, in recent versions of Drill, @jnturton >> <https://github.com/jnturton> submitted apache#2638 >> <https://github.com/apache/drill/pull/2638> which overhauled implicit >> casting. What this meant for users is that col2 in the above would be >> automatically cast to a FLOAT and col1 would be automatically cast to a >> VARCHAR. >> >> However, when reading data the story is a little different. What we did >> for the JSON reader was have several read modes. The least tolerant >> attempts to infer all data types. This seems like a great idea in >> practice, >> however when you start actually using Drill with real data, you start >> seeing the issues with this approach. The JSON reader has a few >> configuration options that increase its tolerance for bad data. The next >> level is readAllNumbersAsDouble which... as the name implies, reads all >> numeric data as Doubles and does not attempt to infer ints vs floats. The >> next options is allTextMode which reads all fields as VARCHAR. This >> should be used when the data is so inconsistent that it cannot be read >> with >> either mode. These modes can be set globally, at the plugin level or at >> query time. >> >> For the XML reader, I didn't add type inference because I figured the data >> would be quite messy, however it wouldn't be that hard to add basically >> the >> same levels as the JSON reader. >> >> This fundamental issue exists in all the readers that read data without a >> schema. My rationale for working on the XSD reader is that this will >> enable >> us to accurately read XML data with all the correct data types. >> >> — >> Reply to this email directly, view it on GitHub >> <https://github.com/cgivre/drill/pull/6#issuecomment-1666875922>, or >> unsubscribe >> < >> https://github.com/notifications/unsubscribe-auth/AALUDAZZ6T6Z44AW44IKD2LXT6RVNANCNFSM6AAAAAA26ZZVQ4 >> > >> . >> You are receiving this because you were mentioned.Message ID: >> <cgivre/drill/pull/6/c1666875...@github.com> >> >> >>