Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?
Or this single query:

with pcapDoc as (select PCAP from `infoset.json`),
  packets as (select flatten(pcapDoc.PCAP.Packet) as packet from pcapDoc),
  ipv4Headers as (select packets.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header as hdr from packets),
  ipsrcs as (select ipv4headers.hdr.IPSrc.value as ip from ipv4Headers),
  ipdests as (select ipv4headers.hdr.IPDest.value as ip from ipv4Headers),
  ips as (select ip from ipsrcs union select ip from ipdests)
select * from ips;

On Sun, Oct 8, 2023 at 2:47 PM Mike Beckerle wrote:

> Ok. It took some time, but putting the infoset.json attachment into /tmp,
> this SQL pulls out all the IP addresses from the PCAP data:
>
> use dfs.tmp;
>
> create or replace view pcapDoc as select PCAP from `infoset.json`;
>
> create or replace view packets as
>   select flatten(pcapDoc.PCAP.Packet) as packet from pcapDoc;
>
> create or replace view ipv4Headers as
>   select packets.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header as hdr
>   from packets;
>
> create or replace view ipsrcs as
>   select ipv4headers.hdr.IPSrc.value as ip from ipv4Headers;
>
> create or replace view ipdests as
>   select ipv4headers.hdr.IPDest.value as ip from ipv4Headers;
>
> create or replace view ips as
>   select ip from ipsrcs union select ip from ipdests;
>
> select * from ips;
>
> On Wed, Sep 13, 2023 at 9:43 AM Mike Beckerle wrote:
>
>> ... sound of crickets on a summer night.
>>
>> It would really help me if I could get a response to this inquiry, to
>> help me better understand Drill.
>>
>> I do realize people are busy, and also this was originally sent Aug 25,
>> which is the wrong time of year to get a timely response to anything.
>> Hence this re-up of the message.
>>
>> On Fri, Aug 25, 2023 at 7:39 PM Mike Beckerle wrote:
>>
>>> Below is a small JSON output from Daffodil and below that is the same
>>> Infoset output as XML.
>>> (They're inline in this message, but I also attached them as files.)
>>>
>>> This is just a parse of a small PCAP file with a few ICMP packets in it.
>>> It's an example DFDL schema used to illustrate binary file parsing.
>>>
>>> (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
>>> component schema: https://github.com/DFDLSchemas/ethernetIP)
>>>
>>> My theory is that Drill queries against these should be identical to
>>> obtain the same output row contents. That is, since this data has the
>>> same schema, whether it is JSON or XML shouldn't affect how you query
>>> it. To do that, the XML reader will need the XML schema (or some
>>> hand-provided metadata) so it knows what is an array. (Specifically,
>>> PCAP.Packet is the array.)
>>>
>>> E.g., if you wanted to get the IPSrc and IPDest fields in a table from
>>> all ICMP packets in this file, that query should be the same for the
>>> JSON and the XML data.
>>>
>>> First question: does that make sense? I want to make sure I'm
>>> understanding this right.
>>>
>>> Second question, since I don't really understand Drill SQL yet:
>>> what is a query that would pluck the IPSrc.value and IPDest.value from
>>> this data and make a row of each pair of those?
>>>
>>> The top level is a map with a single element named PCAP.
>>> The "table" is PCAP.Packet, which is an array (of maps).
>>> Within each array item's map, the fields of interest are under
>>> LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header
>>> (so maybe IPv4Header is the table?).
>>> The two fields within there are IPSrc.value (AS src) and IPDest.value
>>> (AS dest).
>>>
>>> I'm lost on how to tell the query that the table is the array
>>> PCAP.Packet, or maybe the IPv4Header within those?
>>>
>>> Maybe this is easy, but I'm just not grokking it yet, so I could use
>>> some help here.
>>>
>>> Thanks in advance.
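For the question posed in the thread — one row per packet with the (src, dest) pair — the view chain above can be collapsed into a single statement. This is an untested sketch assuming the same `infoset.json` in dfs.tmp as in the working queries above; the aliases `f` and `t` are what let the dotted paths resolve:

```sql
use dfs.tmp;

select t.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.IPSrc.value  as src,
       t.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.IPDest.value as dest
from (select flatten(f.PCAP.Packet) as packet from `infoset.json` f) t;
```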
Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?
Ok. It took some time, but putting the infoset.json attachment into /tmp,
this SQL pulls out all the IP addresses from the PCAP data:

use dfs.tmp;

create or replace view pcapDoc as select PCAP from `infoset.json`;

create or replace view packets as
  select flatten(pcapDoc.PCAP.Packet) as packet from pcapDoc;

create or replace view ipv4Headers as
  select packets.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header as hdr
  from packets;

create or replace view ipsrcs as
  select ipv4headers.hdr.IPSrc.value as ip from ipv4Headers;

create or replace view ipdests as
  select ipv4headers.hdr.IPDest.value as ip from ipv4Headers;

create or replace view ips as
  select ip from ipsrcs union select ip from ipdests;

select * from ips;

On Wed, Sep 13, 2023 at 9:43 AM Mike Beckerle wrote:

> ... sound of crickets on a summer night.
>
> It would really help me if I could get a response to this inquiry, to
> help me better understand Drill.
>
> I do realize people are busy, and also this was originally sent Aug 25,
> which is the wrong time of year to get a timely response to anything.
> Hence this re-up of the message.
>
> On Fri, Aug 25, 2023 at 7:39 PM Mike Beckerle wrote:
>
>> Below is a small JSON output from Daffodil and below that is the same
>> Infoset output as XML.
>> (They're inline in this message, but I also attached them as files.)
>>
>> This is just a parse of a small PCAP file with a few ICMP packets in it.
>> It's an example DFDL schema used to illustrate binary file parsing.
>>
>> (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
>> component schema: https://github.com/DFDLSchemas/ethernetIP)
>>
>> My theory is that Drill queries against these should be identical to
>> obtain the same output row contents. That is, since this data has the
>> same schema, whether it is JSON or XML shouldn't affect how you query
>> it. To do that, the XML reader will need the XML schema (or some
>> hand-provided metadata) so it knows what is an array.
>> (Specifically, PCAP.Packet is the array.)
>>
>> E.g., if you wanted to get the IPSrc and IPDest fields in a table from
>> all ICMP packets in this file, that query should be the same for the
>> JSON and the XML data.
>>
>> First question: does that make sense? I want to make sure I'm
>> understanding this right.
>>
>> Second question, since I don't really understand Drill SQL yet:
>> what is a query that would pluck the IPSrc.value and IPDest.value from
>> this data and make a row of each pair of those?
>>
>> The top level is a map with a single element named PCAP.
>> The "table" is PCAP.Packet, which is an array (of maps).
>> Within each array item's map, the fields of interest are under
>> LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header
>> (so maybe IPv4Header is the table?).
>> The two fields within there are IPSrc.value (AS src) and IPDest.value
>> (AS dest).
>>
>> I'm lost on how to tell the query that the table is the array
>> PCAP.Packet, or maybe the IPv4Header within those?
>>
>> Maybe this is easy, but I'm just not grokking it yet, so I could use
>> some help here.
>>
>> Thanks in advance.
>> {
>>   "PCAP": {
>>     "PCAPHeader": {
>>       "MagicNumber": "D4C3B2A1",
>>       "Version": { "Major": "2", "Minor": "4" },
>>       "Zone": "0",
>>       "SigFigs": "0",
>>       "SnapLen": "65535",
>>       "Network": "1"
>>     },
>>     "Packet": [
>>       {
>>         "PacketHeader": {
>>           "Seconds": "1371631556",
>>           "USeconds": "838904",
>>           "InclLen": "74",
>>           "OrigLen": "74"
>>         },
>>         "LinkLayer": {
>>           "Ethernet": {
>>             "MACDest": "005056E01449",
>>             "MACSrc": "000C29340BDE",
>>             "Ethertype": "2048",
>>             "NetworkLayer": {
>>               "IPv4": {
>>                 "IPv4Header": {
>>                   "Version": "4",
>>                   "IHL": "5",
>>                   "DSCP": "0",
>>                   "ECN": "0",
>>                   "Length": "60",
>>                   "Identification": "55107",
>>                   "Flags": "0",
>>                   "FragmentOffset": "0",
>>                   "TTL": "128",
>>                   "Protocol": "1",
>>                   "Checksum": "11123",
>>                   "IPSrc": { "value": "192.168.158.139" },
>>                   "IPDest": { "value": "174.137.42.77" },
>>                   "ComputedChecksum": "11123"
>>                 },
>>                 "Protocol": "1",
>>                 "ICMPv4": {
>>                   "Type": "8",
>>                   "Code": "0",
>>                   "Checksum": "10844",
>>                   "EchoRequest": {
>>                     "Identifier": "512",
>>                     "SequenceNumber": "8448",
>>                     "Payload": "6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
>>                   }
>>                 }
>>               }
>>             }
>>           }
>>         }
>>       },
>>       {
>>         "PacketHeader": {
>>           "Seconds": "1371631557",
>>           "USeconds": "55699",
>>           "InclLen": "74",
>>           "OrigLen": "74"
>>         },
>>         "LinkLayer": {
>>           "Ethernet": {
>>             "MACDest": "000C29340BDE",
>>             "MACSrc": "005056E01449",
>>             "Ethertype": "2048",
>>             "NetworkLayer": {
>>               "IPv4": {
>>                 "IPv4Header": {
>>                   "Version": "4",
>>                   "IHL": "5",
>>                   "DSCP": "0",
>>                   "ECN": "0",
>>                   "Length": "60",
>>                   "Identification": "30433",
>>                   "Flags": "0",
>>                   "FragmentOffset": "0",
>>                   "TTL": "128",
>>                   "Protocol": "1",
>>                   "Checksum": "35797",
>>                   "IPSrc": { "value": "174.137.42.77" },
>>                   "IPDest": { "value": "192.168.158.139" },
>>                   "ComputedChecksum": "35797"
>>                 },
>>                 "Protocol": "1",
>>                 "ICMPv4": {
>>                   "Type": "0",
>>                   "Code": "0",
Re: Question on Representing DFDL/XSD choice data for Drill (Unions required?)
Nevermind. I figured this out. It was due to 'properties' being a reserved
keyword. I created a PR to fix the JSON doc on the Drill site.

On Sat, Oct 7, 2023 at 1:46 PM Mike Beckerle wrote:

> Ok, after weeks of delay...
>
> That helps a great deal. You flatten the array of maps into a table of
> maps.
>
> I am still confused about when I must use square brackets versus dot
> notation: data['a'] vs. data.a.
> The JSON documentation for Drill uses dot notation to reach into fields
> of a map.
>
> Ex: from the JSON doc:
>
> {
>   "type": "FeatureCollection",
>   "features": [
>     {
>       "type": "Feature",
>       "properties": {
>         "MAPBLKLOT": "0001001",
>         "BLKLOT": "0001001",
>         "BLOCK_NUM": "0001",
>         "LOT_NUM": "001",
>
> The query uses SELECT features[0].properties.MAPBLKLOT, FROM ...
> which is using dot notation, whereas in your queries on my JSON you did
> not use dot notation.
>
> I tried revising the queries you wrote to use dot notation, and it was
> rejected: "no table named 'data'", but I'm not sure why.
>
> Ex:
>
> This works (your original working query):
>
> SELECT data['a'], data['b'] FROM (select flatten(record) AS data from
> dfs.`/tmp/record.json`) WHERE data['b']['b1'] > 60.0;
>
> But this fails:
>
> SELECT data.a AS a, data.b AS b FROM (select flatten(record) AS data from
> dfs.`/tmp/record.json`) WHERE data.b.b1 > 60.0;
> Error: VALIDATION ERROR: From line 1, column 105 to line 1, column 108:
> Table 'data' not found
>
> But your sub-select defines 'data' as, I would assume, a table.
>
> Can you help me clarify this?
> [Error Id: 90c03b40-4f00-43b5-9de9-598102797b2f ] (state=,code=0)
> apache drill>
>
> On Mon, Sep 18, 2023 at 11:17 PM Charles Givre wrote:
>
>> Hi Mike,
>> Let me answer your question with some queries:
>>
>> >>> select * from dfs.test.`record.json`;
>> +--+
>> | record |
>> +--+
>> | [{"a":{"a1":5.0,"a2":6.0},"b":{}},{"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}},{"a":{"a1":7.0,"a2":8.0},"b":{}},{"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}}] |
>> +--+
>>
>> Now... I can flatten that like this:
>>
>> >>> select flatten(record) AS data from dfs.test.`record.json`;
>> +--+
>> | data |
>> +--+
>> | {"a":{"a1":5.0,"a2":6.0},"b":{}} |
>> | {"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}} |
>> | {"a":{"a1":7.0,"a2":8.0},"b":{}} |
>> | {"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}} |
>> +--+
>> 4 rows selected (0.298 seconds)
>>
>> You asked about filtering. For this, I broke it up into a subquery;
>> here's how I did that:
>>
>> >>> SELECT data['a'], data['b']
>> 2..semicolon> FROM (select flatten(record) AS data from dfs.test.`record.json`)
>> 3..semicolon> WHERE data['b']['b1'] > 60.0;
>> ++-+
>> | EXPR$0 | EXPR$1 |
>> ++-+
>> | {} | {"b1":77.0,"b2":88.0,"b3":99.0} |
>> ++-+
>> 1 row selected (0.379 seconds)
>>
>> I did all this without the union data type.
>>
>> Does this make sense?
>> Best,
>> -- C
>>
>> On Sep 13, 2023, at 11:08 AM, Mike Beckerle wrote:
>>
>> I'm thinking about whether a first prototype of DFDL integration to
>> Drill should just use JSON.
>>
>> But please consider this JSON:
>>
>> { "record": [
>>    { "a": { "a1":5, "a2":6 } },
>>    { "b": { "b1":55, "b2":66, "b3":77 } },
>>    { "a": { "a1":7, "a2":8 } },
>>    { "b": { "b1":77, "b2":88, "b3":99 } }
>> ] }
>>
>> It corresponds to this text data file, parsed using Daffodil:
>>
>> 105062556677107082778899
>>
>> The file is a stream of records. The first byte is a tag value: 1 for
>> type 'a' records, and 2 for type 'b' records.
>> The 'a' records are 2 fixed-length fields, each 2 bytes long, named a1
>> and a2. They are integers.
>> The 'b' records are 3 fixed-length fields, each 2 bytes long, named b1,
>> b2, and b3. They are integers.
>> This kind of format is very common, even textualized like this (from
>> COBOL programs, for example).
>>
>> Can Drill query the JSON above to get (b1, b2) where b1 > 10?
>> (And does this require the experimental Union feature?)
>>
>> b1, b2
>> ------
>> (55, 66)
>> (77, 88)
>>
>> I ask because in an XML Schema or DFDL schema, choices with dozens of
>> 'branches' are very common.
>> Ex: schema for the above data (the XML markup was stripped in the archive):
>>
>> ... many child elements, let's say named a1, a2, ...
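To answer the (b1, b2) question concretely: applying the flatten-then-filter pattern from Charles's reply above to the `record.json` sample, an untested sketch (rows from 'a' records, whose b map is empty, are dropped by the WHERE clause, and no union type is needed):

```sql
SELECT t.data.b.b1 AS b1, t.data.b.b2 AS b2
FROM (SELECT flatten(record) AS data FROM dfs.`/tmp/record.json`) t
WHERE t.data.b.b1 > 10;
```

Note the subquery alias `t`: writing `data.b.b1` without it makes Drill resolve `data` as a table name, which is exactly the "Table 'data' not found" error reported earlier in the thread. Bracket notation `data['b']['b1']` sidesteps that because the whole expression starts from a known column.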
[PR] Fixed JSON example to back-tick keyword `properties` [drill-site]
mbeckerle opened a new pull request, #35:
URL: https://github.com/apache/drill-site/pull/35

DRILL-2834

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [I] json example on drill site doesn't work (drill)
mbeckerle commented on issue #2834:
URL: https://github.com/apache/drill/issues/2834#issuecomment-1752028910

It appears 'properties' is a keyword. This works; note the backticks around `properties`:

```
select t.feature.`properties` as property
from (SELECT flatten(features) as feature FROM dfs.`/tmp/citydata.json`) t;
+--+
| property |
+--+
| {"MAPBLKLOT":"0001001","BLKLOT":"0001001","BLOCK_NUM":"0001","LOT_NUM":"001","FROM_ST":"0","TO_ST":"0","STREET":"UNKNOWN","ODD_EVEN":"E"} |
| {"MAPBLKLOT":"0002001","BLKLOT":"0002001","BLOCK_NUM":"0002","LOT_NUM":"001","FROM_ST":"0","TO_ST":"0","STREET":"UNKNOWN","ODD_EVEN":"E"} |
| {"MAPBLKLOT":"0004002","BLKLOT":"0004002","BLOCK_NUM":"0004","LOT_NUM":"002","FROM_ST":"0","TO_ST":"0","STREET":"UNKNOWN","ODD_EVEN":"E"} |
+--+
3 rows selected (0.157 seconds)
apache drill (dfs.tmp)>
```
Re: [I] json example on drill site doesn't work (drill)
mbeckerle commented on issue #2834:
URL: https://github.com/apache/drill/issues/2834#issuecomment-1752019294

I found this works, but I don't know why the "['properties']" notation is needed vs. just ".":

```
select t.feature['properties']['MAPBLKLOT'] as MAPBLKLOT
from (SELECT flatten(features) as feature FROM dfs.`/tmp/citydata.json`) t;
+---+
| MAPBLKLOT |
+---+
| 0001001 |
| 0002001 |
| 0004002 |
+---+
3 rows selected (0.128 seconds)
apache drill>
```
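Combining the two findings in this issue: the dot form does work once the reserved word `properties` is backtick-escaped and the subquery has a table alias for the leading identifier to resolve against. An untested sketch:

```sql
select t.feature.`properties`.MAPBLKLOT as MAPBLKLOT
from (SELECT flatten(features) as feature FROM dfs.`/tmp/citydata.json`) t;
```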
[I] json example on drill site doesn't work (drill)
mbeckerle opened a new issue, #2834:
URL: https://github.com/apache/drill/issues/2834

Using Drill 1.21, the example on the page
https://drill.apache.org/docs/json-data-model/ of querying citylots.json
does not work:

```
SELECT features[0].properties.MAPBLKLOT, FROM dfs.`/home/mbeckerle/Downloads/citylots.json.gz`;
Error: PARSE ERROR: Encountered "." at line 1, column 19.

SQL Query: SELECT features[0].properties.MAPBLKLOT, FROM dfs.`/home/mbeckerle/Downloads/citylots.json.gz`
                  ^
[Error Id: bfb6226d-273f-4a9f-a58b-81c9cd32c18c ] (state=,code=0)
```

In addition, I've tried numerous variants of this query to see if I can
find out what is wrong, but I can't get anything to work.

This much works:

```
SELECT flatten(features) as feature FROM dfs.`/tmp/citydata.json`;
+--+
| feature |
+--+
| {"type":"Feature","properties":{"MAPBLKLOT":"0001001","BLKLOT":"0001001","BLOCK_NUM":"0001","LOT_NUM":"001","FROM_ST":"0","TO_ST":"0","STREET":"UNKNOWN","ODD_EVEN":"E"},"geometry":{"type":"Polygon","coordinates":[[[-122.42200352825247,37.80848009696725,0.0],[-122.42207601332528,37.808835019815085,0.0],[-122.42110217434863,37.808803534992904,0.0],[-122.42106256906727,37.80860105681815,0.0],[-122.42200352825247,37.80848009696725,0.0]]]}} |
| {"type":"Feature","properties":{"MAPBLKLOT":"0002001","BLKLOT":"0002001","BLOCK_NUM":"0002","LOT_NUM":"001","FROM_ST":"0","TO_ST":"0","STREET":"UNKNOWN","ODD_EVEN":"E"},"geometry":{"type":"Polygon","coordinates":[[[-122.42082593937107,37.80863147414603,0.0],[-122.4208580496797,37.80879564136959,0.0],[-122.4198119587043,37.80876180971401,0.0],[-122.42082593937107,37.80863147414603,0.0]]]}} |
| {"type":"Feature","properties":{"MAPBLKLOT":"0004002","BLKLOT":"0004002","BLOCK_NUM":"0004","LOT_NUM":"002","FROM_ST":"0","TO_ST":"0","STREET":"UNKNOWN","ODD_EVEN":"E"},"geometry":{"type":"Polygon","coordinates":[[[-122.41570120460688,37.80832725267146,0.0],[-122.4157607435932,37.808630700240904,0.0],[-122.4137878913324,37.80856680131984,0.0],[-122.41570120460688,37.80832725267146,0.0]]]}} |
+--+
```

But if I then try this in a sub-select to try to dig out the MAPBLKLOT:

```
select t.properties.MAPBLKLOT from (SELECT flatten(features) as feature FROM dfs.`/tmp/citydata.json`) t;
Error: PARSE ERROR: Encountered "." at line 1, column 9.

SQL Query: select t.properties.MAPBLKLOT from (SELECT flatten(features) as feature FROM dfs.`/tmp/citydata.json`) t
        ^
[Error Id: 49670771-e7d5-41c7-9ede-9fa42e3ff261 ] (state=,code=0)
```
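The failing doc example combines several problems discussed in this thread: a stray comma before FROM, the unescaped reserved word `properties`, and no table alias for the dotted path to resolve against. A corrected form might therefore look like the following untested sketch:

```sql
SELECT t.features[0].`properties`.MAPBLKLOT AS MAPBLKLOT
FROM dfs.`/home/mbeckerle/Downloads/citylots.json.gz` t;
```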