Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-10-08 Thread Mike Beckerle
Or this single query:

with pcapDoc as (select PCAP from `infoset.json`),
  packets as (select flatten(pcapDoc.PCAP.Packet) as packet from pcapDoc),
  ipv4Headers as (select
    packets.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header as hdr
    from packets),
  ipsrcs as (select ipv4Headers.hdr.IPSrc.value as ip from ipv4Headers),
  ipdests as (select ipv4Headers.hdr.IPDest.value as ip from ipv4Headers),
  ips as (select ip from ipsrcs union select ip from ipdests)
select * from ips;
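For readers without a Drill instance handy, the same pipeline (flatten the Packet array, project the IPv4 header, union the source and destination addresses) can be sketched in plain Python. This is a rough analogue, not from the original thread; the key path follows the infoset.json structure shown later in this thread.

```python
def extract_ips(infoset: dict) -> set:
    """Mimic the CTE pipeline above: flatten PCAP.Packet, project the
    IPv4 header, and union IPSrc/IPDest values (a set, since UNION dedupes)."""
    ips = set()
    for packet in infoset["PCAP"]["Packet"]:  # flatten(pcapDoc.PCAP.Packet)
        hdr = packet["LinkLayer"]["Ethernet"]["NetworkLayer"]["IPv4"]["IPv4Header"]
        ips.add(hdr["IPSrc"]["value"])        # ipsrcs
        ips.add(hdr["IPDest"]["value"])       # ipdests
    return ips

# Usage (assuming the infoset.json attachment from this thread is in /tmp):
#   import json
#   print(sorted(extract_ips(json.load(open("/tmp/infoset.json")))))
```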



Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-10-08 Thread Mike Beckerle
Ok. It took some time, but after putting the infoset.json attachment into
/tmp, this SQL pulls out all the IP addresses from the PCAP data:

use dfs.tmp;

create or replace view pcapDoc as
  select PCAP from `infoset.json`;

create or replace view packets as
  select flatten(pcapDoc.PCAP.Packet) as packet from pcapDoc;

create or replace view ipv4Headers as
  select packets.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header as hdr
  from packets;

create or replace view ipsrcs as
  select ipv4Headers.hdr.IPSrc.value as ip from ipv4Headers;

create or replace view ipdests as
  select ipv4Headers.hdr.IPDest.value as ip from ipv4Headers;

create or replace view ips as
  select ip from ipsrcs union select ip from ipdests;

select * from ips;

On Wed, Sep 13, 2023 at 9:43 AM Mike Beckerle  wrote:

> ... sound of crickets on a summer night.
>
> It would really help me if I could get a response to this inquiry, to help
> me better understand Drill.
>
> I do realize people are busy, and also that this was originally sent Aug 25,
> which is the wrong time of year to get a timely response to anything.
> Hence this re-up of the message.
>
>
> On Fri, Aug 25, 2023 at 7:39 PM Mike Beckerle 
> wrote:
>
>> Below is a small JSON output from Daffodil and below that is the same
>> Infoset output as XML.
>> (They're inline in this message, but I also attached them as files)
>>
>> This is just a parse of a small PCAP file with a few ICMP packets in it.
>> It's an example DFDL schema used to illustrate binary file parsing.
>>
>> (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
>> component schema: https://github.com/DFDLSchemas/ethernetIP)
>>
>> My theory is that Drill queries against these should be identical and
>> should produce the same output row contents.
>> That is, since this data has the same schema, whether it is JSON or XML
>> shouldn't affect how you query it.
>> To do that, the XML reader will need the XML schema (or some hand-provided
>> metadata) so it knows which elements are arrays. (Specifically, PCAP.Packet
>> is the array.)
>>
>> E.g., if you wanted to get the IPSrc and IPDest fields in a table from
>> all ICMP packets in this file, that query should be the same for the JSON
>> and the XML data.
>>
>> First question: Does that make sense? I want to make sure I'm
>> understanding this right.
>>
>> Second question, since I don't really understand Drill SQL yet:
>>
>> What is a query that would pluck the IPSrc.value and IPDest.value from
>> this data and make a row of each pair of those?
>>
>> The top level is a map with a single element named PCAP.
>> The "table" is PCAP.Packet which is an array (of maps).
>> And within each array item's map the fields of interest are within
>> LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header
>> (so maybe IPv4Header is the table?)
>> The two fields within there are IPSrc.value (AS src) and IPDest.value
>> (AS dest).
>>
>> I'm lost on how to tell the query that the table is the array
>> PCAP.Packet, or the IPv4Header within those maybe?
>>
>> Maybe this is easy, but I'm just not grokking it yet so I could use some
>> help here.
>>
>> Thanks in advance.
>>
>> {
>>   "PCAP": {
>>     "PCAPHeader": {
>>       "MagicNumber": "D4C3B2A1",
>>       "Version": {
>>         "Major": "2",
>>         "Minor": "4"
>>       },
>>       "Zone": "0",
>>       "SigFigs": "0",
>>       "SnapLen": "65535",
>>       "Network": "1"
>>     },
>>     "Packet": [
>>       {
>>         "PacketHeader": {
>>           "Seconds": "1371631556",
>>           "USeconds": "838904",
>>           "InclLen": "74",
>>           "OrigLen": "74"
>>         },
>>         "LinkLayer": {
>>           "Ethernet": {
>>             "MACDest": "005056E01449",
>>             "MACSrc": "000C29340BDE",
>>             "Ethertype": "2048",
>>             "NetworkLayer": {
>>               "IPv4": {
>>                 "IPv4Header": {
>>                   "Version": "4",
>>                   "IHL": "5",
>>                   "DSCP": "0",
>>                   "ECN": "0",
>>                   "Length": "60",
>>                   "Identification": "55107",
>>                   "Flags": "0",
>>                   "FragmentOffset": "0",
>>                   "TTL": "128",
>>                   "Protocol": "1",
>>                   "Checksum": "11123",
>>                   "IPSrc": { "value": "192.168.158.139" },
>>                   "IPDest": { "value": "174.137.42.77" },
>>                   "ComputedChecksum": "11123"
>>                 },
>>                 "Protocol": "1",
>>                 "ICMPv4": {
>>                   "Type": "8",
>>                   "Code": "0",
>>                   "Checksum": "10844",
>>                   "EchoRequest": {
>>                     "Identifier": "512",
>>                     "SequenceNumber": "8448",
>>                     "Payload": "6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
>>                   }
>>                 }
>>               }
>>             }
>>           }
>>         }
>>       },
>>       {
>>         "PacketHeader": {
>>           "Seconds": "1371631557",
>>           "USeconds": "55699",
>>           "InclLen": "74",
>>           "OrigLen": "74"
>>         },
>>         "LinkLayer": {
>>           "Ethernet": {
>>             "MACDest": "000C29340BDE",
>>             "MACSrc": "005056E01449",
>>             "Ethertype": "2048",
>>             "NetworkLayer": {
>>               "IPv4": {
>>                 "IPv4Header": {
>>                   "Version": "4",
>>                   "IHL": "5",
>>                   "DSCP": "0",
>>                   "ECN": "0",
>>                   "Length": "60",
>>                   "Identification": "30433",
>>                   "Flags": "0",
>>                   "FragmentOffset": "0",
>>                   "TTL": "128",
>>                   "Protocol": "1",
>>                   "Checksum": "35797",
>>                   "IPSrc": { "value": "174.137.42.77" },
>>                   "IPDest": { "value": "192.168.158.139" },
>>                   "ComputedChecksum": "35797"
>>                 },
>>                 "Protocol": "1",
>>                 "ICMPv4": {
>>                   "Type": "0",
>>                   "Code": "0",
>> 

Re: Question on Representing DFDL/XSD choice data for Drill (Unions required?)

2023-10-08 Thread Mike Beckerle
Never mind, I figured this out: it was due to `properties` being a reserved
keyword. I created a PR to fix the JSON doc on the Drill site.

On Sat, Oct 7, 2023 at 1:46 PM Mike Beckerle  wrote:

> Ok, after weeks of delay
>
> That helps a great deal. You flatten the array of maps into a table of
> maps.
>
> I am still confused about when I must use square-bracket versus dot
> notation: data['a'] vs. data.a
> The JSON documentation for Drill uses dot notation to reach into fields of
> a map.
>
> Ex: from the JSON doc:
>
> {
>   "type": "FeatureCollection",
>   "features": [
>   {
> "type": "Feature",
> "properties":
> {
>   "MAPBLKLOT": "0001001",
>   "BLKLOT": "0001001",
>   "BLOCK_NUM": "0001",
>   "LOT_NUM": "001",
>
>
> The query uses SELECT features[0].properties.MAPBLKLOT, FROM ...,
> which uses dot notation, whereas your queries on my JSON did not use dot
> notation.
>
> I tried revising the queries you wrote using the dot notation, and it was
> rejected. "no table named 'data'", but I'm not sure why.
>
> Ex:
>
> This works: (your original working query)
>
> SELECT data['a'], data['b'] FROM (select flatten(record) AS data from
> dfs.`/tmp/record.json`) WHERE data['b']['b1'] > 60.0;
>
> But this fails:
>
> SELECT data.a AS a, data.b AS b FROM (select flatten(record) AS data from
> dfs.`/tmp/record.json`) WHERE data.b.b1 > 60.0;
> Error: VALIDATION ERROR: From line 1, column 105 to line 1, column 108:
> Table 'data' not found
>
> But your sub-select defines 'data' as, I would assume, a table.
>
> Can you help me clarify this?
>
> [Error Id: 90c03b40-4f00-43b5-9de9-598102797b2f ] (state=,code=0)
> apache drill>
>
>
> On Mon, Sep 18, 2023 at 11:17 PM Charles Givre  wrote:
>
>> Hi Mike,
>> Let me answer your question with some queries:
>>
>>  >>> select * from dfs.test.`record.json`;
>>
>> +--+
>> | record |
>> +--+
>> | [{"a":{"a1":5.0,"a2":6.0},"b":{}},{"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}},{"a":{"a1":7.0,"a2":8.0},"b":{}},{"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}}] |
>> +--+
>>
>> Now... I can flatten that like this:
>>
>> >>> select flatten(record) AS data from dfs.test.`record.json`;
>> +--+
>> | data |
>> +--+
>> | {"a":{"a1":5.0,"a2":6.0},"b":{}} |
>> | {"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}} |
>> | {"a":{"a1":7.0,"a2":8.0},"b":{}} |
>> | {"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}} |
>> +--+
>> 4 rows selected (0.298 seconds)
>>
>> You asked about filtering. For this, I broke it up into a subquery;
>> here's how I did that:
>>
>> >>> SELECT data['a'], data['b']
>> 2..semicolon> FROM (select flatten(record) AS data from
>> dfs.test.`record.json`)
>> 3..semicolon> WHERE data['b']['b1'] > 60.0;
>> ++-+
>> | EXPR$0 | EXPR$1  |
>> ++-+
>> | {} | {"b1":77.0,"b2":88.0,"b3":99.0} |
>> ++-+
>> 1 row selected (0.379 seconds)
>>
>> I did all this without the union data type.
>>
>> Does this make sense?
>> Best,
>> -- C
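The flatten-then-filter pattern Charles shows can be mirrored in plain Python over the same row shapes Drill printed (a sketch for comparison, not part of the original exchange; note each flattened row carries both "a" and "b" keys, with empty maps for the absent branch):

```python
def flatten_and_filter(doc: dict, threshold: float = 60.0) -> list:
    """Mimic: SELECT data['a'], data['b']
              FROM (select flatten(record) AS data ...)
              WHERE data['b']['b1'] > threshold.
    Rows whose 'b' map is empty (no b1) are filtered out, much as the
    Drill comparison against a missing value is."""
    out = []
    for data in doc["record"]:            # flatten: one row per array element
        b1 = data.get("b", {}).get("b1")
        if b1 is not None and b1 > threshold:
            out.append((data.get("a", {}), data["b"]))
    return out
```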
>>
>>
>> On Sep 13, 2023, at 11:08 AM, Mike Beckerle  wrote:
>>
>> I'm wondering whether a first prototype of DFDL integration with Drill
>> should just use JSON.
>>
>> But please consider this JSON:
>>
>> { "record": [
>>    { "a": { "a1":5, "a2":6 } },
>>    { "b": { "b1":55, "b2":66, "b3":77 } },
>>    { "a": { "a1":7, "a2":8 } },
>>    { "b": { "b1":77, "b2":88, "b3":99 } }
>>  ] }
>>
>> It corresponds to this text data file, parsed using Daffodil:
>>
>>105062556677107082778899
>>
>> The file is a stream of records. The first byte is a tag value 1 for type
>> 'a' records, and 2 for type 'b' records.
>> The 'a' records have 2 fixed-length fields, each 2 bytes long, named a1
>> and a2. They are integers.
>> The 'b' records have 3 fixed-length fields, each 2 bytes long, named b1,
>> b2, and b3. They are integers.
>> This kind of format is very common, even textualized like this (from COBOL
>> programs, for example).
>>
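As a sanity check on the record layout just described, a small Python sketch (illustrative only, not from the thread) can parse the tagged fixed-length text back into the record list shown above:

```python
def parse_records(data: str) -> list:
    """Parse a stream of tagged fixed-length records:
    tag '1' -> record 'a' with two 2-char integer fields (a1, a2),
    tag '2' -> record 'b' with three 2-char integer fields (b1, b2, b3)."""
    layouts = {"1": ("a", ["a1", "a2"]), "2": ("b", ["b1", "b2", "b3"])}
    records, pos = [], 0
    while pos < len(data):
        name, fields = layouts[data[pos]]   # first char is the tag
        pos += 1
        body = {}
        for f in fields:                    # each field is 2 chars wide
            body[f] = int(data[pos:pos + 2])
            pos += 2
        records.append({name: body})
    return records
```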
>> Can Drill query the JSON above to get (b1, b2) where b1 > 10 ?
>> (and ... does this require the experimental Union feature?)
>>
>> b1, b2
>> -
>> (55, 66)
>> (77, 88)
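The desired result can also be computed directly from the "record" array shown above; this Python sketch (illustrative only) captures the selection logic the Drill query needs to express:

```python
def select_b_pairs(records: list, threshold: int = 10) -> list:
    """From a list of choice-style records like the JSON above, return
    (b1, b2) for every 'b' record with b1 > threshold; 'a' records are
    simply skipped -- no union type needed."""
    return [(r["b"]["b1"], r["b"]["b2"])
            for r in records
            if "b" in r and r["b"]["b1"] > threshold]
```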
>>
>> I ask because in an XML Schema or DFDL schema choices with dozens of
>> 'branches' are very common.
>> Ex: schema for the above data:
>>
>> <xs:element name="record">
>>   <xs:complexType>
>>     <xs:choice>
>>       <xs:element name="a">
>>         ... many child elements, let's say named a1, a2, ...

[PR] Fixed JSON example to back-tick keyword `properties` [drill-site]

2023-10-08 Thread via GitHub


mbeckerle opened a new pull request, #35:
URL: https://github.com/apache/drill-site/pull/35

   DRILL-2834


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] json example on drill site doesn't work (drill)

2023-10-08 Thread via GitHub


mbeckerle commented on issue #2834:
URL: https://github.com/apache/drill/issues/2834#issuecomment-1752028910

   It appears `properties` is a reserved keyword. This works; note the
   backticks around `properties`:
   ```
   select t.feature.`properties` as property from (SELECT flatten(features) as feature FROM dfs.`/tmp/citydata.json`) t;
   
   +--+
   | property |
   +--+
   | {"MAPBLKLOT":"0001001","BLKLOT":"0001001","BLOCK_NUM":"0001","LOT_NUM":"001","FROM_ST":"0","TO_ST":"0","STREET":"UNKNOWN","ODD_EVEN":"E"} |
   | {"MAPBLKLOT":"0002001","BLKLOT":"0002001","BLOCK_NUM":"0002","LOT_NUM":"001","FROM_ST":"0","TO_ST":"0","STREET":"UNKNOWN","ODD_EVEN":"E"} |
   | {"MAPBLKLOT":"0004002","BLKLOT":"0004002","BLOCK_NUM":"0004","LOT_NUM":"002","FROM_ST":"0","TO_ST":"0","STREET":"UNKNOWN","ODD_EVEN":"E"} |
   +--+
   3 rows selected (0.157 seconds)
   apache drill (dfs.tmp)> 
   ```





Re: [I] json example on drill site doesn't work (drill)

2023-10-08 Thread via GitHub


mbeckerle commented on issue #2834:
URL: https://github.com/apache/drill/issues/2834#issuecomment-1752019294

   I found this works, but I don't know why the `['properties']` notation is
   needed vs. just `.`:
   ```
   select t.feature['properties']['MAPBLKLOT'] as MAPBLKLOT from (SELECT flatten(features) as feature FROM dfs.`/tmp/citydata.json`) t;
   +---+
   | MAPBLKLOT |
   +---+
   | 0001001   |
   | 0002001   |
   | 0004002   |
   +---+
   3 rows selected (0.128 seconds)
   apache drill> 
   ```





[I] json example on drill site doesn't work (drill)

2023-10-08 Thread via GitHub


mbeckerle opened a new issue, #2834:
URL: https://github.com/apache/drill/issues/2834

   Using Drill 1.21
   
   The example on page https://drill.apache.org/docs/json-data-model/ of 
querying the citylots.json does not work:
   
   ```
   SELECT features[0].properties.MAPBLKLOT, FROM 
dfs.`/home/mbeckerle/Downloads/citylots.json.gz`;
   Error: PARSE ERROR: Encountered "." at line 1, column 19.
   
   SQL Query: SELECT features[0].properties.MAPBLKLOT, FROM 
dfs.`/home/mbeckerle/Downloads/citylots.json.gz`
^
   
   [Error Id: bfb6226d-273f-4a9f-a58b-81c9cd32c18c ] (state=,code=0)
   ```
   
   In addition, I've tried numerous variants of this query to see if I can find 
out what is wrong, but I can't get anything to work. 
   
   This much works:
   ```
   SELECT flatten(features) as feature FROM dfs.`/tmp/citydata.json`;
   
   +--+
   | feature |
   +--+
   | {"type":"Feature","properties":{"MAPBLKLOT":"0001001","BLKLOT":"0001001","BLOCK_NUM":"0001","LOT_NUM":"001","FROM_ST":"0","TO_ST":"0","STREET":"UNKNOWN","ODD_EVEN":"E"},"geometry":{"type":"Polygon","coordinates":[[[-122.42200352825247,37.80848009696725,0.0],[-122.42207601332528,37.808835019815085,0.0],[-122.42110217434863,37.808803534992904,0.0],[-122.42106256906727,37.80860105681815,0.0],[-122.42200352825247,37.80848009696725,0.0]]]}} |
   | {"type":"Feature","properties":{"MAPBLKLOT":"0002001","BLKLOT":"0002001","BLOCK_NUM":"0002","LOT_NUM":"001","FROM_ST":"0","TO_ST":"0","STREET":"UNKNOWN","ODD_EVEN":"E"},"geometry":{"type":"Polygon","coordinates":[[[-122.42082593937107,37.80863147414603,0.0],[-122.4208580496797,37.80879564136959,0.0],[-122.4198119587043,37.80876180971401,0.0],[-122.42082593937107,37.80863147414603,0.0]]]}} |
   | {"type":"Feature","properties":{"MAPBLKLOT":"0004002","BLKLOT":"0004002","BLOCK_NUM":"0004","LOT_NUM":"002","FROM_ST":"0","TO_ST":"0","STREET":"UNKNOWN","ODD_EVEN":"E"},"geometry":{"type":"Polygon","coordinates":[[[-122.41570120460688,37.80832725267146,0.0],[-122.4157607435932,37.808630700240904,0.0],[-122.4137878913324,37.80856680131984,0.0],[-122.41570120460688,37.80832725267146,0.0]]]}} |
   +--+
   ```
   But if I then try this in a sub-select to try to dig out the MAPBLKLOT
   ```
   select t.properties.MAPBLKLOT from (SELECT flatten(features) as feature FROM dfs.`/tmp/citydata.json`) t;
   Error: PARSE ERROR: Encountered "." at line 1, column 9.
   
   SQL Query: select t.properties.MAPBLKLOT from (SELECT flatten(features) as 
feature FROM dfs.`/tmp/citydata.json`) t
  ^
   
   [Error Id: 49670771-e7d5-41c7-9ede-9fa42e3ff261 ] (state=,code=0)
   ```
   

