Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-10-08 Thread Mike Beckerle
Or this single query:

with pcapDoc as (select PCAP from `infoset.json`),
     packets as (select flatten(pcapDoc.PCAP.Packet) as packet from pcapDoc),
     ipv4Headers as (select packets.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header as hdr from packets),
     ipsrcs as (select ipv4Headers.hdr.IPSrc.value as ip from ipv4Headers),
     ipdests as (select ipv4Headers.hdr.IPDest.value as ip from ipv4Headers),
     ips as (select ip from ipsrcs union select ip from ipdests)
select * from ips;
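For anyone who wants to sanity-check what that query computes without a Drill cluster, here is the same flatten/project/union pipeline in plain Python. The trimmed `infoset` dict is an assumption for illustration: it keeps only the fields the query touches (the full document is quoted later in this thread).

```python
# Plain-Python equivalent of the Drill query above, as a sanity check.
# Trimmed infoset: only the fields the query touches.
infoset = {
    "PCAP": {
        "Packet": [
            {"LinkLayer": {"Ethernet": {"NetworkLayer": {"IPv4": {"IPv4Header": {
                "IPSrc": {"value": "192.168.158.139"},
                "IPDest": {"value": "174.137.42.77"},
            }}}}}},
            {"LinkLayer": {"Ethernet": {"NetworkLayer": {"IPv4": {"IPv4Header": {
                "IPSrc": {"value": "174.137.42.77"},
                "IPDest": {"value": "192.168.158.139"},
            }}}}}},
        ]
    }
}

packets = infoset["PCAP"]["Packet"]          # flatten(pcapDoc.PCAP.Packet)
hdrs = [p["LinkLayer"]["Ethernet"]["NetworkLayer"]["IPv4"]["IPv4Header"]
        for p in packets]                    # ...IPv4Header AS hdr
ipsrcs = {h["IPSrc"]["value"] for h in hdrs}
ipdests = {h["IPDest"]["value"] for h in hdrs}
ips = ipsrcs | ipdests                       # UNION deduplicates, like the query
print(sorted(ips))                           # each address once
```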


On Sun, Oct 8, 2023 at 2:47 PM Mike Beckerle  wrote:

> Ok. It took some time but putting the infoset.json attachment into /tmp
> this SQL pulls out all the IP addresses from the PCAP data:
>
> use dfs.tmp;
>
> create or replace view pcapDoc as select PCAP from `infoset.json`;
>
> create or replace view packets as select flatten(pcapDoc.PCAP.Packet) as packet from pcapDoc;
>
> create or replace view ipv4Headers as select packets.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header as hdr from packets;
>
> create or replace view ipsrcs as select ipv4Headers.hdr.IPSrc.value as ip from ipv4Headers;
>
> create or replace view ipdests as select ipv4Headers.hdr.IPDest.value as ip from ipv4Headers;
>
> create or replace view ips as select ip from ipsrcs union select ip from ipdests;
>
> select * from ips;
>
> On Wed, Sep 13, 2023 at 9:43 AM Mike Beckerle 
> wrote:
>
>> ... sound of crickets on a summer night ...
>>
>> It would really help me if I could get a response to this inquiry, to
>> help me better understand Drill.
>>
>> I do realize people are busy, and also this was originally sent Aug 25,
>> which is the wrong time of year to get a timely response to anything.
>> Hence, this re-up of the message.
>>
>>
>> On Fri, Aug 25, 2023 at 7:39 PM Mike Beckerle 
>> wrote:
>>
>>> Below is a small JSON output from Daffodil and below that is the same
>>> Infoset output as XML.
>>> (They're inline in this message, but I also attached them as files)
>>>
>>> This is just a parse of a small PCAP file with a few ICMP packets in it.
>>> It's an example DFDL schema used to illustrate binary file parsing.
>>>
>>> (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
>>> component schema: https://github.com/DFDLSchemas/ethernetIP)
>>>
>>> My theory is that Drill queries against these should be identical to
>>> obtain the same output row contents.
>>> That is, since this data has the same schema, whether it is JSON or XML
>>> shouldn't affect how you query it.
>>> To do that the XML Reader will need the XML schema (or some
>>> hand-provided metadata) so it knows what is an array. (Specifically
>>> PCAP.Packet is the array.)
>>>
>>> E.g., if you wanted to get the IPSrc and IPDest fields in a table from
>>> all ICMP packets in this file, that query should be the same for the JSON
>>> and the XML data.
>>>
>>> First question: Does that make sense? I want to make sure I'm
>>> understanding this right.
>>>
>>> Second question, since I don't really understand Drill SQL yet.
>>>
>>> What is a query that would pluck the IPSrc.value and IPDest.value from
>>> this data and make a row of each pair of those?
>>>
>>> The top level is a map with a single element named PCAP.
>>> The "table" is PCAP.Packet which is an array (of maps).
>>> And within each array item's map the fields of interest are within
>>> LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header
>>> (so maybe IPv4Header is the table?)
>>> The two fields within there are IPSrc.value (AS src) and IPDest.value
>>> (AS dest)
>>>
>>> I'm lost on how to tell the query that the table is the array
>>> PCAP.Packet, or the IPv4Header within those maybe?
>>>
>>> Maybe this is easy, but I'm just not grokking it yet so I could use some
>>> help here.
>>>
>>> Thanks in advance.
>>>
>>> {
>>>   "PCAP": {
>>>     "PCAPHeader": {
>>>       "MagicNumber": "D4C3B2A1",
>>>       "Version": {
>>>         "Major": "2",
>>>         "Minor": "4"
>>>       },
>>>       "Zone": "0",
>>>       "SigFigs": "0",
>>>       "SnapLen": "65535",
>>>       "Network": "1"
>>>     },
>>>     "Packet": [
>>>       {
>>>         "PacketHeader": {
>>>           "Seconds": "1371631556",
>>>           "USeconds": "838904",
>>>           "InclLen": "74",
>>>           "OrigLen": "74"
>>>         },
>>>         "LinkLayer": {
>>>           "Ethernet": {
>>>             "MACDest": "005056E01449",
>>>             "MACSrc": "000C29340BDE",
>>>             "Ethertype": "2048",
>>>             "NetworkLayer": {
>>>               "IPv4": {
>>>                 "IPv4Header": {
>>>                   "Version": "4",
>>>                   "IHL": "5",
>>>                   "DSCP": "0",
>>>                   "ECN": "0",
>>>                   "Length": "60",
>>>                   "Identification": "55107",
>>>                   "Flags": "0",
>>>                   "FragmentOffset": "0",
>>>                   "TTL": "128",
>>>                   "Protocol": "1",
>>>                   "Checksum": "11123",
>>>                   "IPSrc": {
>>>                     "value": "192.168.158.139"
>>>                   },
>>>                   "IPDest": {
>>>                     "value": "174.137.42.77"
>>>                   },
>>>                   "ComputedChecksum": "11123"
>>>                 },
>>>                 "Protocol": "1",
>>>                 "ICMPv4": {
>>>                   "Type": "8",
>>>                   "Code": "0",
>>>                   "Checksum": "10844",
>>>                   "EchoRequest": {
>>>                     "Identifier": "512",
>>>                     "SequenceNumber": "8448",
>>>                     "Payload": "6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
>>>                   }
>>>                 }
>>>               }
>>>             }
>>>           }

Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-10-08 Thread Mike Beckerle
Ok. It took some time but putting the infoset.json attachment into /tmp
this SQL pulls out all the IP addresses from the PCAP data:

use dfs.tmp;

create or replace view pcapDoc as select PCAP from `infoset.json`;

create or replace view packets as select flatten(pcapDoc.PCAP.Packet) as packet from pcapDoc;

create or replace view ipv4Headers as select packets.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header as hdr from packets;

create or replace view ipsrcs as select ipv4Headers.hdr.IPSrc.value as ip from ipv4Headers;

create or replace view ipdests as select ipv4Headers.hdr.IPDest.value as ip from ipv4Headers;

create or replace view ips as select ip from ipsrcs union select ip from ipdests;

select * from ips;

On Wed, Sep 13, 2023 at 9:43 AM Mike Beckerle  wrote:

> ... sound of crickets on a summer night ...
>
> It would really help me if I could get a response to this inquiry, to help
> me better understand Drill.
>
> I do realize people are busy, and also this was originally sent Aug 25,
> which is the wrong time of year to get a timely response to anything.
> Hence, this re-up of the message.
>
>
> On Fri, Aug 25, 2023 at 7:39 PM Mike Beckerle 
> wrote:
>
>> Below is a small JSON output from Daffodil and below that is the same
>> Infoset output as XML.
>> (They're inline in this message, but I also attached them as files)
>>
>> This is just a parse of a small PCAP file with a few ICMP packets in it.
>> It's an example DFDL schema used to illustrate binary file parsing.
>>
>> (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
>> component schema: https://github.com/DFDLSchemas/ethernetIP)
>>
>> My theory is that Drill queries against these should be identical to
>> obtain the same output row contents.
>> That is, since this data has the same schema, whether it is JSON or XML
>> shouldn't affect how you query it.
>> To do that the XML Reader will need the XML schema (or some hand-provided
>> metadata) so it knows what is an array. (Specifically PCAP.Packet is the
>> array.)
>>
>> E.g., if you wanted to get the IPSrc and IPDest fields in a table from
>> all ICMP packets in this file, that query should be the same for the JSON
>> and the XML data.
>>
>> First question: Does that make sense? I want to make sure I'm
>> understanding this right.
>>
>> Second question, since I don't really understand Drill SQL yet.
>>
>> What is a query that would pluck the IPSrc.value and IPDest.value from
>> this data and make a row of each pair of those?
>>
>> The top level is a map with a single element named PCAP.
>> The "table" is PCAP.Packet which is an array (of maps).
>> And within each array item's map the fields of interest are within
>> LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header
>> (so maybe IPv4Header is the table?)
>> The two fields within there are IPSrc.value (AS src) and IPDest.value (AS
>> dest)
>>
>> I'm lost on how to tell the query that the table is the array
>> PCAP.Packet, or the IPv4Header within those maybe?
>>
>> Maybe this is easy, but I'm just not grokking it yet so I could use some
>> help here.
>>
>> Thanks in advance.
>>
>> {
>> "PCAP": {
>> "PCAPHeader": {
>> "MagicNumber": "D4C3B2A1",
>> "Version": {
>> "Major": "2",
>> "Minor": "4"
>> },
>> "Zone": "0",
>> "SigFigs": "0",
>> "SnapLen": "65535",
>> "Network": "1"
>> },
>> "Packet": [
>> {
>> "PacketHeader": {
>> "Seconds": "1371631556",
>> "USeconds": "838904",
>> "InclLen": "74",
>> "OrigLen": "74"
>> },
>> "LinkLayer": {
>> "Ethernet": {
>> "MACDest": "005056E01449",
>> "MACSrc": "000C29340BDE",
>> "Ethertype": "2048",
>> "NetworkLayer": {
>> "IPv4": {
>> "IPv4Header": {
>> "Version": "4",
>> "IHL": "5",
>> "DSCP": "0",
>> "ECN": "0",
>> "Length": "60",
>> "Identification": "55107",
>> "Flags": "0",
>> "FragmentOffset": "0",
>> "TTL": "128",
>> "Protocol": "1",
>> "Checksum": "11123",
>> "IPSrc": {
>> "value": "192.168.158.139"
>> },
>> "IPDest": {
>> "value": "174.137.42.77"
>> },
>> "ComputedChecksum": "11123"
>> },
>> "Protocol": "1",
>> "ICMPv4": {
>> "Type": "8",
>> "Code": "0",
>> "Checksum": "10844",
>> "EchoRequest": {
>> "Identifier": "512",
>> "SequenceNumber": "8448",
>> "Payload":
>> "6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
>> }
>> }
>> }
>> }
>> }
>> }
>> },
>> {
>> "PacketHeader": {
>> "Seconds": "1371631557",
>> "USeconds": "55699",
>> "InclLen": "74",
>> "OrigLen": "74"
>> },
>> "LinkLayer": {
>> "Ethernet": {
>> "MACDest": "000C29340BDE",
>> "MACSrc": "005056E01449",
>> "Ethertype": "2048",
>> "NetworkLayer": {
>> "IPv4": {
>> "IPv4Header": {
>> "Version": "4",
>> "IHL": "5",
>> "DSCP": "0",
>> "ECN": "0",
>> "Length": "60",
>> "Identification": "30433",
>> "Flags": "0",
>> "FragmentOffset": "0",
>> "TTL": "128",
>> "Protocol": "1",
>> "Checksum": "35797",
>> "IPSrc": {
>> "value": "174.137.42.77"
>> },
>> "IPDest": {
>> "value": "192.168.158.139"
>> },
>> "ComputedChecksum": "35797"
>> },
>> "Protocol": "1",
>> "ICMPv4": {
>> "Type": "0",
>> "Code": "0",
>> 

Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-09-13 Thread Paul Rogers
Hi Mike,

I believe I sent a detailed response to this. Did it get through? If not,
I'll try sending it again...

- Paul

On Wed, Sep 13, 2023 at 6:44 AM Mike Beckerle  wrote:

> ... sound of crickets on a summer night ...
>
> It would really help me if I could get a response to this inquiry, to help
> me better understand Drill.
>
> I do realize people are busy, and also this was originally sent Aug 25,
> which is the wrong time of year to get a timely response to anything.
> Hence, this re-up of the message.
>
>
> On Fri, Aug 25, 2023 at 7:39 PM Mike Beckerle 
> wrote:
>
> > Below is a small JSON output from Daffodil and below that is the same
> > Infoset output as XML.
> > (They're inline in this message, but I also attached them as files)
> >
> > This is just a parse of a small PCAP file with a few ICMP packets in it.
> > It's an example DFDL schema used to illustrate binary file parsing.
> >
> > (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
> > component schema: https://github.com/DFDLSchemas/ethernetIP)
> >
> > My theory is that Drill queries against these should be identical to
> > obtain the same output row contents.
> > That is, since this data has the same schema, whether it is JSON or XML
> > shouldn't affect how you query it.
> > To do that the XML Reader will need the XML schema (or some hand-provided
> > metadata) so it knows what is an array. (Specifically PCAP.Packet is the
> > array.)
> >
> > E.g., if you wanted to get the IPSrc and IPDest fields in a table from
> all
> > ICMP packets in this file, that query should be the same for the JSON and
> > the XML data.
> >
> > First question: Does that make sense? I want to make sure I'm
> > understanding this right.
> >
> > Second question, since I don't really understand Drill SQL yet.
> >
> > What is a query that would pluck the IPSrc.value and IPDest.value from
> > this data and make a row of each pair of those?
> >
> > The top level is a map with a single element named PCAP.
> > The "table" is PCAP.Packet which is an array (of maps).
> > And within each array item's map the fields of interest are within
> > LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header
> > (so maybe IPv4Header is the table?)
> > The two fields within there are IPSrc.value (AS src) and IPDest.value (AS
> > dest)
> >
> > I'm lost on how to tell the query that the table is the array
> PCAP.Packet,
> > or the IPv4Header within those maybe?
> >
> > Maybe this is easy, but I'm just not grokking it yet so I could use some
> > help here.
> >
> > Thanks in advance.
> >
> > {
> > "PCAP": {
> > "PCAPHeader": {
> > "MagicNumber": "D4C3B2A1",
> > "Version": {
> > "Major": "2",
> > "Minor": "4"
> > },
> > "Zone": "0",
> > "SigFigs": "0",
> > "SnapLen": "65535",
> > "Network": "1"
> > },
> > "Packet": [
> > {
> > "PacketHeader": {
> > "Seconds": "1371631556",
> > "USeconds": "838904",
> > "InclLen": "74",
> > "OrigLen": "74"
> > },
> > "LinkLayer": {
> > "Ethernet": {
> > "MACDest": "005056E01449",
> > "MACSrc": "000C29340BDE",
> > "Ethertype": "2048",
> > "NetworkLayer": {
> > "IPv4": {
> > "IPv4Header": {
> > "Version": "4",
> > "IHL": "5",
> > "DSCP": "0",
> > "ECN": "0",
> > "Length": "60",
> > "Identification": "55107",
> > "Flags": "0",
> > "FragmentOffset": "0",
> > "TTL": "128",
> > "Protocol": "1",
> > "Checksum": "11123",
> > "IPSrc": {
> > "value": "192.168.158.139"
> > },
> > "IPDest": {
> > "value": "174.137.42.77"
> > },
> > "ComputedChecksum": "11123"
> > },
> > "Protocol": "1",
> > "ICMPv4": {
> > "Type": "8",
> > "Code": "0",
> > "Checksum": "10844",
> > "EchoRequest": {
> > "Identifier": "512",
> > "SequenceNumber": "8448",
> > "Payload":
> > "6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
> > }
> > }
> > }
> > }
> > }
> > }
> > },
> > {
> > "PacketHeader": {
> > "Seconds": "1371631557",
> > "USeconds": "55699",
> > "InclLen": "74",
> > "OrigLen": "74"
> > },
> > "LinkLayer": {
> > "Ethernet": {
> > "MACDest": "000C29340BDE",
> > "MACSrc": "005056E01449",
> > "Ethertype": "2048",
> > "NetworkLayer": {
> > "IPv4": {
> > "IPv4Header": {
> > "Version": "4",
> > "IHL": "5",
> > "DSCP": "0",
> > "ECN": "0",
> > "Length": "60",
> > "Identification": "30433",
> > "Flags": "0",
> > "FragmentOffset": "0",
> > "TTL": "128",
> > "Protocol": "1",
> > "Checksum": "35797",
> > "IPSrc": {
> > "value": "174.137.42.77"
> > },
> > "IPDest": {
> > "value": "192.168.158.139"
> > },
> > "ComputedChecksum": "35797"
> > },
> > "Protocol": "1",
> > "ICMPv4": {
> > "Type": "0",
> > "Code": "0",
> > "Checksum": "12892",
> > "EchoReply": {
> > "Identifier": "512",
> > "SequenceNumber": "8448",
> > "Payload":
> > "6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
> > }
> > }
> > }
> > }
> > }
> > }
> > },
> > {
> > "PacketHeader": {
> > "Seconds": "1371631557",
> > "USeconds": "840049",
> > "InclLen": "74",
> > "OrigLen": "74"
> > },
> > "LinkLayer": {
> > "Ethernet": {
> > "MACDest": "005056E01449",
> > "MACSrc": "000C29340BDE",
> 

Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-09-13 Thread Mike Beckerle
... sound of crickets on a summer night ...

It would really help me if I could get a response to this inquiry, to help
me better understand Drill.

I do realize people are busy, and also this was originally sent Aug 25,
which is the wrong time of year to get a timely response to anything.
Hence, this re-up of the message.


On Fri, Aug 25, 2023 at 7:39 PM Mike Beckerle  wrote:

> Below is a small JSON output from Daffodil and below that is the same
> Infoset output as XML.
> (They're inline in this message, but I also attached them as files)
>
> This is just a parse of a small PCAP file with a few ICMP packets in it.
> It's an example DFDL schema used to illustrate binary file parsing.
>
> (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
> component schema: https://github.com/DFDLSchemas/ethernetIP)
>
> My theory is that Drill queries against these should be identical to
> obtain the same output row contents.
> That is, since this data has the same schema, whether it is JSON or XML
> shouldn't affect how you query it.
> To do that the XML Reader will need the XML schema (or some hand-provided
> metadata) so it knows what is an array. (Specifically PCAP.Packet is the
> array.)
>
> E.g., if you wanted to get the IPSrc and IPDest fields in a table from all
> ICMP packets in this file, that query should be the same for the JSON and
> the XML data.
>
> First question: Does that make sense? I want to make sure I'm
> understanding this right.
>
> Second question, since I don't really understand Drill SQL yet.
>
> What is a query that would pluck the IPSrc.value and IPDest.value from
> this data and make a row of each pair of those?
>
> The top level is a map with a single element named PCAP.
> The "table" is PCAP.Packet which is an array (of maps).
> And within each array item's map the fields of interest are within
> LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header
> (so maybe IPv4Header is the table?)
> The two fields within there are IPSrc.value (AS src) and IPDest.value (AS
> dest)
>
> I'm lost on how to tell the query that the table is the array PCAP.Packet,
> or the IPv4Header within those maybe?
>
> Maybe this is easy, but I'm just not grokking it yet so I could use some
> help here.
>
> Thanks in advance.
>
> {
> "PCAP": {
> "PCAPHeader": {
> "MagicNumber": "D4C3B2A1",
> "Version": {
> "Major": "2",
> "Minor": "4"
> },
> "Zone": "0",
> "SigFigs": "0",
> "SnapLen": "65535",
> "Network": "1"
> },
> "Packet": [
> {
> "PacketHeader": {
> "Seconds": "1371631556",
> "USeconds": "838904",
> "InclLen": "74",
> "OrigLen": "74"
> },
> "LinkLayer": {
> "Ethernet": {
> "MACDest": "005056E01449",
> "MACSrc": "000C29340BDE",
> "Ethertype": "2048",
> "NetworkLayer": {
> "IPv4": {
> "IPv4Header": {
> "Version": "4",
> "IHL": "5",
> "DSCP": "0",
> "ECN": "0",
> "Length": "60",
> "Identification": "55107",
> "Flags": "0",
> "FragmentOffset": "0",
> "TTL": "128",
> "Protocol": "1",
> "Checksum": "11123",
> "IPSrc": {
> "value": "192.168.158.139"
> },
> "IPDest": {
> "value": "174.137.42.77"
> },
> "ComputedChecksum": "11123"
> },
> "Protocol": "1",
> "ICMPv4": {
> "Type": "8",
> "Code": "0",
> "Checksum": "10844",
> "EchoRequest": {
> "Identifier": "512",
> "SequenceNumber": "8448",
> "Payload":
> "6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
> }
> }
> }
> }
> }
> }
> },
> {
> "PacketHeader": {
> "Seconds": "1371631557",
> "USeconds": "55699",
> "InclLen": "74",
> "OrigLen": "74"
> },
> "LinkLayer": {
> "Ethernet": {
> "MACDest": "000C29340BDE",
> "MACSrc": "005056E01449",
> "Ethertype": "2048",
> "NetworkLayer": {
> "IPv4": {
> "IPv4Header": {
> "Version": "4",
> "IHL": "5",
> "DSCP": "0",
> "ECN": "0",
> "Length": "60",
> "Identification": "30433",
> "Flags": "0",
> "FragmentOffset": "0",
> "TTL": "128",
> "Protocol": "1",
> "Checksum": "35797",
> "IPSrc": {
> "value": "174.137.42.77"
> },
> "IPDest": {
> "value": "192.168.158.139"
> },
> "ComputedChecksum": "35797"
> },
> "Protocol": "1",
> "ICMPv4": {
> "Type": "0",
> "Code": "0",
> "Checksum": "12892",
> "EchoReply": {
> "Identifier": "512",
> "SequenceNumber": "8448",
> "Payload":
> "6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
> }
> }
> }
> }
> }
> }
> },
> {
> "PacketHeader": {
> "Seconds": "1371631557",
> "USeconds": "840049",
> "InclLen": "74",
> "OrigLen": "74"
> },
> "LinkLayer": {
> "Ethernet": {
> "MACDest": "005056E01449",
> "MACSrc": "000C29340BDE",
> "Ethertype": "2048",
> "NetworkLayer": {
> "IPv4": {
> "IPv4Header": {
> "Version": "4",
> "IHL": "5",
> "DSCP": "0",
> "ECN": "0",
> "Length": "60",
> "Identification": "55110",
> "Flags": "0",
> "FragmentOffset": "0",
> "TTL": "128",
> "Protocol": "1",
> "Checksum": "11120",
> "IPSrc": {
> "value": "192.168.158.139"
> },
> "IPDest": {
> "value": "174.137.42.77"
> },
> "ComputedChecksum": "11120"
> },
> "Protocol": "1",
> "ICMPv4": {
> "Type": "8",
> "Code": "0",
> "Checksum": "10588",
> "EchoRequest": {
> "Identifier": "512",
> 

Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-08-25 Thread Paul Rogers
Hi Mike,

You asked about how to work with nested data items. As noted in a previous
email, this can be a bit tricky. Drill uses SQL, and SQL does not have good
native support for structured data: it was designed in the 1970s for
record-oriented data (tuples). Several attempts were made to extend SQL for
structured data, but they didn't really catch on. The one thing that seems
to have "stuck" is the JSON extensions: a field can be of a JSON type,
then you use various functions to work with the data nested within the
JSON. Not very satisfying, but it seems to work: Apache Druid went this
route, for example.

Drill provides the ability to reference a structured item, but doing so
implicitly projects that item to the top level. Suppose we want to display
statistics about packet length. We want only
Packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.Length:

SELECT
  LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.Length AS Length
FROM ...

The above picks out the item you want (I'm supposing that all the layers
are simple maps), but it projects the item to the top level. There is no
syntax in SQL that lets us say, "pick out just that one item, but leave the
existing nested structure". That is, there is no way to say, "Within
IPV4Header, keep Length and IPSrc but skip all the others." Oddly, the EVF
code can do such a projection, but the instructions to do so must come from
the provided schema, not the SQL statement.

The second issue concerns the client using Drill. SQL clients know nothing
about structured data. You could not get Airflow or Tableau or Pandas to
understand the Packet and do anything useful with it: all SQL tools expect
a flattened record. (I'm sure some of these tools can work with data
encoded as JSON, so that is perhaps an option, though it has all manner of
issues.) Indeed, neither ODBC nor JDBC understand structured data. One
would have to use Drill's native API, which is not for the faint of heart.

So, a reasonable goal would be to use Drill to query structured data AND to
project that data into a flat record structure that the client can consume.
This is where you'd need the flatten operator, etc. We'd have to remember
that flattening works down one branch of a tree: one cannot flatten two or
more sibling arrays. Drill also supports lateral joins, which is the fancy
SQL way to express flattening.
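For completeness, here is a sketch of that lateral-join spelling against the PCAP infoset, using the LATERAL/UNNEST syntax Drill added in 1.15. It is untested against this data, so treat the path names and aliases as a starting point rather than a verified query:

```sql
-- Lateral join replacing the flatten() step: UNNEST expands the
-- PCAP.Packet array into one row per packet.
SELECT pkt.hdr.IPSrc.`value`  AS src,
       pkt.hdr.IPDest.`value` AS dest
FROM `infoset.json` doc,
LATERAL (
  SELECT t.p.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header AS hdr
  FROM UNNEST(doc.PCAP.Packet) t(p)
) pkt;
```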

You asked, "What is a query that would pluck the IPSrc.value and
IPDest.value from this data and make a row of each pair of those?"

SELECT
  LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.IPSrc.value AS IPSrc,
  LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.IPDest.value AS IPDest
FROM ...

Or:

SELECT
  header.IPSrc.value AS IPSrc,
  header.IPDest.value AS IPDest
FROM (
  SELECT
    LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header AS header
  FROM ...
)

This gives you an output tuple with two columns: (IPSrc, IPDest).

Internally, Drill will notice that you don't want most of the maps and leaf
values. EVF will do the magic to discard them at scan time. A later project
operator will take the remaining rump maps and project the two remaining
values to the top level. Kinda confusing, but it should work.

Just a side comment: if the "value" fields of "IPSrc" and "IPDest" are just
a syntax convention, it would be handy to automatically trim away the
value, and instead treat IPSrc and IPDest as the scalar values. We do
something like this for the Mongo extended JSON types in the JSON reader.
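As a sketch of that side comment: the trimming amounts to collapsing any map whose only key is "value" into the scalar itself. The helper below is hypothetical (not Drill code), just to pin down the proposed transformation:

```python
# Hypothetical helper (not Drill code): collapse any map whose only key is
# "value" into the value itself, analogous to how the JSON reader collapses
# Mongo extended types.
def trim_value_wrappers(node):
    if isinstance(node, dict):
        if set(node) == {"value"}:
            return node["value"]
        return {k: trim_value_wrappers(v) for k, v in node.items()}
    if isinstance(node, list):
        return [trim_value_wrappers(v) for v in node]
    return node

hdr = {"IPSrc": {"value": "192.168.158.139"}, "TTL": "128"}
print(trim_value_wrappers(hdr))
```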

Thanks,

- Paul



On Fri, Aug 25, 2023 at 4:39 PM Mike Beckerle  wrote:

> Below is a small JSON output from Daffodil and below that is the same
> Infoset output as XML.
> (They're inline in this message, but I also attached them as files)
>
> This is just a parse of a small PCAP file with a few ICMP packets in it.
> It's an example DFDL schema used to illustrate binary file parsing.
>
> (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
> component schema: https://github.com/DFDLSchemas/ethernetIP)
>
> My theory is that Drill queries against these should be identical to
> obtain the same output row contents.
> That is, since this data has the same schema, whether it is JSON or XML
> shouldn't affect how you query it.
> To do that the XML Reader will need the XML schema (or some hand-provided
> metadata) so it knows what is an array. (Specifically PCAP.Packet is the
> array.)
>
> E.g., if you wanted to get the IPSrc and IPDest fields in a table from all
> ICMP packets in this file, that query should be the same for the JSON and
> the XML data.
>
> First question: Does that make sense? I want to make sure I'm
> understanding this right.
>
> Second question, since I don't really understand Drill SQL yet.
>
> What is a query that would pluck the IPSrc.value and IPDest.value from
> this data and make a row of each pair of those?
>
> The top level is a map with a single element named PCAP.
> The "table" is PCAP.Packet which is an array (of maps).
> And within each array item's map the 

Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-08-25 Thread Paul Rogers
Great progress, Mike!

First, let's address the schema issue. As you've probably noticed, Drill's
original notion was that data needed no schema: the data itself provides
sufficient syntactic structure to let Drill infer schema. Also as you've
noticed, this assumption turned out to be more marketing enthusiasm than
technical reality. My favorite example is this JSON: {a: 1}{a: 1.2}. Drill
will fail: the first record has `a` as an integer, but the second value is
a float. By the second value, Drill has already allocated an integer
vector. As Charles has heard me say many times, "Drill cannot predict the
future." If the syntax is ambiguous, Drill may not do the right thing.
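A toy model of why {a: 1}{a: 1.2} fails: like a Drill value vector, the loader below commits to a column type on the first record and cannot revisit it. This is an illustration of the failure mode only, not Drill's actual vector code.

```python
# Toy model of the {a: 1}{a: 1.2} failure: the "vector" type is fixed by
# the first record, so a later type change raises. Not Drill's real code.
def load_column(records, field):
    vec_type, vec = None, []
    for rec in records:
        v = rec[field]
        if vec_type is None:
            vec_type = type(v)   # "vector" allocated from the first value
        if type(v) is not vec_type:
            raise TypeError(
                f"schema change: {field} was {vec_type.__name__}, "
                f"got {type(v).__name__}")
        vec.append(v)
    return vec

print(load_column([{"a": 1}, {"a": 2}], "a"))   # homogeneous: fine
try:
    load_column([{"a": 1}, {"a": 1.2}], "a")    # Drill's problem case
    failed = False
except TypeError as e:
    failed = True
    print(e)
```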

To address this, the team added the notion of a "provided schema", but
support is quite limited. There are two parts: planning and execution. At
execution time, a storage plugin can make use of a provided schema: the
JSON reader does so. To solve the ambiguous JSON example above, one
provides a schema that insists that `a` is a DOUBLE (say) regardless of
what the JSON text might suggest. Sounds like the XML parser needs to be
modified to make use of the provided schema. Here, again, there are two
parts. The "EVF" will handle the vector side of things. However, the parser
needs to know to expect the indicated type. That is, if a field `x` is
defined as a repeated DOUBLE (i.e. ARRAY&lt;DOUBLE&gt;), then the XML parser has
to understand that it will see multiple "x" elements one after another, and
to stick the values into the "x" array vector. I believe that Charles said
this functionality is missing.
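To make the "parser needs to expect the indicated type" point concrete, here is a minimal sketch of schema-guided XML reading: a set of known-repeated element paths (just PCAP.Packet, per this thread) tells the reader which children to accumulate into arrays rather than overwrite. The `ARRAYS` set and `xml_to_map` helper are hypothetical, not Drill's XML reader.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema-guided XML reader sketch: paths listed in ARRAYS are
# known (from a provided schema) to repeat, so their values accumulate
# into lists instead of overwriting each other.
ARRAYS = {("PCAP", "Packet")}

def xml_to_map(elem, path):
    if len(elem) == 0:
        return elem.text          # leaf element -> scalar
    out = {}
    for child in elem:
        cpath = path + (child.tag,)
        val = xml_to_map(child, cpath)
        if cpath in ARRAYS:       # schema says: repeated element
            out.setdefault(child.tag, []).append(val)
        else:
            out[child.tag] = val
    return out

doc = ET.fromstring(
    "<PCAP><Packet><TTL>128</TTL></Packet>"
    "<Packet><TTL>64</TTL></Packet></PCAP>")
result = xml_to_map(doc, ("PCAP",))
print(result)
```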

We then turn to the other aspect of schema: planning. We'd like the planner
to know the schema so it can validate expressions at plan time. Currently,
Drill assumes that items have a "type to be named later", and makes some
light inferences. If you have SELECT a + b AS c, then we know that a, b and
c are numeric types, but we don't know which specific types they might be:
Drill works those out at run time. With a schema, we should assign types at
plan time, though I suppose Drill should work even without that aspect
(except that I'm sure someone will find a corner case that produces a
different result depending on whether we use the proper types at plan time.)

Another aspect is how the execution planner knows to include a provided
schema in the plan JSON sent to the XML reader. My memory here is hazy: we have
something as demonstrated by the JSON reader. Presumably this same
technique can be used for XML. In particular, there is some JSON element
that holds the provided schema, and there is some way in the planner to
populate that element.

Another question is how Drill is made aware of the schema. This is another
area where Drill's features are somewhat incomplete. Originally, there was
a hidden JSON file in the target HDFS folder that could gather Parquet
metadata. Toward the end of the MapR era, the team added a metastore. The
Hive-based readers use the Hive metastore (HMS). Oddly, Drill cannot use
HMS for Drill's own native readers because of the issue mentioned at the
top: Drill was designed to not need a schema.

We can then ask, where would the Daffodil schemas reside? In a directory
somewhere? In a web service? In a "Daffodil metastore"? How does Daffodil
associate a schema with a file? Or, is that something that the user has to
do? The answer to this will determine how to integrate the Daffodil schema
with Drill.

Drill provides the ability to use table functions to provide extra
properties to a reader. Again, the details have become hazy, but the JSON
tests have examples, I believe. So, one solution is to convert the Daffodil
schema to the Drill schema, and have Drill read that file for each query.
Not very satisfying. Better would be to point Drill to the Daffodil schema,
and let Drill do the conversion. I don't believe we have such a mechanism
at present.

The ideal would be a unified concept of schema: a schema reader that
converts the schema to Drill format, and consistent planner and execution
use of that schema. You would just then need to create a "Daffodil schema
provider." Perhaps there is a way to leverage the metastore API to do this?

The final step would be to automate the association of files with Daffodil
schema: some kind of metastore or registry that says that "file foo.xml
uses schema pcap.dfdl" or whatever.

FWIW, Drill has the notion of a "strict" vs. "lenient" schema. A strict
schema says that the file must include only those fields in the schema. A
lenient schema says that the file may have additional fields, and those
fields use Drill's plain old data inference for those extras.
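The strict/lenient distinction surfaces in Drill's provided-schema DDL. A sketch follows, with hypothetical column names, using the CREATE SCHEMA syntax and the `drill.strict` table property from the provided-schema docs (verify the property name against your Drill version):

```sql
USE dfs.tmp;

-- Attach a provided schema to a file-based table; 'drill.strict' = 'true'
-- rejects fields outside the schema, 'false' falls back to inference.
CREATE OR REPLACE SCHEMA (
  `Seconds` BIGINT,
  `InclLen` INT
) FOR TABLE dfs.tmp.`packets`
PROPERTIES ('drill.strict' = 'true');
```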

Sounds like the next three steps for you would be:

1. Extend the XML reader to use a provided schema.
2. Extend the XML reader to support arrays, as indicated by the schema.
3. Test the above using the query-time schema mechanism as demonstrated by
the schema-aware JSON reader tests.

Once that works, you can then move onto the plan-time issues.