Re: QUESTION: Packet Parser for PCAP Plugin

2019-04-23 Thread Charles Givre
Hi Ted, 
I thought about that approach as well.  My concern was cluttering up the plugin 
with lots of columns, especially as we add different protocols.  However, if 
that's not a concern, I can have a go at it.  

I was thinking the same thing about Kaitai Struct.  Would it be possible to 
have some generic reader such that you provide the schema, and Drill maps 
that to columns as appropriate?  That way you could use all the formats pretty 
much instantly from the Kaitai format gallery. 
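
A minimal sketch of the schema-driven idea, in Python for brevity rather than 
Drill's Java. A declarative field list stands in for a Kaitai spec, and one 
generic function maps a byte buffer to named columns. The names 
DNS_HEADER_SPEC and read_record are hypothetical and are neither Kaitai's nor 
Drill's API; only the 12-byte DNS header layout (six big-endian 16-bit fields) 
is standard.

```python
import struct

# Hypothetical declarative spec: (column name, struct format code).
# It stands in for a Kaitai .ksy description; this is not Kaitai's API.
DNS_HEADER_SPEC = [
    ("dns_id", "H"),
    ("dns_flags", "H"),
    ("dns_qdcount", "H"),
    ("dns_ancount", "H"),
    ("dns_nscount", "H"),
    ("dns_arcount", "H"),
]


def read_record(spec, payload):
    """Map the leading bytes of payload to the columns named in spec."""
    fmt = "!" + "".join(code for _, code in spec)  # "!" = network byte order
    size = struct.calcsize(fmt)
    if len(payload) < size:
        # Too short to match the spec: every column is null.
        return {name: None for name, _ in spec}
    values = struct.unpack(fmt, payload[:size])
    return dict(zip((name for name, _ in spec), values))


# A DNS query header: id=0x1234, flags=0x0100, one question, no answers.
header = bytes.fromhex("123401000001000000000000")
row = read_record(DNS_HEADER_SPEC, header)
print(row)
```

Supporting another format would then mean supplying another spec, not 
writing another reader.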



> On Apr 23, 2019, at 5:08 PM, Ted Dunning  wrote:
> 
> Wow. Kaitai looks fabulous. It would be tempting to define a generic format
> that could use a kaitai spec to define the format of a file.
> 
> Regarding the map output, I think we solved the same problem in the PCAP
> parser itself by simply putting all of the fields at the top level and
> making them nullable. This means that the UDP stuff is null for TCP packets.
> 
> The same approach could be taken for other packets. If parsing is lazy,
> then reference to a parsed column would be required to trigger the parsing
> of a packet.
> 
> 
> 
> On Tue, Apr 23, 2019 at 10:52 AM Charles Givre  wrote:
> 
>> Hi Ted
>> The library that gave me the idea is Kaitai Struct.  The Java library
>> itself is released under the Apache or MIT license.  It can parse a number
>> of binary formats including DNS packets, ICMP and many others.  It accepts
>> a byte[] as input. I already wrote working code that reads it but I’m not
>> sure how to output these results in Drill.
>> 
>> Sent from my iPhone
>> 
>>> On Apr 23, 2019, at 12:45, Ted Dunning  wrote:
>>> 
>>> I think this would be very useful, particularly if it is easy to add
>>> additional parsing methods.
>>> 
>>> When I started the pcap work, I couldn't find any libraries that combined
>>> what we needed in terms of function and license.
>>> 
>>>> On Tue, Apr 23, 2019, 9:34 AM Charles Givre  wrote:
>>>> 
>>>> Hello all,
>>>> I saw a few open source libraries that parse actual packet content and was
>>>> interested in incorporating this into Drill's PCAP parser.  I was thinking
>>>> initially of writing this as a UDF, however, I think it would be much
>>>> better to include this directly in Drill.  What I was thinking was to
>>>> create a field called parsed_packet that would be a Drill Map.  The
>>>> contents of this field would vary depending on the type of packet.  For
>>>> instance, if it is a DNS packet, you get all the DNS info, ICMP etc...
>>>> Does the community think this is a good idea?   Also, given the structure
>>>> of the PCAP plugin, I'm not quite sure how to create a Map field with
>>>> variable contents.  Are there any examples that use the same architecture
>>>> as the PCAP plugin?
>>>> Thanks,
>>>> -- C
>> 



Re: QUESTION: Packet Parser for PCAP Plugin

2019-04-23 Thread Ted Dunning
Wow. Kaitai looks fabulous. It would be tempting to define a generic format
that could use a kaitai spec to define the format of a file.

Regarding the map output, I think we solved the same problem in the PCAP
parser itself by simply putting all of the fields at the top level and
making them nullable. This means that the UDP stuff is null for TCP packets.

The same approach could be taken for other packets. If parsing is lazy,
then a reference to a parsed column would be required to trigger the parsing
of a packet.
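
A rough sketch of that combination (top-level nullable columns plus lazy 
parsing), written in Python for brevity rather than Drill's Java. The 
LazyPacket class and its column names are hypothetical and are not part of 
the PCAP plugin. The payload is decoded only when a DNS column is first 
referenced, and non-DNS packets simply return null for those columns.

```python
import struct
from functools import cached_property


class LazyPacket:
    """Hypothetical row wrapper: protocol columns are parsed on first access."""

    parse_count = 0  # how many payloads have actually been decoded

    def __init__(self, protocol, payload):
        self.protocol = protocol
        self.payload = payload

    @cached_property
    def _dns_header(self):
        # Only DNS packets carrying a full 12-byte header yield values.
        if self.protocol != "DNS" or len(self.payload) < 12:
            return None
        LazyPacket.parse_count += 1
        # Six big-endian 16-bit fields: id, flags, and four section counts.
        return struct.unpack("!HHHHHH", self.payload[:12])

    @property
    def dns_id(self):
        header = self._dns_header
        return None if header is None else header[0]

    @property
    def dns_question_count(self):
        header = self._dns_header
        return None if header is None else header[2]


dns = LazyPacket("DNS", bytes.fromhex("123401000001000000000000"))
tcp = LazyPacket("TCP", b"\x00" * 20)

print(LazyPacket.parse_count)              # nothing has been referenced yet
print(dns.dns_id, dns.dns_question_count)  # first reference triggers the parse
print(tcp.dns_id)                          # DNS columns are null for other packets
print(LazyPacket.parse_count)
```

A query that never references a DNS column would never pay for DNS parsing, 
and two references to the same packet share one parse.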



On Tue, Apr 23, 2019 at 10:52 AM Charles Givre  wrote:

> Hi Ted
> The library that gave me the idea is Kaitai Struct.  The Java library
> itself is released under the Apache or MIT license.  It can parse a number
> of binary formats including DNS packets, ICMP and many others.  It accepts
> a byte[] as input. I already wrote working code that reads it but I’m not
> sure how to output these results in Drill.
>
> Sent from my iPhone
>
> > On Apr 23, 2019, at 12:45, Ted Dunning  wrote:
> >
> > I think this would be very useful, particularly if it is easy to add
> > additional parsing methods.
> >
> > When I started the pcap work, I couldn't find any libraries that combined
> > what we needed in terms of function and license.
> >
> >> On Tue, Apr 23, 2019, 9:34 AM Charles Givre  wrote:
> >>
> >> Hello all,
> >> I saw a few open source libraries that parse actual packet content and was
> >> interested in incorporating this into Drill's PCAP parser.  I was thinking
> >> initially of writing this as a UDF, however, I think it would be much
> >> better to include this directly in Drill.  What I was thinking was to
> >> create a field called parsed_packet that would be a Drill Map.  The
> >> contents of this field would vary depending on the type of packet.  For
> >> instance, if it is a DNS packet, you get all the DNS info, ICMP etc...
> >> Does the community think this is a good idea?   Also, given the structure
> >> of the PCAP plugin, I'm not quite sure how to create a Map field with
> >> variable contents.  Are there any examples that use the same architecture
> >> as the PCAP plugin?
> >> Thanks,
> >> -- C
>


Re: QUESTION: Packet Parser for PCAP Plugin

2019-04-23 Thread Paul Rogers
Hi Charles,

Two comments. 

First, Drill "maps" are actually structs (nested tuples): every record must 
have the same set of columns within the "map." That is, though the Drill type 
is called a "map", and you might assume that, given that name, it would act 
like a JSON, Python, or Java map, the actual implementation is, in fact, a 
struct. (I saw a JIRA ticket to rename the Map type in some context because of 
this unfortunate mismatch of name and implementation.)

By contrast, Hive defines both Map and Struct types. A Drill "Map" is like a 
Hive Struct, and Drill has no equivalent of a Hive Map. Still, there are 
solutions.

To use a single parsed_packet map column, you'd have to know the union of all 
the columns you'll create across all the packet types and define a map schema 
that includes all these columns. Define this map in all batches so you have a 
consistent schema. This means including all columns for all packet types, even 
if the data does not happen to have all packet types.
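
As an illustration only, here is a small Python sketch of that "union of all 
columns" rule (Python for brevity; this is not Drill code). The field lists in 
PACKET_FIELDS and the helper names are hypothetical. The point is that the 
struct schema is fixed up front and every record carries every member, with 
nulls where the packet type does not apply.

```python
# Hypothetical per-protocol field lists; real parsers would supply these.
PACKET_FIELDS = {
    "dns": ["dns_query_name", "dns_query_type"],
    "icmp": ["icmp_type", "icmp_code"],
}


def union_schema(fields_by_type):
    """Return every field of every packet type, in a stable order."""
    schema = []
    for fields in fields_by_type.values():
        for name in fields:
            if name not in schema:
                schema.append(name)
    return schema


# Defined once, before any data is read, and identical for every batch.
PARSED_PACKET_SCHEMA = union_schema(PACKET_FIELDS)


def to_parsed_packet(parsed_fields):
    """Build one parsed_packet struct; members that do not apply are None."""
    return {name: parsed_fields.get(name) for name in PARSED_PACKET_SCHEMA}


icmp_row = to_parsed_packet({"icmp_type": 8, "icmp_code": 0})
print(icmp_row)
```

An ICMP record still has the DNS members; they are just null.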

Or, you could define a different map for each packet type; but you'd still have 
to define the needed ones up front. You could do this if you had columns 
called, say, parsed_x_packet, parsed_y_packet, etc. If that packet type is 
projected (appears in the SELECT ... clause), then define the required schema 
for all records. The user just selects the packet types of interest.
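
A sketch of this per-type variant, again in Python for brevity and with 
hypothetical names throughout (PACKET_MAP_SCHEMAS, projected_schema, 
build_row, and the parsed_dns_packet / parsed_icmp_packet columns are 
illustrations, not Drill APIs). Only the maps named in the SELECT list are 
defined, but each defined map appears in every record.

```python
# Hypothetical struct schemas, one per packet type, known up front.
PACKET_MAP_SCHEMAS = {
    "parsed_dns_packet": ["id", "query_name", "query_type"],
    "parsed_icmp_packet": ["type", "code"],
}


def projected_schema(select_list):
    """Keep only the packet maps that the query actually projects."""
    return {
        column: PACKET_MAP_SCHEMAS[column]
        for column in select_list
        if column in PACKET_MAP_SCHEMAS
    }


def build_row(schema, packet_type, parsed_fields):
    """Emit every projected map for every record, null-filled when the
    record's packet type does not match the map."""
    row = {}
    for column, fields in schema.items():
        matches = column == f"parsed_{packet_type}_packet"
        row[column] = {
            name: (parsed_fields.get(name) if matches else None)
            for name in fields
        }
    return row


# Equivalent of: SELECT src_ip, parsed_dns_packet FROM ...
schema = projected_schema(["src_ip", "parsed_dns_packet"])
dns_row = build_row(schema, "dns", {"id": 7, "query_name": "example.com", "query_type": "A"})
icmp_row = build_row(schema, "icmp", {"type": 8, "code": 0})
print(dns_row)
print(icmp_row)
```

The ICMP record still carries a parsed_dns_packet map with null members, so 
the schema stays the same across records.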

This brings us to the second comment. The long work to merge the row set 
framework into Drill is coming to a close, and it is now available for you to 
use. The row set framework provides a very simple way to define your map 
schemas (once you know what they are). It also handles projection: the user 
may select some of your parsed packets but not others, or project some of the 
packet map columns but not others.

Drill 1.16 migrates the CSV reader to the new framework (where it also supports 
user-defined schemas and type conversions). The next step in the row set work 
is to migrate a few other readers to the new framework. PCAP might be a good 
candidate, which would also enable your new packet-parsing feature.


Thanks,
- Paul

 

On Tuesday, April 23, 2019, 9:34:16 AM PDT, Charles Givre wrote:

Hello all,
I saw a few open source libraries that parse actual packet content and was 
interested in incorporating this into Drill's PCAP parser.  I was thinking 
initially of writing this as a UDF, however, I think it would be much better to 
include this directly in Drill.  What I was thinking was to create a field 
called parsed_packet that would be a Drill Map.  The contents of this field 
would vary depending on the type of packet.  For instance, if it is a DNS 
packet, you get all the DNS info, ICMP etc...
Does the community think this is a good idea?  Also, given the structure of the 
PCAP plugin, I'm not quite sure how to create a Map field with variable 
contents.  Are there any examples that use the same architecture as the PCAP 
plugin?
Thanks,
-- C  

Re: QUESTION: Packet Parser for PCAP Plugin

2019-04-23 Thread Charles Givre
Hi Ted
The library that gave me the idea is Kaitai Struct.  The Java library 
itself is released under the Apache or MIT license.  It can parse a number of 
binary formats including DNS packets, ICMP and many others.  It accepts a 
byte[] as input. I already wrote working code that reads it but I’m not sure 
how to output these results in Drill. 

> On Apr 23, 2019, at 12:45, Ted Dunning  wrote:
> 
> I think this would be very useful, particularly if it is easy to add
> additional parsing methods.
> 
> When I started the pcap work, I couldn't find any libraries that combined
> what we needed in terms of function and license.
> 
>> On Tue, Apr 23, 2019, 9:34 AM Charles Givre  wrote:
>> 
>> Hello all,
>> I saw a few open source libraries that parse actual packet content and was
>> interested in incorporating this into Drill's PCAP parser.  I was thinking
>> initially of writing this as a UDF, however, I think it would be much
>> better to include this directly in Drill.  What I was thinking was to
>> create a field called parsed_packet that would be a Drill Map.  The
>> contents of this field would vary depending on the type of packet.  For
>> instance, if it is a DNS packet, you get all the DNS info, ICMP etc...
>> Does the community think this is a good idea?   Also, given the structure
>> of the PCAP plugin, I'm not quite sure how to create a Map field with
>> variable contents.  Are there any examples that use the same architecture
>> as the PCAP plugin?
>> Thanks,
>> -- C


Re: QUESTION: Packet Parser for PCAP Plugin

2019-04-23 Thread Ted Dunning
I think this would be very useful, particularly if it is easy to add
additional parsing methods.

When I started the pcap work, I couldn't find any libraries that combined
what we needed in terms of function and license.

On Tue, Apr 23, 2019, 9:34 AM Charles Givre  wrote:

> Hello all,
> I saw a few open source libraries that parse actual packet content and was
> interested in incorporating this into Drill's PCAP parser.  I was thinking
> initially of writing this as a UDF, however, I think it would be much
> better to include this directly in Drill.  What I was thinking was to
> create a field called parsed_packet that would be a Drill Map.  The
> contents of this field would vary depending on the type of packet.  For
> instance, if it is a DNS packet, you get all the DNS info, ICMP etc...
> Does the community think this is a good idea?   Also, given the structure
> of the PCAP plugin, I'm not quite sure how to create a Map field with
> variable contents.  Are there any examples that use the same architecture
> as the PCAP plugin?
> Thanks,
> -- C


QUESTION: Packet Parser for PCAP Plugin

2019-04-23 Thread Charles Givre
Hello all,
I saw a few open source libraries that parse actual packet content and was 
interested in incorporating this into Drill's PCAP parser.  I initially thought 
of writing this as a UDF, but I think it would be much better to include it 
directly in Drill.  My idea is to create a field called parsed_packet that 
would be a Drill Map.  The contents of this field would vary depending on the 
type of packet.  For instance, a DNS packet would yield all the DNS info, an 
ICMP packet the ICMP info, and so on.
Does the community think this is a good idea?  Also, given the structure of 
the PCAP plugin, I'm not quite sure how to create a Map field with variable 
contents.  Are there any examples that use the same architecture as the PCAP 
plugin?
Thanks,
-- C