large PCAP files and changes to message-streaming API for onPath() support vs. SAX behavior

Mike Beckerle Thu, 29 Mar 2018 09:04:21 -0700

I was recently asked about Daffodil and processing gigantic (multi-gigabyte) 
PCAP files.



We have two features planned for 2.2.0 that help with memory footprint issues: 
BLOBs and the message-streaming feature.


At first glance, neither will help with giant PCAP files. A PCAP file is 
large, but not because it contains a binary BLOB such as an image or video, 
and it is a single giant tree, not a stream of messages that all share the 
same root element.


This use case was the justification for requesting a SAX-style true 
event-driven behavior for Daffodil.


Long term, that's great, but SAX is complex to implement given DFDL and 
points-of-uncertainty/backtracking in the parser, so I wanted to explore 
whether, with some small API changes, we could dodge this SAX-bullet, at least 
for PCAP.


For PCAP, the file consists of a global header followed by a bunch of 
packets. The packets are exactly like a message stream. If we could just skip 
past the header while keeping the state we need from it, such as the byte 
order, then a message-streaming pull-type parser would be ideal.
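
To make the shape of the problem concrete, here is a minimal sketch in plain 
Python (not Daffodil, and not a proposed API) of the behavior described above: 
consume the global header once, keep only the byte order from it, then hand 
back one packet per pull. The names Packet, read_global_header, and 
iter_packets are made up for illustration, and the sketch assumes the classic 
microsecond-resolution pcap layout (24-byte global header, 16-byte record 
headers).

```python
import io
import struct
from typing import BinaryIO, Iterator, NamedTuple

PCAP_MAGIC = 0xA1B2C3D4  # classic microsecond-resolution pcap magic number


class Packet(NamedTuple):  # hypothetical record type for this sketch
    ts_sec: int
    ts_usec: int
    orig_len: int
    data: bytes


def read_global_header(stream: BinaryIO) -> str:
    """Consume the 24-byte global header and return the struct byte-order prefix."""
    header = stream.read(24)
    if len(header) != 24:
        raise ValueError("truncated pcap global header")
    # The magic number reads correctly in exactly one byte order.
    for prefix in ("<", ">"):
        if struct.unpack(prefix + "I", header[:4])[0] == PCAP_MAGIC:
            return prefix
    raise ValueError("not a classic pcap file")


def iter_packets(stream: BinaryIO) -> Iterator[Packet]:
    """Yield packets one at a time, so only one packet is held in memory."""
    order = read_global_header(stream)  # state carried over from the header
    record = struct.Struct(order + "IIII")  # ts_sec, ts_usec, incl_len, orig_len
    while True:
        raw = stream.read(record.size)
        if not raw:
            return  # clean end of file
        if len(raw) != record.size:
            raise ValueError("truncated packet header")
        ts_sec, ts_usec, incl_len, orig_len = record.unpack(raw)
        data = stream.read(incl_len)
        if len(data) != incl_len:
            raise ValueError("truncated packet data")
        yield Packet(ts_sec, ts_usec, orig_len, data)


if __name__ == "__main__":
    # Tiny big-endian capture built in memory: global header plus two packets.
    header = struct.pack(">IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
    record = struct.pack(">IIII", 1, 2, 3, 3) + b"abc"
    for packet in iter_packets(io.BytesIO(header + record * 2)):
        print(packet.ts_sec, packet.data)
```

The point of the sketch is that the only thing surviving from the header is 
the byte order, and memory use is bounded by the largest single packet rather 
than by the file size.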


(We would also need the symmetric unparser behavior.)


Our ProcessorFactory method pf.onPath("/Packets") would in theory be usable 
with the message streaming API to sequence through just the packets with each 
parse call returning the Infoset for one packet. The path given to pf.onPath is 
supposed to be a path to an array element, relative to the root element that 
the PF was compiled for.


What is involved in implementing a pf.onPath(...) that actually steps downward 
into the data stream to skip past some material before beginning the iteration?


For unparsing, things aren't quite so symmetric. There we need to provide the 
infoset events for the part of the data that onPath(...) skips past.
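
The asymmetry can be seen in a plain-Python sketch of the output direction 
(again not Daffodil; write_pcap and its parameters are hypothetical names for 
illustration, assuming the classic pcap layout). The header fields cannot be 
derived from the packet stream, so the caller has to supply them up front 
before any packet can be written.

```python
import io
import struct
from typing import BinaryIO, Iterable, Tuple

PCAP_MAGIC = 0xA1B2C3D4  # classic microsecond-resolution pcap magic number

# (ts_sec, ts_usec, payload) for one packet; a made-up shape for this sketch
PacketFields = Tuple[int, int, bytes]


def write_pcap(
    stream: BinaryIO,
    packets: Iterable[PacketFields],
    order: str = "<",      # struct byte-order prefix: "<" little, ">" big
    snaplen: int = 65535,
    linktype: int = 1,     # 1 is LINKTYPE_ETHERNET
) -> int:
    """Write the global header, then stream packets out one at a time."""
    # The header must come first, and its content is supplied by the caller:
    # this is the "skipped" part that parsing could simply step past.
    stream.write(
        struct.pack(order + "IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, snaplen, linktype)
    )
    record = struct.Struct(order + "IIII")  # ts_sec, ts_usec, incl_len, orig_len
    count = 0
    for ts_sec, ts_usec, payload in packets:
        stream.write(record.pack(ts_sec, ts_usec, len(payload), len(payload)))
        stream.write(payload)
        count += 1
    return count


if __name__ == "__main__":
    out = io.BytesIO()
    written = write_pcap(out, iter([(1, 2, b"abc"), (3, 4, b"")]), order=">")
    print(written, len(out.getvalue()))
```

Because packets is any iterable, a generator can feed it, so the writer never 
needs the whole capture in memory either.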


It would probably be sufficient to implement a very simple subset of path 
expressions, e.g., paths only to first-level arrays, not to arrays within 
arrays, not to anything inside a nested point of uncertainty, etc.


If this is easy, it may allow us to postpone the SAX stuff longer. If it is 
complex, then I would guess it isn't worth it, and we should just go for true 
event-style parse and unparse.


Thoughts?
