I was recently asked about Daffodil and processing gigantic (multi-gigabyte)
PCAP files.
We have two features planned for 2.2.0 that help with memory footprint issues:
BLOBs and the message-streaming feature.
At first glance, neither will help with giant PCAP files. A PCAP file is not
large because it contains a binary BLOB, the way an image or video file would
be. It is also a single giant tree, not a stream of messages that all share
the same root element.
This use case was the justification for requesting SAX-style, truly
event-driven behavior for Daffodil.
Long term that's great, but SAX is complex to implement given DFDL and
points-of-uncertainty/backtracking in the parser, so I wanted to explore
whether with some small API changes we could dodge this SAX-bullet at least for
PCAP.
A PCAP file consists of a global header followed by a sequence of packets. The
packets are exactly like a message stream. If we could skip past the header
while keeping the state we need from it, such as the byte order, then a
message-streaming, pull-type parser would be ideal.
(We would also need the corresponding unparser behavior.)
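
To make the shape of the problem concrete, here is a minimal sketch in plain
Python of the pattern described above: read the global header once, keep the
byte order, then pull packets one at a time. This is not Daffodil code. The
function names are made up for illustration, and it assumes the classic PCAP
layout (24-byte global header, 16-byte per-packet header), not pcapng.

```python
import io
import struct

GLOBAL_HEADER_LEN = 24  # magic, version major/minor, thiszone, sigfigs, snaplen, network
RECORD_HEADER_LEN = 16  # ts_sec, ts_usec, incl_len, orig_len


def read_global_header(stream):
    """Consume the global header and return the state the packets depend on."""
    raw = stream.read(GLOBAL_HEADER_LEN)
    if len(raw) != GLOBAL_HEADER_LEN:
        raise ValueError("truncated PCAP global header")
    # The magic number 0xa1b2c3d4 reveals the byte order of the whole file.
    if raw[:4] == b"\xd4\xc3\xb2\xa1":
        order = "<"  # little-endian
    elif raw[:4] == b"\xa1\xb2\xc3\xd4":
        order = ">"  # big-endian
    else:
        raise ValueError("not a classic PCAP file")
    _magic, major, minor, _zone, _sigfigs, snaplen, network = struct.unpack(
        order + "IHHiIII", raw
    )
    return {"order": order, "version": (major, minor),
            "snaplen": snaplen, "network": network}


def iter_packets(stream, order):
    """Yield one packet per iteration; only one packet is held in memory."""
    while True:
        hdr = stream.read(RECORD_HEADER_LEN)
        if not hdr:
            return  # clean end of file
        if len(hdr) != RECORD_HEADER_LEN:
            raise ValueError("truncated packet header")
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack(order + "IIII", hdr)
        data = stream.read(incl_len)
        if len(data) != incl_len:
            raise ValueError("truncated packet data")
        yield {"ts_sec": ts_sec, "ts_usec": ts_usec,
               "orig_len": orig_len, "data": data}
```

The point of the sketch is only that the header is consumed once and its state
(here, the byte order) is carried into every subsequent per-packet call.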
Our ProcessorFactory method pf.onPath("/Packets") would in theory be usable
with the message streaming API to sequence through just the packets with each
parse call returning the Infoset for one packet. The path given to pf.onPath is
supposed to be a path to an array element, relative to the root element that
the PF was compiled for.
What is involved in implementing a pf.onPath(...) that actually steps downward
into the data stream to skip past some material before beginning the iteration?
For unparsing, things aren't quite so symmetric. There we need to provide the
infoset events for the part of the data that onPath(...) skips past when
parsing.
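
The asymmetry is easy to see in a plain Python sketch of the output side (again
not Daffodil code; the function names are hypothetical and the classic PCAP
layout is assumed). A parser can skip the header, but a writer cannot: the
caller has to supply the header content before any packets can be streamed out.

```python
import io
import struct

PCAP_MAGIC = 0xA1B2C3D4


def write_global_header(stream, order, snaplen=65535, network=1):
    """Emit the 24-byte global header; this must happen before any packet."""
    # Version 2.4, thiszone 0, sigfigs 0.
    stream.write(
        struct.pack(order + "IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, snaplen, network)
    )


def write_packets(stream, order, packets):
    """Stream packets out one at a time, using the header's byte order."""
    for ts_sec, ts_usec, data in packets:
        # incl_len and orig_len are equal here because nothing is truncated.
        stream.write(
            struct.pack(order + "IIII", ts_sec, ts_usec, len(data), len(data))
        )
        stream.write(data)
```

In Daffodil terms, the write_global_header step corresponds to the infoset
events for the skipped part, which have to come from somewhere.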
It would probably be sufficient to implement a very simple subset of path
expressions, e.g., only paths to first-level arrays, not arrays within arrays,
and not anything inside a nested point of uncertainty.
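
As a rough illustration of how small that subset could be, here is a
hypothetical syntactic check in Python. The function name and the exact grammar
are my assumptions, not anything Daffodil defines. It accepts only a single
step below the root, optionally with a namespace prefix. It cannot tell whether
the element is really an array or sits inside a point of uncertainty; that
would need the compiled schema.

```python
import re

# One absolute step, optionally prefixed: "/Packets" or "/pcap:Packets".
# No nested steps, no predicates, no relative paths.
_SINGLE_STEP = re.compile(r"/(?:[A-Za-z_][\w.\-]*:)?[A-Za-z_][\w.\-]*")


def is_supported_path(path):
    """Return True if the path is a single child step below the root."""
    return _SINGLE_STEP.fullmatch(path) is not None
```

With this check, "/Packets" would be accepted, while "/Packets/Packet" and
"/Packets[1]" would be rejected.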
If this is easy, it may allow us to postpone the SAX work longer. If it is
complex, then I would guess it isn't worth it and we should just go for true
event-style parse and unparse.
Thoughts?