I was recently asked about Daffodil and processing gigantic (multi-gigabyte) PCAP files.
We have two features planned for 2.2.0 that help with memory footprint issues: BLOBs and the message-streaming feature. At first glance, neither will help with giant PCAP files. A PCAP file is not large because of a binary BLOB in it, the way an image or video file would be, and it is also a single giant tree, not a stream of messages all with the same root element.

This use case was the justification for requesting SAX-style, truly event-driven behavior for Daffodil. Long term that's great, but SAX is complex to implement given DFDL and the points of uncertainty/backtracking in the parser, so I wanted to explore whether, with some small API changes, we could dodge this SAX bullet, at least for PCAP.

A PCAP file consists of a global header followed by a bunch of packets. The packets are exactly like a message stream. If we could just skip past the header while keeping the state we need from it, like the byte order, then a message-streaming pull-type parser would be ideal. (We would also need the symmetric unparser behavior.)

Our ProcessorFactory method pf.onPath("/Packets") would in theory be usable with the message-streaming API to sequence through just the packets, with each parse call returning the Infoset for one packet. The path given to pf.onPath is supposed to be a path to an array element, relative to the root element that the PF was compiled for.

What is involved in implementing a pf.onPath(...) that actually steps downward into the data stream to skip past some material before beginning the iteration?

For unparsing, things aren't quite so symmetric. There we need to provide the infoset events for the part of the data we're skipping past with the onPath(...).

It would probably be sufficient to implement a very simple subset of path expressions, e.g., only paths to first-level arrays, not arrays within arrays, and not anything inside a nested point of uncertainty.

If this is easy, it may allow us to postpone the SAX work longer. If it is complex, then I would guess it isn't worth it, and we should just go for true event-style parse and unparse.

Thoughts?
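
P.S. To make the access pattern concrete, here is a rough sketch in plain Python of what "skip the header, keep the byte order, then pull one packet per call" amounts to for PCAP. This is not Daffodil code and not a proposal for the API; it only illustrates the shape of the iteration that onPath plus message streaming would have to provide. The names Packet, read_byte_order, packets, and demo_capture are made up for illustration. It assumes the classic microsecond PCAP layout (24-byte global header, 16-byte per-packet header) and ignores the nanosecond and pcapng variants.

```python
import io
import struct
from typing import BinaryIO, Iterator, NamedTuple


class Packet(NamedTuple):
    """One packet record (hypothetical stand-in for a per-packet infoset)."""
    ts_sec: int
    ts_usec: int
    orig_len: int
    data: bytes


def read_byte_order(stream: BinaryIO) -> str:
    """Consume the 24-byte global header and return the struct byte-order
    prefix ('<' or '>'). This is the only header state kept for the packets."""
    header = stream.read(24)
    if len(header) < 24:
        raise ValueError("truncated PCAP global header")
    magic = header[:4]
    if magic == b"\xd4\xc3\xb2\xa1":  # 0xa1b2c3d4 written little-endian
        return "<"
    if magic == b"\xa1\xb2\xc3\xd4":  # 0xa1b2c3d4 written big-endian
        return ">"
    raise ValueError("not a classic (microsecond) PCAP file")


def packets(stream: BinaryIO) -> Iterator[Packet]:
    """Skip past the global header, then yield one packet at a time.
    Only the current packet is held in memory."""
    order = read_byte_order(stream)
    record = struct.Struct(order + "IIII")  # ts_sec, ts_usec, incl_len, orig_len
    while True:
        raw = stream.read(record.size)
        if not raw:
            return  # clean end of file
        if len(raw) < record.size:
            raise ValueError("truncated packet header")
        ts_sec, ts_usec, incl_len, orig_len = record.unpack(raw)
        data = stream.read(incl_len)
        if len(data) < incl_len:
            raise ValueError("truncated packet data")
        yield Packet(ts_sec, ts_usec, orig_len, data)


def demo_capture(order: str = "<") -> bytes:
    """Build a tiny two-packet capture in memory, for trying the sketch out."""
    header = struct.pack(order + "IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    pkt1 = struct.pack(order + "IIII", 1, 0, 3, 3) + b"abc"
    pkt2 = struct.pack(order + "IIII", 2, 500, 2, 2) + b"de"
    return header + pkt1 + pkt2
```

Iterating over packets(io.BytesIO(demo_capture())) yields the two packets in order. The point is that the loop never needs anything from the header beyond the byte order, which is the state an onPath-based skip would have to carry forward. The unparse direction has no equivalent shortcut in this sketch, since the header bytes still have to come from somewhere.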