I've been dealing with lots of data formats lately which involve lookahead
needs, and I'm trying to come up with a cleaner, easier, and less ad-hoc
way to deal with them in DFDL to propose for the future. I wanted to bounce
this idea off the user community before prototyping.

Idea: Property dfdlx:peek="yes/no". Property on a sequence model group.
Compatible with a hidden group ref. Unlike other properties, which are
disallowed on a sequence with dfdl:hiddenGroupRef, this could be allowed on
hidden sequences. That's actually a primary use case for this.

If "no", no behavior change (what things do now).

If "yes", the parse happens, including dfdl:setVariable assignments,
infoset creation, etc. Then, at the end of the sequence, if the parse is
successful, the position in the data stream is reset to where it was at the
start of the peek sequence. If the parse fails, you backtrack to the
enclosing PoU as normal, and everything about the peek (when inside the
PoU) is discarded.

This allows the parser to learn by parsing something in the data more than
once. The first pass discovers something, which goes into the parser
infoset (hidden or not) and into single-assignment DFDL variables. The
second pass can then parse making use of what was learned.

This is sort of like backtracking at a PoU, but you don't undo anything
except the position in the data stream.
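
To make the intended behavior concrete, here is a minimal sketch in Python
(not DFDL, and not a description of how Daffodil works internally). The
names ParseState, peek_sequence, and look_for_tag are hypothetical and
exist only for this illustration. It assumes a toy format where a one-byte
tag sits two bytes ahead of the current position.

```python
class ParseError(Exception):
    """Raised when a parse step fails (stands in for a processing error)."""


class ParseState:
    """Toy parser state: data position, variables, and a flat infoset."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0          # current position in the data stream
        self.variables = {}   # stands in for DFDL variables
        self.infoset = []     # list of (name, value) pairs

    def skip(self, n: int) -> None:
        if self.pos + n > len(self.data):
            raise ParseError("not enough data to skip")
        self.pos += n

    def read_uint8(self, name: str) -> int:
        if self.pos >= len(self.data):
            raise ParseError("not enough data to read")
        value = self.data[self.pos]
        self.pos += 1
        self.infoset.append((name, value))
        return value


def peek_sequence(state: ParseState, body) -> None:
    """Run body, then reset only the data position if it succeeded.

    Variables and infoset entries created by body are kept. On failure the
    ParseError propagates; undoing state is left to the enclosing point of
    uncertainty, which this sketch does not model.
    """
    start = state.pos
    body(state)
    state.pos = start


def look_for_tag(state: ParseState) -> None:
    # Hypothetical peek body: look two bytes ahead and remember the tag.
    state.skip(2)
    state.variables["tag"] = state.read_uint8("peekedTag")


state = ParseState(bytes([0x10, 0x20, 0x07, 0x99]))
peek_sequence(state, look_for_tag)
# The position is back at the start, but the learned tag is retained.
print(state.pos, state.variables["tag"])
```

After the peek, the real parse of the same bytes can start from the
original position and branch on the saved variable.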

On unparsing, all data written while unparsing the infoset for a sequence
with dfdlx:peek="yes" is discarded. Or maybe we can just say the infoset
corresponding to a sequence with dfdlx:peek="yes" is not unparsed at all.

Implementations could put a limit on how far ahead you can peek. But a
minimum of, say, 512 bytes or maybe a bit bigger makes sense, I think. That
would be enough for every use case I have.
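
One way an implementation might enforce such a limit is sketched below in
Python. This is an assumption-laden illustration, not a proposal for the
actual mechanism: Cursor, PeekLimitError, and peek_limit are hypothetical
names, and whether exceeding the limit should be an ordinary parse failure
or a separate kind of error is left open here.

```python
from contextlib import contextmanager


class PeekLimitError(Exception):
    """Raised when a peek tries to read past the allowed lookahead."""


class Cursor:
    """Toy byte cursor with a bounded peek (hypothetical, for illustration)."""

    def __init__(self, data: bytes, peek_limit: int = 512):
        self.data = data
        self.pos = 0
        self.peek_limit = peek_limit
        self._peek_end = None  # furthest readable position while peeking

    def read(self, n: int) -> bytes:
        end = self.pos + n
        if self._peek_end is not None and end > self._peek_end:
            raise PeekLimitError("peek exceeded %d bytes" % self.peek_limit)
        if end > len(self.data):
            raise EOFError("not enough data")
        chunk = self.data[self.pos:end]
        self.pos = end
        return chunk

    @contextmanager
    def peek(self):
        start = self.pos
        outer_end = self._peek_end
        limit_end = start + self.peek_limit
        # A nested peek can never see further than its enclosing peek.
        self._peek_end = limit_end if outer_end is None else min(outer_end, limit_end)
        try:
            yield self
            self.pos = start  # reset the position only on success
        finally:
            self._peek_end = outer_end


cursor = Cursor(bytes(1024), peek_limit=512)
with cursor.peek():
    cursor.read(512)  # exactly at the limit, so this is allowed
print(cursor.pos)
```

Reading 513 bytes inside the same peek would raise PeekLimitError in this
sketch.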

I believe current restrictions in DFDL to ensure forward progress when
parsing are sufficient to make it impossible to delay parsing forever with
this. I.e., parsing can take a long time, but it still has to terminate (at
least in theory, if there is enough memory for a big infoset).

I think this dfdlx:peek has some nice properties.

Pro: This is the most important thing: No specialized constructs for
looking ahead. Just use DFDL to learn about the data, save it in variables
or a piece of infoset that you can navigate with expressions to utilize the
knowledge.

Pro: Composition properties are good. Nothing new to learn. I can think of
no impact on backtracking or any other aspects.

Pro: Pretty cheap to implement, so long as the amount you can peek ahead is
reasonably bounded.

Pro: Synergistic with existing things like newVariableInstance and hidden
groups to capture learning from a peek ahead.

Con: The "really big hammer" problem. Everything looks like a nail. I.e.,
this has huge generality. Peeking ahead with a sequence with really rich
sub-structure, PoUs and backtracking inside it, etc. That's all enabled by
this feature, but none of the use cases I have need anything like that
level of generality. This is one of those things where the stuff people
will invent with it is unanticipated.

All thoughts / musings are welcome.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com
