thoughts on fixing trailing separator suppression

Mike Beckerle Tue, 22 May 2018 12:55:58 -0700

I haven't completed this note yet, but as I am having to shift gears for a few 
days, I am providing my current state of thinking around what's actually 
required here for posterity/amusement, and to share what I anticipate will 
actually be a great improvement to Daffodil, but non-trivial to implement.



Comments are welcome. This needs to turn into a plan/design for the fix.


-----------------------


Design/Redesign Issues for Sequences with Separators

We recently found issues associated with nested separators, and with 
dfdl:separatorSuppressionPolicy 'trailing' and 'trailingStrict'.

It's not possible to get the IBM4690TLog DFDL schema to work on Daffodil until 
these are fixed, and the bug with nested separators also has impact on the DFDL 
schema for USMTF.

Fixing these is going to require some refactoring and reimplementation starting 
from the daffodil-core grammar package.

The Problems

The grammar has some mistakes in it that make it impossible to properly 
implement.

* The first place we look at dfdl:separatorSuppressionPolicy is in the grammar 
production for arrayContentsWithSeparators. This is incorrect because separator 
suppression can apply to terms in a sequence that are not related by being part 
of an array.

That said, it is not terribly important (from a bug standpoint) to support 
separator suppression except for when there are repeating/optional elements 
involved.

The bigger problem is:

* The separatedForArrayPosition(body) method composes together the separators 
with the term following them that is separated. This prevents independent 
orchestration of the parse of the separator with that of the following element. 
It is depending on the parse of the separator failing, and that causing 
backtracking to try subsequent alternatives, as a way of identifying which 
delimiter was actually found. This could be made to work, but it requires 
changes to the way separators and all delimiters really, are parsed.

The current design, where separators are composed together into the child 
terms, makes implementation of trailing separator suppression basically 
impossible.

Fact is, all of this was designed to recursively compose by just composing 
parsers from other parsers, but that turns out to be naive because it doesn't 
provide enough structure to issue really good clear diagnostics, and it is too 
hard to tell what is going on.

-----------------------------------------------

A sequence combinator should be laid down depending on whether the sequence has 
separators or not, and whether it has trailing separator suppression or not.

That is, the decision we see being made in arrayContentsWithSeparators, needs 
to be made in SequenceGrammarMixin in the orderedSequenceContent production. 
(When implemented unorderedSequenceContent may be similar.)

The kind of sequence combinator being laid down should depend on separators or 
not right at that point, and the sequence combinator for the hasSeparators case 
will have a rich set of parameters.

The asTermInSequence, and asTermInArray both need to go away. Instead of 
combining all these characteristics into one composed term that we 
parse/unparse, with separators embedded into it, whether it is even represented 
embedded in it, etc. Instead we need an array of tuples, each tuple is a 
(termContentBody, termRuntimeData) where termRuntimeData lets us know if the 
term isNotRepresented, is scalar, or is array, and what the max/min occurs are, 
etc.

So the SeparatedSequenceCombinator then must orchestrate everything about 
prefix/posfix separators, and the loops needed for repeating/array/optional 
terms.

I expect we will have at least

SeparatedSequenceCombinator
UnseparatedSequenceCombinator

I believe the entire LocalElementGrammarMixin will have to be 
refactored/rewritten. I think every method will likely have to be removed.

All these methods have to go away:
separatedEmpty
separatedRecurring
StopValue - which isn't implemented anyway, so it's just clutter.
recurrance
separatedContentWithMinUnboundedWithoutTrailingEmpties
separatedContentWithMinAndMaxWithoutTrailingEmpties
separatedContentWithMinUnbounded
separatedContentWithMinAndMax
separatedContentZeroToUnbounded
separatedContentAtMostNWithoutTrailingEmpties
stopValueSize
separatedContentExactlyN
separatedContentExactlyNComputed
arrayContents
arrayContentsNoSeparators
arrayContentsWithSeparators
contentUnbounded

Instead, LocalElementGrammarMixin - which is all about recurrance/optionality - 
will be in service of several kinds of sequence combinators. Its methods will 
provide the things needed in order for proper compilation of the parameters 
given to a sequence combinator. It may not be a grammar mixin anymore. It may 
just become methods on LocalElementMixin.

The chosen sequence combinator must interoperate with the repeat combinators 
when dealing with separators. Or alternatively, it must subsume their 
functionality.

---------------------------
The below is probably NOT true:

Interplay of field isolation that scans for a value, ending at a delimiter, 
vs., after that, we parse the delimiter again, is where we get into trouble.

We're depending on backtracking - when it tries to "reparse" the separator, but 
doesn't find the separator, then backtracking causes the parser to then again 
parse the field, but then try the next outward possible terminating markup. 
Until something parses successfully.

------------------

TODO and FIXME comments in the DelimiterText code suggest this is not being 
done properly - elements that could NOT be terminated by the sequence's 
terminator are nevertheless being scanned with the sequence's terminator as one 
of the possible delimiters.  (Though the semantics of nested delimiters are a 
bit unclear here. For example, if a simple type element has a terminator, then 
one would *think* that one would not need to escape the enclosing group 
separator appearing within it, because the only thing that matters is the 
terminator; however, this is not the case. One must still escape the enclosing 
separator. (This is the agreed upon view of the DFDL Workgroup. I am not sure 
it is right... but we may have to introduce a different lengthKind='delimited2' 
variation to provide the alternate behavior where a terminator on a simple 
element eliminates the need to escape anything except that terminator.

-----------------------

If we create new Sequence combinators and eliminate the composition of 
separators into child term parsers, we can better orchtestrate how the 
delimited term parsers identify a simple-type element, scanning until some 
terminating markup is found, with the testing for whether what was found is the 
expected separator or not.

That is, we can eliminate the redundant parse of the delimiter that happens 
currently - first to isolate the element value, second to consume the 
delimiter, verifying that it is an expected separator, etc.

thoughts on fixing trailing separator suppression

Reply via email to