FYI: progress report on separated sequences and separator suppression

Beckerle, Mike Fri, 03 May 2019 20:27:13 -0700

This email is FYI only. You can skip it unless you care about this specific 
development topic.


I made quite a lot of progress this past week, so I thought it worth 
reporting, given that fixing these issues so that EDIFACT and TLOG can 
work has been delayed for so long.

There are numerous JIRA tickets associated with this problem area:
DAFFODIL-1080, DAFFODIL-1976, DAFFODIL-1886, DAFFODIL-1919, DAFFODIL-110

On my branch, separated sequences have been substantially revised, and I hope 
to get this into code review as a PR in a few days.

I added a new property, daf:emptyElementPolicy, intended to control whether 
Daffodil implements the DFDL spec, or bends the rules in order to be compatible 
with IBM DFDL so that we can run their published DFDL schemas on GitHub.

The status of my daffodil-1080-sep branch is that all tests in daffodil-test 
pass.

There are exactly 3 errors in daffodil-test-ibm1:
    
test_AX000
test_ptLax1rt
    
    The above fail because daffodil doesn't implement the behavior where
    empty strings are only created for optional string elements when there
    is some non-zero-length syntax defined by dfdl:emptyValueDelimiterPolicy
    and initiator/terminator.

    Daffodil is creating an empty string value here based on just the
    presence of a separator, which is incorrect.
    When dfdl:separatorSuppressionPolicy is trailingEmpty (or
    trailingEmptyStrict), this should NOT create an empty string value.
    It should just tolerate the separator (or, for trailingEmptyStrict,
    not tolerate it).
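
    As a rough illustration of the intended behavior, here is a toy model
    in Python. It is not Daffodil code. The function name, the dict result,
    and the choice to simply skip interior zero-length positions are all
    assumptions made for this sketch.

```python
def parse_optional_strings(data, sep=",", strict=False):
    """Toy model of a separated sequence of optional string elements.

    A zero-length position never yields an empty-string element.
    strict=False models trailingEmpty (a trailing separator is tolerated);
    strict=True models trailingEmptyStrict (a trailing separator is an error).
    Returns a dict mapping position -> value for the elements that exist.
    """
    if strict and data.endswith(sep):
        raise ValueError("trailing separator not permitted in strict mode")
    elements = {}
    for position, text in enumerate(data.split(sep)):
        # Hypothetical rule for this sketch: only non-empty text creates an element.
        if text != "":
            elements[position] = text
    return elements
```

    With this model, "a,b,," yields elements only at positions 0 and 1 in
    the non-strict case, and is rejected in the strict case.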
    
test_ptg3_1p_ibm_daf
    
    The above fails because in the new daf:emptyElementPolicy
    noEmptyElements mode, daffodil does not cause a processing error on a
    required string element (a scalar, or a required array element, i.e.
    one below minOccurs) that has empty-string as its value. This causes a
    parse error on IBM DFDL, and the daf:emptyElementPolicy of
    noEmptyElements is supposed to be compatible with this.
    (In addition, if a default value is specified, then we need to produce a
    runtime SDE, so that this will not backtrack. This is also consistent
    with IBM DFDL behavior.)
    Right now daffodil is creating empty-string elements here, which it
    shouldn't be doing in this compatibility noEmptyElements mode, though in
    regular emptyElementPolicy="emptyElements" mode this would be correct
    behavior.
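
    The intended difference between the two modes can be sketched as
    follows. This is a toy Python model, not Daffodil code: the exception
    classes and the function are hypothetical names, and the default-value
    case in emptyElements mode is deliberately left unmodeled.

```python
class ProcessingError(Exception):
    """Stands in for a parse error that allows backtracking (hypothetical)."""


class RuntimeSDE(Exception):
    """Stands in for a runtime schema definition error (hypothetical)."""


def parse_required_string(text, policy="emptyElements", has_default=False):
    """Toy model of a required string element under daf:emptyElementPolicy."""
    if policy not in ("emptyElements", "noEmptyElements"):
        raise ValueError("unknown policy: " + policy)
    if text != "":
        return text
    if policy == "noEmptyElements":
        if has_default:
            # Per the text above: a runtime SDE, so that this will not backtrack.
            raise RuntimeSDE("empty required string with a default value")
        raise ProcessingError("empty required string element")
    if has_default:
        # Not described in this report, so this sketch does not guess.
        raise NotImplementedError("defaults in emptyElements mode not modeled")
    # emptyElements mode: an empty-string element is the correct result.
    return text
```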

I believe fixing the above will fix several of the regressions on published 
DFDL schemas also.
 
This change set is extensive enough that I also ran all the published DFDL
schemas from the DFDLSchemas site on GitHub (and iCalendar as well).

Published Schema Regressions:
    
iCalendar - now gets an SDE - implicit with unbounded maxOccurs only
allowed on last declared element of sequence. This is not due to my
changes, but a check that has been added recently. 

mil-std-2045 - 2 tests fail. One is Terminator 7F not found, the other is 
empty children related: expected 5 children got 3. Probably same issue
as identified above for one of the daffodil-test-ibm1 tests.

png - many tests fail. All for same reason: expected 1 child got 0. 
Probably same issue as identified above for one of the daffodil-test-ibm1 tests.

(Also bmp - fails with java out of heap space, but that was true of 
2.3.0 released version of Daffodil - see DAFFODIL-2118)
   
Now of course the objective of these separated sequence changes is to get more 
published DFDL schemas to run. Specifically, EDIFACT and ibm4690-TLog (aka TLOG).
 
Progress on EDIFACT
* The one test fails for the same reason as test_ptg3_1p_ibm_daf, or at least
that is what it is currently clearly failing on. It runs and produces an infoset.
Note: EDIFACT takes like a minute+ to compile the schema. Ugh. 

Progress on TLOG
* 2 of 5 tests pass
* 3 others fail - reasons as yet unanalyzed. They run and produce infosets, but
those infosets aren't the same as what is expected.

Final point: performance - the unparser for separated sequences with separator 
suppression uses some pretty heavy-weight techniques - it creates suspendable 
unparsers for the separators that might be suppressed. The performance 
implications of this are as yet unexamined. I've been focused on just getting 
the behavior to be right first.
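
To show why the separator decision has to be deferred when unparsing, here is 
a toy Python sketch. It is a plain buffering loop, not Daffodil's suspension 
mechanism. The function name is hypothetical, and treating None or "" as an 
absent element is an assumption of the sketch.

```python
def unparse_trailing_suppressed(values, sep=","):
    """Toy model: write values with separators, suppressing trailing ones.

    A separator is only written once a later non-empty value proves it is
    not trailing, so its output is held back until then.
    """
    out = []
    pending = 0  # separators whose fate is not yet known
    for index, value in enumerate(values):
        if index > 0:
            pending += 1
        if value:
            # A non-empty value resolves every separator held back so far.
            out.append(sep * pending)
            pending = 0
            out.append(value)
    # Separators still pending at the end were trailing and are dropped.
    return "".join(out)
```

Even in this toy, no separator can be written at the moment it is encountered, 
which hints at where the cost in the real unparser might come from.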
