Re: Issue with separatorSuppressionPolicy and empty elements in IBM4690-TLOG schemas

Mike Beckerle Tue, 28 Nov 2017 10:52:31 -0800

So the relevant grammar productions are in LocalElementGrammarMixin.scala, and 
there are lots of comments in there that indicate a known limitation, e.g.,

in lazy val separatedContentAtMostN:

//FIXME: we don't know whether we can absorb trailing separators or not.
// We don't know if this repeating thing is in trailing position or in the middle of a sequence.

In the lazy val arrayContentsWithSeparators we find this line:

      case (Trailing___, Implicit__, max, ___) => separatedContentAtMostN // FIXME: have to have all of them - not trailing position

The comment there indicates that this is insufficient.

I think we're going to need a bunch of tests that do not use binary data, just 
text, in order to test all the combinations here, so that we can see what is 
going wrong.

I don't really understand a separator-delimited situation where minOccurs is 
zero and maxOccurs is 1.

If we start from the grammar production for recurrence, and I inline substitute 
the productions that match guards for these optional elements (i.e., "a" and 
"b") we will get

OptionalCombinator(

  RepExactlyN(self, 0, separatedRecurringDefaultable) ~
      RepAtMostTotalN(this, 1, separatedRecurringNonDefault) )

And RepExactlyN(self, 0, ...) should get an assertion failure because in the 
constructor for a base class it insists that N > 0.

At least that's how it looks to me. Is that what you are getting?

Seems to me RepExactlyN when N is zero should simply optimize out - the guard 
should be false if N is zero. That would fix the assertion failure.
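
A minimal sketch of that idea, written in Python rather than Daffodil's Scala so it runs on its own. Every name here (EMPTY, guarded, seq, rep_exactly_n, optional_element) is a hypothetical stand-in, not Daffodil's actual API; grammar terms are modelled as plain tuples. It only illustrates how a false guard could make the N = 0 term disappear before the N > 0 check is ever reached.

```python
# Hypothetical model: grammar terms are tuples, EMPTY is a term that parses nothing.
EMPTY = ("empty",)


def rep_exactly_n(n, body):
    # Models the base-class constructor check that insists N > 0.
    assert n > 0, "RepExactlyN requires N > 0"
    return ("RepExactlyN", n, body)


def guarded(guard, build):
    """Build the term only when the guard holds; otherwise optimize it out."""
    return build() if guard else EMPTY


def seq(*terms):
    """Sequence combinator (the '~' in the production) that drops empty terms."""
    kept = [t for t in terms if t is not EMPTY]
    if not kept:
        return EMPTY
    if len(kept) == 1:
        return kept[0]
    return ("seq", *kept)


def optional_element(min_occurs, max_occurs):
    # The required part is guarded, so it is never constructed when minOccurs is 0.
    required = guarded(
        min_occurs > 0,
        lambda: rep_exactly_n(min_occurs, "separatedRecurringDefaultable"),
    )
    extra = guarded(
        max_occurs > min_occurs,
        lambda: ("RepAtMostTotalN", max_occurs, "separatedRecurringNonDefault"),
    )
    return ("OptionalCombinator", seq(required, extra))


# minOccurs=0, maxOccurs=1: only the RepAtMostTotalN term survives.
print(optional_element(0, 1))
```

With the guard in place, the minOccurs="0" case never calls rep_exactly_n at all, so the assertion cannot fire.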

________________________________
From: Joshua Adams
Sent: Tuesday, November 28, 2017 12:37:46 PM
To: Mike Beckerle; Steve Lawrence; dev@daffodil.apache.org
Subject: Re: Issue with separatorSuppressionPolicy and empty elements in 
IBM4690-TLOG schemas

That is correct.  I tried using trailingEmptyStrict after previously using 
trailingEmpty just to see if that would make a difference.

In the sample data file, ace_00_01.dat, the separators for the missing 
SpecialTime element are present, as the data looks like this:
...:<TenderTime data>::<InactiveTime data>:...

Josh

________________________________
From: Mike Beckerle
Sent: Tuesday, November 28, 2017 12:34:09 PM
To: Steve Lawrence; Joshua Adams; dev@daffodil.apache.org
Subject: Re: Issue with separatorSuppressionPolicy and empty elements in 
IBM4690-TLOG schemas

When I look at the IBM TLog schema (TLogAceFormat.xsd), I see the separator 
policy is "suppressedAtEndLax", which is the old property value for 
dfdl:separatorSuppressionPolicy "trailingEmpty", i.e., not strict, per the 
discussion below.

I want to make sure we're looking at the same schema. This matters in that for 
the optional items, separators can be present or absent.

________________________________
From: Steve Lawrence <slawre...@apache.org>
Sent: Tuesday, November 28, 2017 8:08:25 AM
To: Mike Beckerle; Joshua Adams; dev@daffodil.apache.org
Subject: Re: Issue with separatorSuppressionPolicy and empty elements in 
IBM4690-TLOG schemas

I think it might help to show some schema snippets and the resulting
parsers to get an idea of what is going on. A stripped-down snippet of
the tlog schema looks something like this:

  <dfdl:format occursCountKind="implicit"
               lengthKind="delimited"
               separatorSuppressionPolicy="trailingEmptyStrict"
               separatorPosition="prefix" />

  <xs:element name="root">
    <xs:complexType>
      <xs:sequence dfdl:separator=":">
        <xs:element name="a" type="xs:long" />
        <xs:element name="b" type="xs:long" minOccurs="0" />
        <xs:element name="c" type="xs:long" minOccurs="0" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

According to the tlog data and expected infoset, the colon separator
should always exist, even if the elements do not exist. So each of the
following are valid data:

  5:6:7
  5:6:
  5::7
  5::

So there's a mandatory element "a", followed by some optional elements
"b" and "c", and the separators always exist. The generated parser for
this looks like this:

  <seq>
    <Element name="a">
      ...
    </Element>
    <Optional>
      <RepAtMostTotalN name="b" n="1">
        <seq>
          <Separator/>
          <Element name="b">
          ...
          </Element>
        </seq>
      </RepAtMostTotalN>
    </Optional>
    <Optional>
      <RepAtMostTotalN name="c" n="1">
        <seq>
          <Separator/>
          <Element name="c">
          ...
          </Element>
        </seq>
      </RepAtMostTotalN>
    </Optional>
  </seq>

The ... in the above are the parsers for finding delimiters and
converting the delimited text to a string, which isn't too important here.

So it first parses element "a". Then it optionally parses 0 to 1
element "b"s, where each "b" that is parsed must be preceded by a
separator. Note, however, that if element "b" fails to parse, we backtrack
so that the separator is not consumed. And element "b" will fail to
parse on a zero-length delimited value (since only xs:hexBinary and
xs:string allow zero-length representations). The same goes for
element "c". This means that if element "b" or "c" does not exist in the
data, the preceding separator will not be consumed, which is not what
we want.
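
To make the backtracking concrete, here is a minimal Python model of the generated parser shown above. It is only a sketch under simplifying assumptions: the function names are hypothetical, the separator is hard-coded as ":", and xs:long parsing is reduced to matching an optionally signed run of digits. It is not Daffodil code.

```python
import re


def parse_long(data, pos):
    """Parse a ':'-delimited integer at pos; return (value, new_pos) or None."""
    end = data.find(":", pos)
    if end == -1:
        end = len(data)
    text = data[pos:end]
    if not re.fullmatch(r"[+-]?\d+", text):
        return None  # zero-length (or non-numeric) text is a parse error
    return int(text), end


def parse_optional_with_separator(data, pos):
    """Model of Optional(RepAtMostTotalN n=1 (seq(Separator, Element)))."""
    start = pos
    if not data.startswith(":", pos):
        return None, start
    result = parse_long(data, pos + 1)
    if result is None:
        return None, start  # backtrack: the separator is not consumed either
    return result


def parse_root_current(data):
    """Return (infoset, position where parsing stopped)."""
    infoset = {}
    first = parse_long(data, 0)
    if first is None:
        raise ValueError("mandatory element 'a' failed to parse")
    infoset["a"], pos = first
    for name in ("b", "c"):
        value, pos = parse_optional_with_separator(data, pos)
        if value is not None:
            infoset[name] = value
    return infoset, pos


for sample in ("5:6:7", "5:6:", "5::7", "5::"):
    infoset, stopped = parse_root_current(sample)
    print(sample, infoset, "unconsumed:", repr(sample[stopped:]))
```

In this model only "5:6:7" is consumed completely. For "5::7" the attempt at "b" backtracks over the first separator, the attempt at "c" then fails the same way at the same position, and "::7" is left over for whatever follows in the enclosing sequence.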

I think perhaps we want something like the below instead?

  <seq>
    <Element name="a">
      ...
    </Element>
    <Separator/>
    <Optional>
      <RepAtMostTotalN name="b" n="1">
        <seq>
          <OptionalInfixSep><Separator/></OptionalInfixSep>
          <Element name="b">
          ...
          </Element>
        </seq>
      </RepAtMostTotalN>
    </Optional>
    <Separator/>
    <Optional>
      <RepAtMostTotalN name="c" n="1">
        <seq>
          <OptionalInfixSep><Separator/></OptionalInfixSep>
          <Element name="c">
          ...
          </Element>
        </seq>
      </RepAtMostTotalN>
    </Optional>
  </seq>

So in between each <Optional> or <Element> is a mandatory <Separator>,
and each RepAtMostTotalN contains an OptionalInfixSep which will only
consume a Separator when more than one element exists. It's not
immediately obvious to me where this change in the grammar should occur,
or if this is even correct, but this might help provide some
insight/background.
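
A rough Python model of this alternative structure, under the same simplifying assumptions as before (hypothetical function names, ":" hard-coded as the separator, xs:long reduced to a run of digits). The OptionalInfixSep is not modelled, because with n="1" there is never a second occurrence for it to separate. This is a sketch of the idea only, not a proposal for the actual grammar code.

```python
import re


def parse_long(data, pos):
    """Parse a ':'-delimited integer at pos; return (value, new_pos) or None."""
    end = data.find(":", pos)
    if end == -1:
        end = len(data)
    text = data[pos:end]
    if not re.fullmatch(r"[+-]?\d+", text):
        return None  # zero-length (or non-numeric) text is a parse error
    return int(text), end


def parse_root_proposed(data):
    """Mandatory separators between slots; each optional element may be absent."""
    infoset = {}
    first = parse_long(data, 0)
    if first is None:
        raise ValueError("mandatory element 'a' failed to parse")
    infoset["a"], pos = first
    for name in ("b", "c"):
        # The separator is consumed unconditionally, outside the Optional.
        if not data.startswith(":", pos):
            raise ValueError(f"missing separator before slot '{name}'")
        pos += 1
        result = parse_long(data, pos)
        if result is not None:
            infoset[name], pos = result
    if pos != len(data):
        raise ValueError("left over data")
    return infoset


for sample in ("5:6:7", "5:6:", "5::7", "5::"):
    print(sample, parse_root_proposed(sample))
```

All four sample inputs are consumed completely in this model. The trade-off is that data which omits a separator, such as "5" alone, is rejected, which matches the "separators always exist" reading of the tlog data but not a lax policy.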

- Steve

On 11/27/2017 05:00 PM, Mike Beckerle wrote:
> Well the separator suppression code has not had a lot of scrutiny. I wrote this
> a *long time* ago, and honestly have not revisited it since. I assume you
> figured out that separatorSuppressionPolicy replaced the separatorPolicy
> property. This happened after IBM released its first DFDL product; as a result,
> handling both the old and new property names was required.
>
>
> For any of these packed numbers, if you are using delimited lengthKind, then
> zero-length is possible, and it means "absent", meaning that if optional, the
> element is not present. If required, it's an error unless zero-length triggers a
> nil value. If an element is both optional, and empty is a legitimate value, then
> I think empty->optional not present is the winner, but I have to look it up.
>
>
> I wasn't sure what you meant below by "....for IBM4690 and other packed binary
> formats the associated separators aren't processed,...".
>
>
> Probably best for us to talk this through on phone tomorrow (Tuesday). Look for
> me on the instant messenger.
>
> --------------------------------------------------------------------------------
> *From:* Joshua Adams
> *Sent:* Monday, November 27, 2017 3:59:14 PM
> *To:* Mike Beckerle; dev@daffodil.apache.org
> *Cc:* Stephen Lawrence
> *Subject:* Issue with separatorSuppressionPolicy and empty elements in
> IBM4690-TLOG schemas
>
> Hey Mike,
>
> Wanted to get your opinion on the issue I've been running into with IBM4690-TLOG
> schemas.  I talked with Steve for a while trying to figure out what was going on
> and we came to the opinion that there is either an issue with the TLOG schemas,
> or (perhaps more likely) there is an error in the separatorSuppressionPolicy
> code when dealing with infix separators in Daffodil.
>
> In the TlogAce.xsd file
> (https://github.com/DFDLSchemas/IBM4690-TLOG/blob/master/ACE/TlogAce.xsd#L155)
> it seems that the way the schema and data files were written assumed that the
> IBM4690 packed format could have a valid zero length representation, i.e., an
> optional element that doesn't occur would just be an empty string surrounded by
> separators.  While this works just fine for strings or hex binary that have
> valid zero length representations, for IBM4690 and other packed binary formats
> the associated separators aren't processed, and in the TlogAce.xsd file, when
> the element SpecialTime is missing all subsequent parsed data in the sequence
> become CustomUserField's as that is the only element that matches the separators
> (I think).
>
> So, just wanted to get your opinion on whether or not this is an issue with the
> current Daffodil separator suppression policy code or if this is a case of an
> incorrectly formed schema.  Steve may jump in to clarify anything I didn't
> explain correctly, as he is a bit more familiar with the separatorSuppression
> code in Daffodil.
>
> Thanks,
>
> Josh
>
