Thank you Mike - that is a fantastic solution.
But, but, but ....
I've got one tiny problem: what about the right parenthesis at the end? The
input starts with a left parenthesis and ends with a right parenthesis, e.g.,
PROLOG
(A-B-C
-D-E
-F)
Notice that the last field is F and it is not followed by a dash, CR/dash, or
CRLF/dash.
What to do?
Below is my schema. I am getting this error:
[error] Parse Error: Failed to populate field[1]. Cause: Parse Error: All
choice alternatives failed. Reason(s): List(Parse Error: Alternative failed.
Reason(s): List(Parse Error: Found out of scope delimiter: ')' ')'
<xs:element name="input">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;" />
      <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="field" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="value" type="xs:string"
                      dfdl:lengthPattern=".*?(?=(\x0D-|\x0D\x0A-|-|$))" />
                  <xs:choice>
                    <xs:element name="crDash"
                        dfdl:initiator="%CR;-" type="xs:string" />
                    <xs:element name="crlfDash"
                        dfdl:initiator="%CR;%LF;-" type="xs:string" />
                    <xs:element name="dash"
                        dfdl:initiator="-" type="xs:string" />
                  </xs:choice>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
From: Beckerle, Mike <[email protected]>
Sent: Wednesday, April 10, 2019 8:49 AM
To: [email protected]
Subject: [EXT] Re: Bug in Daffodil?
I'm late in trying to interact with this thread, and I fear I may have missed
some messages.
But here goes....
DFDL does not capture and preserve which of multiple possible delimiters was
used. When there is more than one possible delimiter, the first one is
considered canonical and is used for unparsing, even though when parsing, the
longest match is used.
So if you have dfdl:terminator="- %CR;- %CR;%LF;-", then on unparsing you'll
always unparse these as just "-". No CR or CRLF will *ever* be output, regardless
of what was found when parsing.
That means if you round-trip data (parse then unparse) it will canonicalize the
delimiters. What you get out is considered a canonical form of the data. This
will usually NOT get back the exact same output as input, but you will get what
the DFDL schema specifies is an equivalent canonical form. If you parse this
again, the infoset you get should be the same as the infoset from the first
parse. This is what we call a "twoPass" round trip.
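To make the canonicalization concrete, here is a minimal sketch. The element
name "item" is hypothetical, and the fragment assumes an enclosing dfdl:format
that supplies the encoding and the other required properties:

```xml
<!-- Hypothetical element; assumes default format properties are defined elsewhere. -->
<xs:element name="item" type="xs:string"
            dfdl:lengthKind="delimited"
            dfdl:terminator="- %CR;- %CR;%LF;-" />
<!-- Parse:   data "abc" CR LF "-"  gives <item>abc</item> (longest match: %CR;%LF;-) -->
<!-- Unparse: <item>abc</item>      gives "abc-"           (first listed terminator)  -->
<!-- Reparse: "abc-"                gives <item>abc</item> again, the same infoset    -->
```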
If that isn't the behavior you want, because it is significant and important in
the format exactly which delimiters were used, then the delimiters are not just
delimiters. They are carrying some additional information/significance that
must be captured by the Infoset in order for the DFDL schema to accurately
represent the information content.
To do that you must do what I call modeling syntax as data. That is, you must
capture the specific delimiters in elements so that the significance of which
specific delimiter was used is captured in the infoset.
Suppose your delimiter is either "CR-", "CRLF-", or just "-".
To parse an element delimited by this and capture which delimiter specifically
was found, you must use dfdl:lengthKind='pattern' and regular expressions with
lookahead:
<element name="foo" dfdl:lengthPattern=".*?(?=(\x0D-|\x0D\x0A-|-))" ....>
This matches text up to, but not including, one of the CR-, CRLF-, or just -
patterns, using the regex lookahead feature.
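Spelled out a little more fully, such an element might look like the sketch
below. The xs:string type is an assumption, as is an enclosing dfdl:format that
provides the encoding. Note that CR is 0x0D and LF is 0x0A, so CRLF is written
\x0D\x0A in the pattern:

```xml
<!-- Sketch only: type and surrounding format defaults are assumptions. -->
<xs:element name="foo" type="xs:string"
            dfdl:lengthKind="pattern"
            dfdl:lengthPattern=".*?(?=(\x0D-|\x0D\x0A-|-))" />
<!-- The lazy .*? stops at the first position where the lookahead sees
     CR-, CRLF-, or a bare dash. The delimiter itself is not consumed,
     so it is left in the data for whatever element comes next. -->
```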
This element is then followed by
<choice>
<element name="crDash" dfdl:initiator="%CR;-" ..../>
<element name="crlfDash" dfdl:initiator="%CR;%LF;-" .../>
<element name="dash" dfdl:initiator="-" .../>
</choice>
In each choice branch above, the element is a string of explicit length 0.
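One way the zero-length branches might be written out is sketched below. The
explicit-length properties follow the description above; everything else
(the xs:string type, and an enclosing dfdl:format that defines lengthUnits,
encoding, and so on) is an assumption, and other properties may be needed
depending on that format:

```xml
<xs:choice>
  <!-- Each branch consumes only its initiator; the string value itself is empty. -->
  <xs:element name="crDash" type="xs:string" dfdl:initiator="%CR;-"
              dfdl:lengthKind="explicit" dfdl:length="0" />
  <xs:element name="crlfDash" type="xs:string" dfdl:initiator="%CR;%LF;-"
              dfdl:lengthKind="explicit" dfdl:length="0" />
  <xs:element name="dash" type="xs:string" dfdl:initiator="-"
              dfdl:lengthKind="explicit" dfdl:length="0" />
</xs:choice>
```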
This will parse, and unparse just fine. You'll get infosets like
<foo>contents</foo><crlfDash/>
That <crlfDash/> element indicates which delimiter specifically was found and
should be laid down after the <foo>contents</foo> when unparsing.
The above technique will not run into the DAFFODIL-1559 bug, because the CR
characters are never brought into the XML Infoset, so are never converted into
LF.
Note that you cannot put the choice above into a hidden group so as to hide
this delimiter cruft, because then that information would be lost and
unavailable for unparsing.
I hope that helps.
________________________________
From: Costello, Roger L. <[email protected]>
Sent: Monday, April 8, 2019 7:41 AM
To: [email protected]
Subject: Re: Bug in Daffodil?
Thanks Steve.
Is there a workaround? I need the output of unparsing to exactly match the
original input.
/Roger
-----Original Message-----
From: Steve Lawrence <[email protected]>
Sent: Friday, April 5, 2019 10:45 AM
To: [email protected]; Costello, Roger L. <[email protected]>
Subject: [EXT] Re: Bug in Daffodil?
This is actually the expected behavior, though it's maybe not always desired.
The issue here is that XML is not allowed to contain CRs; only LFs are
allowed. So when we output infoset data, all CRLFs are converted to LF, and
any lone CRs are also converted to LF. Unfortunately, if your data fields
contain a CR, it's going to get lost. In a lot of cases this is fine, since
lots of formats don't care about CRLF vs LF. But there are definitely some
places where it matters.
DAFFODIL-1559 [1] is the issue for allowing this behavior to be changed. One
option would be to convert CR characters in the data to a private use area
character, like we do with other illegal XML characters, but that makes the
infoset less useful.
Another option might be to say that whenever an LF appears in the data, we just
always unparse it as a CRLF. This means if your data mixes CRLF and LF, we'd
always output CRLF, but that's probably not a big deal if mixing is allowed in
the format.
- Steve
[1] https://issues.apache.org/jira/browse/DAFFODIL-1559
On 4/5/19 9:25 AM, Costello, Roger L. wrote:
> Hello DFDL community,
>
> My input file consists of a prolog of known format and a payload
> surrounded by parentheses. The payload consists of a series of text
> fields separated by hyphens. In some cases, the hyphen can be preceded
> by a new line, which can be a carriage return or CRLF combination.
>
> Here is a sample input file; I show it in a hex editor so you can see
> that some hyphens are preceded by CRLF and others by just a CR.
>
> Here is my DFDL schema:
>
> <xs:element name="input">
>   <xs:complexType>
>     <xs:sequence>
>       <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;" />
>       <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
>         <xs:complexType>
>           <xs:sequence dfdl:separator="-" dfdl:separatorPosition="infix">
>             <xs:element name="field" type="xs:string" maxOccurs="unbounded" />
>           </xs:sequence>
>         </xs:complexType>
>       </xs:element>
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>
>
> When I parse the input file using the DFDL schema, I get this XML:
>
> <input>
> <prolog>PROLOG</prolog>
> <payload>
> <field>A</field>
> <field>B</field>
> <field>C
> </field>
> <field>D</field>
> <field>E
> </field>
> <field>F</field>
> </payload>
> </input>
>
> That's perfect.
>
> When I unparse the XML I get this (please note the bug (?) described in
> yellow):
>