I was the person who passed the original problem on to Roger. I was able to
get a simple case to work by capturing the CRLF or LF combinations in hidden
groups, not treating them as separators or terminators.

Trying this in a more sophisticated DFDL schema (the actual problem has
other fields before and after the payload with the mixed linefeed
combinations) caused Daffodil to go into an infinite loop. That's where
I'm stuck.

Julian

---------------------------------------------------------------
Dr. Julian C. Lander
Lead Software Engineer
MITRE

Mail Stop M360
The MITRE Corporation
202 Burlington Road
Bedford, MA   01730-1420
781-271-4516 


-----Original Message-----
From: Costello, Roger L. <[email protected]> 
Sent: Monday, April 08, 2019 7:42 AM
To: [email protected]
Subject: Re: Bug in Daffodil?

Thanks Steve.

Is there a workaround? I need the output of unparsing to exactly match the
original input.

/Roger


-----Original Message-----
From: Steve Lawrence <[email protected]>
Sent: Friday, April 5, 2019 10:45 AM
To: [email protected]; Costello, Roger L. <[email protected]>
Subject: [EXT] Re: Bug in Daffodil?

This is actually the expected behavior, though it's maybe not always
desired.

The issue here is that XML is not allowed to contain CRs; only LFs are
allowed. So when we output infoset data, all CRLFs are converted to LF, and
any lone CRs are also converted to LF. Unfortunately, if your data fields
contain a CR, it's going to get lost. In a lot of cases this is fine, since
lots of formats don't care about CRLF vs LF. But there are definitely some
places where it matters.
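
As a rough illustration of the conversion described above, here is a small
Python sketch. It is not Daffodil code; the function name and the sample
string are made up for this example, and it only mimics the "CRLF and lone
CR become LF" rule to show why the original line endings cannot be recovered
afterwards.

```python
import re


def normalize_line_endings(text: str) -> str:
    """Replace each CRLF pair and each lone CR with a single LF."""
    return re.sub(r"\r\n|\r", "\n", text)


# Hypothetical payload: one field ends in CRLF, another in a lone CR.
original = "A-B-C\r\n-D-E\r-F"
infoset_text = normalize_line_endings(original)

print(repr(infoset_text))        # 'A-B-C\n-D-E\n-F'
print(infoset_text == original)  # False: the CRs are gone
```

Once both forms have collapsed to LF, nothing in the infoset records which
one was in the input, so unparsing has no way to reproduce it exactly.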

DAFFODIL-1559 [1] is the issue for allowing this behavior to be changed. One
option would be to convert CR characters in the data to the private use area
like we do with other illegal XML characters, but that makes the infoset
less useful. Another option might be to say that whenever an LF appears in
the data, we just always unparse it as a CRLF. This means if your data mixes
CRLF and LF, we'd always output CRLF, but that's probably not a big deal if
mixing is allowed in the format.
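
To make the trade-off between those two options concrete, here is a minimal
Python sketch. It is only an illustration, not how Daffodil implements
either idea: the 0xE000 offset, the function names, and the sample data are
all assumptions made up for this example.

```python
import re

# Assumed private-use-area offset, for illustration only.
PUA_OFFSET = 0xE000
CR = "\r"
PUA_CR = chr(PUA_OFFSET + ord(CR))  # U+E00D under the assumed offset


def normalize(text: str) -> str:
    """Current behavior as described: CRLF and lone CR both become LF."""
    return re.sub(r"\r\n|\r", "\n", text)


def cr_to_pua(text: str) -> str:
    """Option 1, parse side: hide each CR behind a private-use character."""
    return text.replace(CR, PUA_CR)


def pua_to_cr(text: str) -> str:
    """Option 1, unparse side: restore each CR from its private-use stand-in."""
    return text.replace(PUA_CR, CR)


def lf_to_crlf(text: str) -> str:
    """Option 2, unparse side: always write LF as CRLF."""
    return text.replace("\n", "\r\n")


# Hypothetical data mixing CRLF and bare LF.
data = "C\r\n-E\n-F"

# Option 1 round-trips exactly, but the infoset now carries U+E00D.
print(pua_to_cr(cr_to_pua(data)) == data)   # True

# Option 2 keeps the infoset clean, but mixed input comes back as all CRLF.
print(repr(lf_to_crlf(normalize(data))))    # 'C\r\n-E\r\n-F'
```

The first option preserves the bytes at the cost of a less readable infoset;
the second keeps the infoset readable but only matches the input when the
data used CRLF everywhere.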

- Steve

[1] https://issues.apache.org/jira/browse/DAFFODIL-1559

On 4/5/19 9:25 AM, Costello, Roger L. wrote:
> Hello DFDL community,
> 
> My input file consists of a prolog of known format and a payload 
> surrounded by parentheses. The payload consists of a series of text 
> fields separated by hyphens. In some cases, the hyphen can be preceded 
> by a new line, which can be a carriage return or CRLF combination.
> 
> Here is a sample input file; I show it in a hex editor so you can see 
> that some hyphens are preceded by CRLF and others by just a CR.
> 
> Here is my DFDL schema:
> 
> <xs:element name="input">
> <xs:complexType>
> <xs:sequence>
> <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;"/>
> <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
> <xs:complexType>
> <xs:sequence dfdl:separator="-" dfdl:separatorPosition="infix">
> <xs:element name="field" type="xs:string" maxOccurs="unbounded"/>
> </xs:sequence>
> </xs:complexType>
> </xs:element>
> </xs:sequence>
> </xs:complexType>
> </xs:element>
> 
> When I parse the input file using the DFDL schema, I get this XML:
> 
> <input>
> <prolog>PROLOG</prolog>
> <payload>
> <field>A</field>
> <field>B</field>
> <field>C
> </field>
> <field>D</field>
> <field>E
> </field>
> <field>F</field>
> </payload>
> </input>
> 
> That’s perfect.
> 
> When I unparse the XML I get this (please note the bug (?) described in
> yellow):
> 
