Re: unparsing is converting carriage return to line feed?

Mike Beckerle Mon, 02 Jun 2025 12:04:02 -0700

Parsing with iso-8859-1 preserves all bytes from native form into the DFDL
Infoset.
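
As a quick illustration of why ISO-8859-1 is lossless at this stage, here is a minimal Python sketch (plain Python, not Daffodil; the variable names are only for illustration). It relies on the fact that ISO-8859-1 maps each of the 256 byte values to exactly one character, so decoding and re-encoding returns the original bytes, CR included.

```python
# Every byte value 0x00-0xFF has exactly one ISO-8859-1 character.
raw = bytes(range(256))

# Decode to text (roughly what parsing does), then encode back
# (roughly what unparsing does).
text = raw.decode("iso-8859-1")
round_tripped = text.encode("iso-8859-1")

print(round_tripped == raw)  # True: no byte is lost, including 0x0D
```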

But ... then Daffodil is projecting the DFDL infoset into XML.

It is this XML conversion step that is causing the problem.

XML reading does not preserve CRLFs. On input, XML readers normalize CRLF to LF,
and a standalone CR to LF as well.
Your data has CR CR LF, so that becomes two LFs.
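
The normalization can be seen with any conforming XML parser. The sketch below uses Python's standard xml.etree.ElementTree rather than Daffodil. The element name file_string comes from the schema in this thread, while the "line1"/"line2" content is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content containing CR CR LF, like the reported data.
original = "line1\r\r\nline2"
doc = "<file_string>" + original + "</file_string>"

# The XML parser normalizes line endings before the application sees the text.
parsed = ET.fromstring(doc).text

print(repr(parsed))  # 'line1\n\nline2': CR -> LF and CRLF -> LF
```

Three bytes of line ending go in and two LFs come out, which matches the "one 0D dropped, one converted to 0A" symptom.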

(This is one of several reasons why, in hindsight, XML isn't a very good
data language. I.e., it's not just that it is verbose!)

For the illegal XML characters, we have no choice but to remap them into
the Unicode Private Use Area (PUA), as detailed here:
https://daffodil.apache.org/infoset/ (see heading "XML Illegal Characters").
CR is a different case: it isn't technically an "illegal character" in XML
data, so Daffodil really does need a "preserveCR" flag of some kind.

The workaround I have used and suggested in the past is to model a string
which can contain CR as an array of strings separated by CR.
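
To show why that workaround round-trips, here is a rough simulation in Python. It is not a DFDL schema and not Daffodil's API. The helper names to_xml and from_xml and the element name item are hypothetical. The idea is that CR never appears inside the XML text, because it is represented only by the boundaries between array elements.

```python
import xml.etree.ElementTree as ET

def to_xml(value: str) -> str:
    """Simulate parsing: split on CR so no CR ends up in XML text."""
    root = ET.Element("file")
    for piece in value.split("\r"):
        ET.SubElement(root, "item").text = piece
    return ET.tostring(root, encoding="unicode")

def from_xml(doc: str) -> str:
    """Simulate unparsing: rejoin the pieces with CR as the separator."""
    root = ET.fromstring(doc)
    # An empty element comes back with text None, so treat it as "".
    return "\r".join(item.text or "" for item in root.findall("item"))

original = "line1\r\r\nline2"  # hypothetical sample with CR CR LF
restored = from_xml(to_xml(original))

print(restored == original)
```

The LF is left inside the string pieces, which is safe because LF survives XML line-ending normalization unchanged.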

On Mon, Jun 2, 2025 at 2:29 PM Mark Kozak wrote:

> Hello folks,
>
>
>
> Section 11.2.3 of the documentation says that if I use the ISO-8859-1
> encoding, all bytes will be preserved.
>
> So I have a simple text file that has the following text, represented as
> hex:
>
>
>
> Using the following schema, I get the expected xml on parse
>
>
>
>   <element name="file">
>     <complexType>
>       <sequence>
>         <element name="file_string" type="xs:string"
>                  dfdl:lengthKind="delimited" dfdl:encoding="ISO-8859-1"/>
>       </sequence>
>     </complexType>
>   </element>
>
>
>
> But when unparsing, one 0D is dropped, and one is converted to 0A as shown
> below:
>
>
>
> What am I missing to actually preserve all bytes?
>
>
>
> Thanks,
>
> Mark
>
>
>
> Mark Kozak
>
> Director of Engineering
>
> Adeptus Cyber Solutions
>
> Adeptus-CS.com
>
>
>
