Thanks Mike,
With a little massaging, I was able to get this concept working for me.

From: Mike Beckerle <[email protected]>
Sent: Monday, May 20, 2024 3:58 PM
To: [email protected]
Subject: Re: String Parsing Question

Here is a workaround. The only thing I can think of is this:

Any string that can contain CR must be modeled as an array of strings separated 
by CR.

So instead of just a single string:

<xs:element name="name" type="xs:string"
  dfdl:lengthUnits="bytes" dfdl:lengthKind="explicit"
  dfdl:length="16" dfdl:encoding="US-ASCII" dfdl:alignmentUnits="bytes" />

You need:

<xs:element name="name" dfdl:lengthKind="explicit" dfdl:length="16"
    type="prefix:StringWithCR" .../>

  <xs:complexType name="StringWithCR">
    <!--
      Converting text data to XML does not preserve CR characters.
      By treating CR as a delimiter between parts of a string, we can
      preserve the CR.
      -->
    <xs:sequence dfdl:separator="%CR;" dfdl:encoding="US-ASCII">
      <xs:element name="part" type="xs:string" maxOccurs="16"
          dfdl:lengthKind="delimited" dfdl:occursCountKind="implicit"
          dfdl:encoding="US-ASCII"/>
    </xs:sequence>
  </xs:complexType>

I don't feel it should be necessary. We should let users just say preserveCR.

Arguably, arrays of strings are more helpful in general than regular strings,
because they not only deal with the CR issue but can also help with the
too-long-string problem.
XML and XML parsers were not designed with megabyte+ sized strings in mind.
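
To illustrate why the array-of-parts model can round-trip a CR, here is a minimal Python sketch (not Daffodil code). It assumes the consumer of the infoset sees one string per "part" element and that the unparser writes the CR separator back between parts. The function names and the sample value are hypothetical, and DFDL details such as how empty or trailing parts are handled are ignored.

```python
CR = "\r"

def split_into_parts(value: str) -> list[str]:
    """Model of the parse side: CR acts as a separator between parts."""
    return value.split(CR)

def join_parts(parts: list[str]) -> str:
    """Model of the unparse side: the separator is written back between parts."""
    return CR.join(parts)

# Hypothetical sample value containing a lone CR.
original = "ABC\rDEF"
parts = split_into_parts(original)

# No part contains a CR, so XML line-ending normalization cannot alter it.
assert all(CR not in p for p in parts)
assert join_parts(parts) == original
```

Because the CR never appears inside any element's text, nothing on the XML side has the opportunity to turn it into LF.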

On Mon, May 20, 2024 at 3:42 PM Mike Beckerle <[email protected]> wrote:
Actually, is this a bug?

The page about XML Illegal characters says:

The legal character #xD (Carriage Return or CR) is mapped to #xA (Line Feed, or 
LF). The CR character is allowed in the textual representation of XML 
documents, but is always converted to LF in the XML Infoset. That is, it is 
read by XML processors, but CRLF is converted to just LF, and CR alone is 
converted to LF. Daffodil is in a sense a different 'reader' of data into the 
XML infoset, so to be consistent with XML we map CR and CRLF to LF.
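
The line-ending normalization described in that passage can be observed with any conforming XML parser. Below is a small sketch using Python's standard-library ElementTree (an assumption for illustration; it is unrelated to Daffodil's own code). The helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

def text_after_xml_parse(raw_text: str) -> str:
    """Wrap raw_text in an element, parse it, and return the element's text."""
    doc = "<name>" + raw_text + "</name>"
    return ET.fromstring(doc).text

# A literal CRLF and a lone CR are both normalized to LF by the parser.
normalized = text_after_xml_parse("a\r\nb\rc")

# A CR written as a character reference survives parsing.
preserved = text_after_xml_parse("a&#13;b")

assert "\r" not in normalized
assert preserved == "a\rb"
```

The character-reference form shows that XML can carry a CR, but only if the writer escapes it rather than emitting the raw byte.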

I get the rationale for this, but I had forgotten this rule, and I'm the one 
who made the changes to implement this rule.

But, given this rule, if you need the legal US-ASCII characters of a string to 
be preserved, including all the C0 control characters, including CR, how can 
that be achieved?
Any character that is not preserved by XML parsing needs to be preserved 
somehow.


On Mon, May 20, 2024 at 3:22 PM Mike Beckerle <[email protected]> wrote:
Yup. Looks like I introduced this bug when revising this code back in Jan 2023.

On Mon, May 20, 2024 at 3:10 PM Steve Lawrence <[email protected]> wrote:
It looks like our XMLTextInfosetOutputter uses remapXMLIllegalCharactersToPUA:

https://github.com/apache/daffodil/blob/main/daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/infoset/XMLTextInfosetOutputter.scala#L192

And that enables replaceCRWithLF, so it does not preserve CRLF or CR:

https://github.com/apache/daffodil/blob/main/daffodil-lib/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala#L159-L160

That might explain why 0D is not being preserved and is unparsed as 0A: it is
already 0A in the infoset.

I thought we used to preserve those, so this might be a regression? Looks like
maybe our infoset outputters should use xmlRemapperPreservingCR?
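
As a rough illustration of the difference between the two behaviors, here is a Python sketch of a remapper with a CR flag. This is not Daffodil's implementation: the function names are hypothetical, only the C0 controls are handled, and it assumes the "add 0xE000" PUA mapping described elsewhere in this thread, with tab and LF passed through unchanged.

```python
PUA_BASE = 0xE000
XML_LEGAL_C0 = {0x09, 0x0A}  # tab and LF are legal in XML and pass through

def remap_for_xml(text: str, replace_cr_with_lf: bool) -> str:
    """Map C0 controls into the PUA; CR becomes LF or U+E00D depending on the flag."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x0D:
            out.append("\n" if replace_cr_with_lf else chr(PUA_BASE + code))
        elif code < 0x20 and code not in XML_LEGAL_C0:
            out.append(chr(PUA_BASE + code))
        else:
            out.append(ch)
    return "".join(out)

def unmap_from_xml(text: str) -> str:
    """Inverse mapping: U+E000..U+E01F back to the C0 controls."""
    return "".join(
        chr(ord(ch) - PUA_BASE) if PUA_BASE <= ord(ch) < PUA_BASE + 0x20 else ch
        for ch in text
    )

# The five trailing bytes discussed in this thread.
tail = bytes([0x00, 0x0C, 0x0C, 0x0D, 0x01]).decode("ascii")

lossy = remap_for_xml(tail, replace_cr_with_lf=True)
preserving = remap_for_xml(tail, replace_cr_with_lf=False)

assert unmap_from_xml(preserving) == tail            # CR survives the round trip
assert unmap_from_xml(lossy) == "\x00\x0c\x0c\n\x01"  # CR comes back as LF
```

With the flag on, the CR is indistinguishable from an LF once it is in the infoset, so no inverse mapping can recover it. With the flag off, it travels as U+E00D and can be restored on unparse.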


On 2024-05-20 03:01 PM, Mike Beckerle wrote:
> Hmmm. Data is fixed length so %NL; and dfdl:outputNewLine property aren't 
> involved.
>
> XML doesn't preserve CR naturally. It converts CRLF or lone CR into LF. That's
> standard XML parser behavior, nothing to do with DFDL.
>
> Daffodil's XML conversion from the DFDL Infoset preserves CR by remapping them
> into the Unicode Private Use Area (PUA), which it does, roughly, by adding
> 0xE000 to the character code.
> This happens for all the C0 control characters, so starting at byte 11 you
> have 00 0C 0C 0D 01, and all 5 of those get remapped, to
> E000, E00C, E00C, E00D, E001 characters.  That explains all the EE 80 80
> (UTF-8 for E000) bytes you see in the XML text, which is UTF-8.
>
> (See the section "XML Illegal Characters" in
> https://daffodil.apache.org/infoset/)
>
> This is inverted on output. You should get back 00 0C 0C 0D 01.
>
> But in the UTF-8, you don't have an EE 80 8D which would be E00D, you have
> just 0A.
>
> Was this string of test data held in an XML file before the test? Somehow that
> 0D got converted to 0A before Daffodil ever saw the byte, because Daffodil
> would have created E00D from it, and the bytes would be EE 80 8D in the UTF-8.
>
> On Mon, May 20, 2024 at 2:35 PM Larry Barber <[email protected]> wrote:
>
>     I have a string parsing issue that I’ve replicated in a very simple schema
>     (attached) that is basically nothing but a 16-character string:
>
>            <xs:element name="TEST">
>                  <xs:complexType>
>                        <xs:sequence>
>                              <xs:element name="name" type="xs:string"
>                                  dfdl:lengthUnits="bytes" dfdl:lengthKind="explicit"
>                                  dfdl:length="16" dfdl:encoding="US-ASCII"
>                                  dfdl:alignmentUnits="bytes" />
>                        </xs:sequence>
>                  </xs:complexType>
>            </xs:element>
>
>     The input looks like this:
>
>     After parsing & unparsing, my output file looks like this:
>
>     The <CR> at location 0x0E has been transformed into a <LF>!
>
>     The infoset produced by the parse shows strangeness that I would not expect:
>
>     I’ve tried a variety of dfdl:encoding settings and get the same results with
>     US_ASCII, ASCII, and ISO-8859-1.
>
>     Maybe it’s somehow related to the outputNewLine="%CR;%LF;" or some other
>     obscure string setting that I’ve missed?
>
>     Daffodil version 3.7.0
>
