Incorrect delimiter scanning when mixed encodings?

Steve Lawrence Thu, 30 Apr 2020 08:15:30 -0700

Say we have a schema like this:

  <xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">


    <xs:include
      schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

    <xs:annotation>
      <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:format ref="GeneralFormat" lengthKind="delimited"
          encoding="ISO-8859-1" />
      </xs:appinfo>
    </xs:annotation>

    <xs:element name="root" dfdl:terminator="§">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="name" type="xs:string" />
        </xs:sequence>
      </xs:complexType>
    </xs:element>

  </xs:schema>

So we have a format that is all ISO-8859-1, and a delimited string
called "name", and the root is terminated by "§" in the ISO-8859-1
encoding. If we have data that looks like this:

  text§

It will parse to this:

  <root>
    <name>text</name>
  </root>

Now say we want just the "name" element to have a different encoding, so
we change it to this:

  <xs:element name="name" type="xs:string" dfdl:encoding="US-ASCII" />

Now the terminator defined on the root element is in a different
encoding than the delimited element. Note that the terminator § isn't
even valid in this encoding.

Currently, Daffodil does not successfully parse this. It scans the data,
decoding a single character at a time looking for a delimiter.
Eventually it gets to the § in the data, the decoder says it's not valid
in our ASCII encoding, and it converts it to the Unicode replacement
character. This of course doesn't match the delimiter we're looking for,
so scanning continues. The delimiter scanner then hits the end of data
and errors because it never finds the root terminator.
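
To make the failure concrete, here is a small Python sketch of the same
effect. It is only an illustration, not Daffodil code, and it assumes the
decoder substitutes U+FFFD for undecodable bytes (Python's "replace"
error handler stands in for that here):

```python
# Illustration only; this is not how Daffodil is implemented.
# The data is "text§" written out in ISO-8859-1, so § is the single byte 0xA7.
data = "text§".encode("iso-8859-1")
terminator = "§"

# Decode as US-ASCII. Byte 0xA7 is not valid ASCII, so the "replace"
# handler substitutes the Unicode replacement character U+FFFD.
decoded = data.decode("ascii", errors="replace")

# The decoded text ends in U+FFFD rather than §, so a character-level
# comparison against the terminator can never match.
found = terminator in decoded
print(repr(decoded), found)
```

The terminator byte is present in the data, but once it has been decoded
with the element's encoding it no longer looks like the terminator.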

Is this the correct behavior, or is our delimiter scanning fundamentally
broken?

I wonder if the correct behavior is that when the terminator comes into
scope, we immediately encode it into its bytes. Delimiter scanning would
then only look for these bytes and not actually decode any data. Only
once bytes that match a delimiter are found would we decode all the
bytes up until that point.
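
A rough Python sketch of that idea, just to show the shape of it. The
function name scan_delimited and its parameters are made up for this
example, and it ignores real concerns such as multiple delimiters in
scope, escape schemes, and encodings where a plain byte search could
match in the middle of a multi-byte character:

```python
from typing import Optional, Tuple

def scan_delimited(data: bytes, terminator: str,
                   terminator_encoding: str,
                   field_encoding: str) -> Optional[Tuple[str, int]]:
    """Hypothetical byte-level scan: returns (field text, bytes consumed)."""
    # Encode the terminator once, using the encoding in effect where it
    # was defined (the root element in the example schema).
    term_bytes = terminator.encode(terminator_encoding)

    # Search the raw bytes; nothing is decoded during the scan.
    idx = data.find(term_bytes)
    if idx < 0:
        return None  # terminator never found

    # Only now decode the bytes before the terminator, using the
    # encoding of the delimited element itself.
    field = data[:idx].decode(field_encoding)
    return field, idx + len(term_bytes)

result = scan_delimited(b"text\xa7", "§", "iso-8859-1", "ascii")
print(result)
```

With the example data this finds the 0xA7 terminator byte and decodes
only the four bytes before it as US-ASCII, which is the result the
schema author would presumably expect.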

Is this reasonable, or is this type of thing just not allowed?


Note that this is somewhat hypothetical. I don't know of any formats
that mix encodings like this, but this popped into my head while looking
at DAFFODIL-2323, which complains about the encoding property and a
sequence, and which can have a similar issue if the sequence encoding
differs from the element encodings.
