Re: Incorrect delimiter scanning when mixed encodings?

Steve Lawrence Thu, 30 Apr 2020 12:47:24 -0700

As Brandon points out, this might cause issues for encodings that are
not byte sized? I assume mandatory text alignment (MTA) applies to these
as well, which I think deals with such issues. And I assume MTA still
applies even with raw byte entities.


Seems like the algorithm is something like:

  1) Set a mark
  2) Apply delimiter MTA
  3) Check if delimiter bits match current bits
  4) If no match, decode a single char in the element encoding and record it
  5) Repeat from 1
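
To make those steps concrete, here is a minimal Python sketch. It is only
an illustration under some assumptions: both encodings are byte aligned
(so the delimiter MTA step is a no-op), and the delimiter has already been
lowered to bytes in its own encoding. The names scan_for_delimiter and
decode_one_char are made up for this sketch and are not Daffodil code.

```python
import codecs

def decode_one_char(data: bytes, start: int, encoding: str):
    """Decode one character at `start`; return (char, bytes_consumed)."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for i in range(start, len(data)):
        # Feed one byte at a time until the decoder emits something.
        text = decoder.decode(data[i:i + 1])
        if text:
            return text, i + 1 - start
    # Data ended mid-character: flush the incomplete tail.
    return decoder.decode(b"", final=True), len(data) - start

def scan_for_delimiter(data: bytes, delimiter: bytes, element_encoding: str):
    """Return (decoded_text, delimiter_offset), offset is None if not found."""
    pos = 0
    chars = []
    while pos < len(data):
        mark = pos                      # 1) set a mark
        # 2) delimiter MTA: assumed to be a no-op (byte-aligned encodings)
        if data.startswith(delimiter, mark):
            return "".join(chars), mark  # 3) delimiter bytes match here
        # 4) no match: decode one char in the element encoding and record it
        ch, consumed = decode_one_char(data, mark, element_encoding)
        chars.append(ch)
        pos = mark + consumed           # 5) repeat from 1
    return "".join(chars), None

# Root terminator "§" in ISO-8859-1, "name" element decoded as US-ASCII.
data = "text§".encode("iso-8859-1")
print(scan_for_delimiter(data, "§".encode("iso-8859-1"), "ascii"))
# ('text', 4)
```

Because the delimiter check is a byte comparison, the 0xA7 byte is never
handed to the ASCII decoder, so it never becomes a replacement character.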

Re: escape characters, I think escape characters are always in the
encoding of the element to which they are applied, since escapeSchemes
do not include a dfdl:encoding property?

That requires extra logic in the above, maybe even attempting the decode
before checking delimiters, since an escape char effectively disables
delimiter scanning, but I don't think it's anything too crazy.
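
A rough Python sketch of that "decode first" ordering, purely as an
illustration. It assumes byte-aligned encodings, a single escape
character in the element's encoding, and that the escaped item is one
character in the element's encoding. The names scan_with_escape and
decode_one_char are hypothetical, not Daffodil code.

```python
import codecs

def decode_one_char(data: bytes, start: int, encoding: str):
    """Decode one character at `start`; return (char, bytes_consumed)."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for i in range(start, len(data)):
        text = decoder.decode(data[i:i + 1])
        if text:
            return text, i + 1 - start
    # Data ended mid-character: flush the incomplete tail.
    return decoder.decode(b"", final=True), len(data) - start

def scan_with_escape(data: bytes, delimiter: bytes,
                     element_encoding: str, escape_char: str):
    """Return (decoded_text, delimiter_offset), offset is None if not found."""
    pos = 0
    chars = []
    while pos < len(data):
        # Decode in the element encoding first, to see if this is an escape.
        ch, consumed = decode_one_char(data, pos, element_encoding)
        if ch == escape_char:
            pos += consumed
            if pos < len(data):
                # The escaped char is taken literally; no delimiter check.
                literal, consumed = decode_one_char(data, pos, element_encoding)
                chars.append(literal)
                pos += consumed
            continue
        # Not an escape: now compare the delimiter's bytes at this position.
        if data.startswith(delimiter, pos):
            return "".join(chars), pos
        chars.append(ch)
        pos += consumed
    return "".join(chars), None

print(scan_with_escape(b"a\\,b,", b",", "ascii", "\\"))
# ('a,b', 4)
```

One open question the sketch glosses over: if the delimiter's bytes start
with the same byte as the escape character, the escape wins here. That is
just a choice made for the sketch.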

I wonder if some of the DFA complexity goes away?



On 4/30/20 2:47 PM, Beckerle, Mike wrote:
> 
> The encoding for the delimiter is the encoding in effect on the schema 
> component carrying the property. Making them take on contextual encodings 
> makes things much too complicated.
> 
> So yeah, I think in your case, if we're scanning for that "§" but we're using 
> a decoder for ASCII, that's incorrect.
> 
> These mixed encoding cases are all corner cases anyway, so they don't have to 
> be natural or easy. The rules simply have to be easy to interpret.
> 
> So your root element defines a terminator.
> 
> That terminator's encoding has *nothing* to do with the encoding specified 
> for a contained element within root. It is not that contained element's 
> terminator, it is the root's terminator.
> 
> The semantics of delimiter scanning in DFDL is in fact something that 
> requires lowering the delimiters to byte patterns. This is required based on 
> mixed scenarios like this, but also based on features like Byte-Value 
> entities e.g., %#rHH; which specifies a hex byte that can appear, even in the 
> middle of characters, when that byte makes no sense.
> 
> <element name="foo" type="xs:int" dfdl:terminator="11%#r88;99" 
> dfdl:encoding="utf-16BE"/>
> 
> So the terminator of the above is bytes 0031 0031 88 0039 0039.  See how that 
> 88 is just thrown in there. Makes no sense in ANY encoding. We're even 
> screwing up the character alignment here.
> 
> That means scanning for delimiters in DFDL requires us to lower the scanning 
> to bytes.
> 
> Of course Daffodil doesn't implement %#rHH; byte-value (aka raw-bytes) 
> entities except for one special case which is specifying the fill byte 
> property.  And our scanning is currently character oriented.
> 
> So, what does it mean to scan for say a UTF-16 character '1' as terminator of 
> an element that is in say, ASCII ?
> 
> It means you are searching through the bytes ignoring ASCII, decoding it as 
> UTF-16, looking for '1' (which is bytes 00 31).  Then having found a 0031, 
> the preceding bytes are then decoded as ASCII.
> 
> Pretty sure Daffodil scanning isn't doing that.
> 
> 
> ________________________________
> From: Sloane, Brandon <bslo...@tresys.com>
> Sent: Thursday, April 30, 2020 12:01 PM
> To: dev@daffodil.apache.org <dev@daffodil.apache.org>
> Subject: Re: Incorrect delimiter scanning when mixed encodings?
> 
> Without looking at the spec, I would expect that delimiters be defined by the 
> encoding of the element that defines the delimiter; so Daffodil is buggy in 
> the case you describe. However, there are a couple of complications we have 
> to consider:
> 
> 1) What if instead of a terminator, we had a separator; and the separator is 
> a valid character in both encodings, but has a different byte representation?
> 
>  <xs:element name="root">
>    <xs:complexType>
>      <xs:sequence dfdl:separator=",">
>        <xs:element name="name" type="xs:string" maxOccurs="2"
>          dfdl:encoding="FOO"/>
>        <xs:element name="address" type="xs:string" maxOccurs="2"
>          dfdl:encoding="BAR"/>
>      </xs:sequence>
>    </xs:complexType>
>  </xs:element>
> 
> In this case, I would expect the separator to be interpreted based on the 
> encoding of the individual elements, which is obviously not consistent with 
> my expectation from your example. There is also the instance of the separator 
> occurring between the two element types. So even in this case my naive 
> expectation is not consistent.
> The correct answer here is probably to say that this example schema is wrong, 
> and there should be 2 sequences, each defining their own separator.
> 
> 2) What if the encodings have a different alignment? For instance, if the 
> outer encoding that defines the delimiter is 8-bit and byte aligned, with a 
> 7-bit inner encoding, should we look forward to the next byte boundary after 
> every 7-bit character?
> 
> 3) How does this interact with escape sequences?
> 
> The solution here might be to think through some restrictions on where 
> encoding changes are allowed to occur. I am not sure it is possible to give 
> reasonable semantics for everything over a region that spans multiple 
> encodings.
> ________________________________
> From: Steve Lawrence <slawre...@apache.org>
> Sent: Thursday, April 30, 2020 11:15 AM
> To: dev@daffodil.apache.org <dev@daffodil.apache.org>
> Subject: Incorrect delimiter scanning when mixed encodings?
> 
> Say we have a schema like this:
> 
>   <xs:schema
>     xmlns:xs="http://www.w3.org/2001/XMLSchema"
>     xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
> 
>     <xs:include
> schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />
> 
>     <xs:annotation>
>       <xs:appinfo source="http://www.ogf.org/dfdl/">
>         <dfdl:format ref="GeneralFormat" lengthKind="delimited"
>           encoding="ISO-8859-1" />
>       </xs:appinfo>
>     </xs:annotation>
> 
>      <xs:element name="root" dfdl:terminator="§">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:element name="name" type="xs:string" />
>          </xs:sequence>
>        </xs:complexType>
>     </xs:element>
> 
>   </xs:schema>
> 
> So we have a format that is all ISO-8859-1, and a delimited string
> called "name", and the root is terminated by "§" in the ISO-8859-1
> encoding. If we have data that looks like this:
> 
>   text§
> 
> It will parse to this:
> 
>   <root>
>     <name>text</name>
>   </root>
> 
> Now say we want just the "name" element to have a different encoding, so
> we change it to this:
> 
>   <xs:element name="name" type="xs:string" dfdl:encoding="US-ASCII" />
> 
> Now the terminator defined on the root element is in a different
> encoding than the delimited element. Note that the terminator § isn't
> even valid in this encoding.
> 
> Currently, Daffodil does not successfully parse this. It scans the data,
> decoding a single character at a time looking for a delimiter.
> Eventually it gets to the § data, the decoder says it's not valid in our
> ASCII encoding and converts it to the Unicode replacement character.
> This of course doesn't match the delimiter we're looking for, so scanning
> continues on. The delimiter scanner then hits the end of data, and errors
> when it never finds the root terminator.
> 
> Is this the correct behavior, or is our delimiter scanning fundamentally
> broken?
> 
> I wonder if the correct behavior is when the terminator comes into scope
> we should immediately encode it into its bytes. Delimiter scanning only
> looks for these bytes and doesn't actually decode any data. And only
> when bytes that match a delimiter are found do we decode all the bytes
> up until that point?
> 
> Is this reasonable, or is this type of thing just not allowed?
> 
> 
> Note that this is somewhat hypothetical. I don't know of any formats
> that mix encodings like this, but this popped into my head looking at
> DAFFODIL-2323 which complains about encoding property and a sequence,
> which can have a similar issue if the sequence encoding differs from
> element encodings.
>
