As Brandon points out, this might cause problems for encodings whose characters are not a whole number of bytes. I assume mandatory text alignment (MTA) applies to these as well, which I think deals with such issues. And I assume MTA still applies even with raw byte entities.
Seems like the algorithm is something like:

1) Set a mark
2) Apply delimiter MTA
3) Check if delimiter bits match current bits
4) If not match, decode a single char in element encoding and record
5) Repeat from 1

Re: escape characters, I think escape characters are always in the
encoding of the element to which they are applied, since escapeSchemes do
not include a dfdl:encoding property? That requires extra logic in the
above, maybe even attempting the decode before checking delimiters, since
an escape char effectively disables delimiter scanning, but I don't think
anything too crazy. I wonder if some of the DFA complexity goes away?

On 4/30/20 2:47 PM, Beckerle, Mike wrote:
>
> The encoding for the delimiter is the encoding in effect on the schema
> component carrying the property. Making them take on contextual encodings
> makes things much too complicated.
>
> So yeah, I think in your case, if we're scanning for that "§" but we're using
> a decoder for ASCII, that's incorrect.
>
> These mixed encoding cases are all corner cases anyway, so they don't have to
> be natural or easy. The rules simply have to be easy to interpret.
>
> So your root element defines a terminator.
>
> That terminator's encoding has *nothing* to do with the encoding specified
> for a contained element within root. It is not that contained element's
> terminator, it is the root's terminator.
>
> The semantics of delimiter scanning in DFDL is in fact something that
> requires lowering the delimiters to byte patterns. This is required based on
> mixed scenarios like this, but also based on features like Byte-Value
> entities e.g., %#rHH; which specifies a hex byte that can appear, even in the
> middle of characters, when that byte makes no sense.
>
> <element name="foo" type="xs:int" dfdl:terminator="11%#r88;99"
> dfdl:encoding="utf-16BE"/>
>
> So the terminator of the above is bytes 0031 0031 88 0039 0039. See how that
> 88 is just thrown in there. Makes no sense in ANY encoding.
> We're even screwing up the character alignment here.
>
> That means scanning for delimiters in DFDL requires us to lower the scanning
> to bytes.
>
> Of course Daffodil doesn't implement %#rHH; byte-value (aka raw-bytes)
> entities except for one special case which is specifying the fill byte
> property. And our scanning is currently character oriented.
>
> So, what does it mean to scan for say a UTF-16 character '1' as terminator of
> an element that is in say, ASCII?
>
> It means you are searching through the bytes ignoring ASCII, decoding it as
> UTF-16, looking for '1' (which is bytes 00 31). Then having found a 0031,
> the preceding bytes are then decoded as ASCII.
>
> Pretty sure Daffodil scanning isn't doing that.
>
>
> ________________________________
> From: Sloane, Brandon <bslo...@tresys.com>
> Sent: Thursday, April 30, 2020 12:01 PM
> To: dev@daffodil.apache.org <dev@daffodil.apache.org>
> Subject: Re: Incorrect delimiter scanning when mixed encodings?
>
> Without looking at the spec, I would expect that delimiters be defined by the
> encoding of the element that defines the delimiter; so Daffodil is buggy in
> the case you describe. However, there are a couple of complications we have
> to consider:
>
> 1) What if instead of a terminator, we had a separator; and the separator is
> a valid character in both encodings, but has a different byte representation
>
> <xs:element name="root">
>   <xs:complexType>
>     <xs:sequence dfdl:separator=",">
>       <xs:element name="name" type="xs:string" maxOccurs="2"
>         encoding="FOO"/>
>       <xs:element name="address" type="xs:string" maxOccurs="2"
>         encoding="BAR"/>
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>
>
> In this case, I would expect the separator to be interpreted based on the
> encoding of the individual elements, which is obviously not consistent with
> my expectation from your example. There is also the instance of the separator
> occurring between the two element types.
> So even in this case my naive expectation is not consistent.
> The correct answer here is probably to say that this example schema is wrong,
> and there should be 2 sequences, each defining their own separator.
>
> 2) What if the encodings have a different alignment? For instance, if the
> outer encoding that defines the delimiter is 8-bit and byte aligned, with a
> 7-bit inner encoding, should we look forward to the next byte boundary after
> every 7-bit character?
>
> 3) How does this interact with escape sequences?
>
> The solution here might be to think through some restrictions on where
> encoding changes are allowed to occur. I am not sure it is possible to give
> reasonable semantics for everything over a region that spans multiple
> encodings.
> ________________________________
> From: Steve Lawrence <slawre...@apache.org>
> Sent: Thursday, April 30, 2020 11:15 AM
> To: dev@daffodil.apache.org <dev@daffodil.apache.org>
> Subject: Incorrect delimiter scanning when mixed encodings?
>
> Say we have a schema like this:
>
> <xs:schema
>   xmlns:xs="http://www.w3.org/2001/XMLSchema"
>   xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
>
>   <xs:include
>     schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />
>
>   <xs:annotation>
>     <xs:appinfo source="http://www.ogf.org/dfdl/">
>       <dfdl:format ref="GeneralFormat" lengthKind="delimited"
>         encoding="ISO-8859-1" />
>     </xs:appinfo>
>   </xs:annotation>
>
>   <xs:element name="root" dfdl:terminator="§">
>     <xs:complexType>
>       <xs:sequence>
>         <xs:element name="name" type="xs:string" />
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
>
> </xs:schema>
>
> So we have a format that is all ISO-8859-1, and a delimited string
> called "name", and the root is terminated by "§" in the ISO-8859-1
> encoding.
> If we have data that looks like this:
>
> text§
>
> It will parse to this:
>
> <root>
>   <name>text</name>
> </root>
>
> Now say we want just the "name" element to have a different encoding, so
> we change it to this:
>
> <xs:element name="name" type="xs:string" dfdl:encoding="US-ASCII" />
>
> Now the terminator defined on the root element is in a different
> encoding than the delimited element. Note that the terminator § isn't
> even valid in this encoding.
>
> Currently, Daffodil does not successfully parse this. It scans the data,
> decoding a single character at a time looking for a delimiter.
> Eventually it gets to the § data, the decoder says it's not valid in our
> ASCII encoding and converts it to the Unicode replacement character.
> This of course doesn't match the delimiter we're looking for, so we
> continue on. The delimiter scanner then hits the end of data, and errors
> when it never finds the root terminator.
>
> Is this the correct behavior, or is our delimiter scanning fundamentally
> broken?
>
> I wonder if the correct behavior is when the terminator comes into scope
> we should immediately encode it into its bytes. Delimiter scanning only
> looks for these bytes and doesn't actually decode any data. And only
> when bytes that match a delimiter are found do we decode all the bytes
> up until that point?
>
> Is this reasonable, or is this type of thing just not allowed?
>
>
> Note that this is somewhat hypothetical. I don't know of any formats
> that mix encodings like this, but this popped into my head looking at
> DAFFODIL-2323 which complains about encoding property and a sequence,
> which can have a similar issue if the sequence encoding differs from
> element encodings.
>
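
To make the byte-oriented idea from this thread concrete, here is a minimal
Python sketch. It lowers the delimiter to bytes once, using the encoding of
the component that defines it and passing %#rHH; entities through as raw
bytes, then walks the data: check for the delimiter bytes at the current
position, otherwise decode one character in the element's encoding and
repeat. The names lower_delimiter and scan_delimited are made up for
illustration and are not Daffodil APIs. It assumes byte-aligned encodings
only, so it does not model mandatory text alignment for sub-byte encodings,
and it ignores escape schemes entirely.

```python
import codecs
import re
from typing import Optional, Tuple

# Matches DFDL raw byte-value entities such as %#r88;
RAW_BYTE = re.compile(r"%#r([0-9A-Fa-f]{2});")


def lower_delimiter(spec: str, encoding: str) -> bytes:
    """Lower a delimiter to bytes in the encoding of its defining component.

    Literal text is encoded with `encoding`; %#rHH; entities are inserted
    as single raw bytes regardless of the encoding.
    """
    out = bytearray()
    pos = 0
    for m in RAW_BYTE.finditer(spec):
        out += spec[pos:m.start()].encode(encoding)
        out.append(int(m.group(1), 16))
        pos = m.end()
    out += spec[pos:].encode(encoding)
    return bytes(out)


def scan_delimited(
    data: bytes, delim: bytes, element_encoding: str
) -> Optional[Tuple[str, int]]:
    """Return (decoded text, position after delimiter), or None if not found.

    Assumption: byte-aligned data only; no escape scheme handling.
    """
    if not delim:
        raise ValueError("delimiter must not be empty")
    decoder = codecs.getincrementaldecoder(element_encoding)(errors="replace")
    chars = []
    pos = 0
    while pos < len(data):
        # Compare raw bytes first; the element's decoder is never asked
        # to interpret the delimiter itself.
        if data.startswith(delim, pos):
            return "".join(chars), pos + len(delim)
        # No match: decode exactly one character in the element encoding.
        ch = ""
        while not ch and pos < len(data):
            ch = decoder.decode(data[pos:pos + 1])
            pos += 1
        chars.append(ch)
    return None


# The example from the original message: an ISO-8859-1 terminator on root,
# with the contained "name" element decoded as US-ASCII.
delim = lower_delimiter("§", "iso-8859-1")
print(scan_delimited("text§".encode("iso-8859-1"), delim, "us-ascii"))
```

With this approach the § byte is matched before the ASCII decoder ever sees
it, so the replacement-character problem described in the original message
does not come up in this simplified setting.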