Without looking at the spec, I would expect delimiters to be defined by the encoding of the element that defines the delimiter, so Daffodil is buggy in the case you describe. However, there are a couple of complications we have to consider:

1) What if, instead of a terminator, we had a separator, and the separator is a valid character in both encodings but has a different byte representation in each?

    <xs:element name="root">
      <xs:complexType>
        <xs:sequence dfdl:separator=",">
          <xs:element name="name" type="xs:string" maxOccurs="2" dfdl:encoding="FOO"/>
          <xs:element name="address" type="xs:string" maxOccurs="2" dfdl:encoding="BAR"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

In this case, I would expect the separator to be interpreted based on the encoding of the individual elements, which is obviously not consistent with my expectation from your example. There is also the case of the separator occurring between the two element types. So even in this case my naive expectation is not consistent. The correct answer here is probably to say that this example schema is wrong, and there should be 2 sequences, each defining its own separator.

2) What if the encodings have a different alignment? For instance, if the outer encoding that defines the delimiter is 8-bit and byte-aligned, with a 7-bit inner encoding, should we look forward to the next byte boundary after every 7-bit character?

3) How does this interact with escape sequences?

The solution here might be to think through some restrictions on where encoding changes are allowed to occur. I am not sure it is possible to give reasonable semantics for everything over a region that spans multiple encodings.

________________________________
From: Steve Lawrence <slawre...@apache.org>
Sent: Thursday, April 30, 2020 11:15 AM
To: dev@daffodil.apache.org <dev@daffodil.apache.org>
Subject: Incorrect delimiter scanning when mixed encodings?

Say we have a schema like this:

    <xs:schema
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

      <xs:include schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:format ref="GeneralFormat" lengthKind="delimited" encoding="ISO-8859-1" />
        </xs:appinfo>
      </xs:annotation>

      <xs:element name="root" dfdl:terminator="§">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" type="xs:string" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>

    </xs:schema>

So we have a format that is all ISO-8859-1, a delimited string called "name", and a root that is terminated by "§" in the ISO-8859-1 encoding. If we have data that looks like this:

    text§

It will parse to this:

    <root>
      <name>text</name>
    </root>

Now say we want just the "name" element to have a different encoding, so we change it to this:

    <xs:element name="name" type="xs:string" dfdl:encoding="US-ASCII" />

Now the terminator defined on the root element is in a different encoding than the delimited element. Note that the terminator § isn't even valid in this encoding.

Currently, Daffodil does not successfully parse this. It scans the data, decoding a single character at a time while looking for a delimiter. Eventually it gets to the § in the data, and the decoder says it's not valid in our ASCII encoding and converts it to the Unicode replacement character. This of course doesn't match the delimiter we're looking for, so scanning continues. The delimiter scanner then hits the end of the data and errors because it never finds the root terminator.

Is this the correct behavior, or is our delimiter scanning fundamentally broken?

I wonder if the correct behavior is that when the terminator comes into scope, we should immediately encode it into its bytes. Delimiter scanning would then only look for these bytes and not actually decode any data. Only when bytes that match a delimiter are found would we decode all the bytes up until that point.
Is this reasonable, or is this type of thing just not allowed?

Note that this is somewhat hypothetical. I don't know of any formats that mix encodings like this, but this popped into my head while looking at DAFFODIL-2323, which complains about the encoding property and a sequence, and which can have a similar issue if the sequence encoding differs from the element encodings.
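
To make the two behaviors discussed above concrete, here is a minimal sketch in Python. It is not Daffodil code: the function names are hypothetical, and the decoding scanner is simplified to decode the whole buffer at once rather than one character at a time. It only contrasts "decode in the element's encoding, then match characters" with "encode the terminator in the defining element's encoding, match bytes, then decode what came before".

```python
# Hypothetical helpers for illustration only; these are not Daffodil APIs.

TERMINATOR = "\u00a7"  # the "§" character from the example schema
DATA = ("text" + TERMINATOR).encode("iso-8859-1")  # b"text\xa7"


def scan_by_decoding(data, element_encoding, terminator):
    """Decode the data in the element's encoding, then look for the
    terminator as a character. Undecodable bytes become U+FFFD."""
    text = data.decode(element_encoding, errors="replace")
    idx = text.find(terminator)
    return None if idx < 0 else text[:idx]


def scan_by_bytes(data, element_encoding, terminator, terminator_encoding):
    """Encode the terminator once, in the encoding of the element that
    defines it, search the raw bytes, and decode only the content that
    precedes the match."""
    term_bytes = terminator.encode(terminator_encoding)
    idx = data.find(term_bytes)
    return None if idx < 0 else data[:idx].decode(element_encoding)


if __name__ == "__main__":
    # Terminator never matches: 0xA7 is not valid US-ASCII.
    print(scan_by_decoding(DATA, "us-ascii", TERMINATOR))             # None
    # Terminator bytes are found, and only "text" is decoded as ASCII.
    print(scan_by_bytes(DATA, "us-ascii", TERMINATOR, "iso-8859-1"))  # text
```

This sketch assumes byte-aligned data and does nothing about the complications raised in the reply: a plain byte search can produce false matches inside multi-byte characters, and it does not address non-byte-aligned encodings or escape sequences.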