Agreed, sounds like a bug with UTF-8 decoding, though I am unable to cause
any failures on UTF-8 with Daffodil 2.2.0.

Regarding the Ã?, under ISO-8859-1, each byte is decoded as a single
character. The byte 0xC3 is Ã, and 0x89 falls in the C1 control range, so
it has no printable representation and often just shows up as a question
mark when displayed. That explains why you see Ã? when using the
ISO-8859-1 encoding. However, I can't reproduce the 0x89 actually being
converted to 0x3F in the data. It almost seems like something doesn't
like the 0x89 (which seems reasonable since there is no visual
representation of it) and converts it to a question mark, which is 0x3F
and is a standard replacement character. I wonder if Windows is doing a
translation, since I don't see this problem on Linux...
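
To make the byte-level reasoning concrete, here is a minimal Python
sketch. It is not what Daffodil does internally. The re-encoding step
with windows-1252 (cp1252) is purely an assumption on my part, chosen
because it is a common Windows default and happens to reproduce the
C3 3F bytes from the report.

```python
# The two bytes that encode "É" in the original data (UTF-8).
raw = bytes([0xC3, 0x89])

# Decoded as UTF-8: one character, U+00C9.
as_utf8 = raw.decode("utf-8")

# Decoded as ISO-8859-1: two characters, U+00C3 ("Ã") and U+0089,
# a C1 control character with no printable glyph.
as_latin1 = raw.decode("iso-8859-1")

# Assumption: something later re-encodes the text as windows-1252.
# U+0089 has no mapping there, so it is replaced with "?" (0x3F).
reencoded = as_latin1.encode("cp1252", errors="replace")

print([hex(ord(c)) for c in as_utf8])    # code points under UTF-8
print([hex(ord(c)) for c in as_latin1])  # code points under ISO-8859-1
print(reencoded.hex())                   # bytes after re-encoding
```

If a step like that is happening somewhere on the Windows side (console,
editor, or default charset on output), it would account for both the
visible Ã? and the changed byte.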

- Steve

On 10/10/18 11:03 AM, Mike Beckerle wrote:
> Your data is definitely UTF-8, or C3 89 would not be the LATIN CAPITAL
> LETTER E WITH ACUTE.
> 
> 
> So using iso-8859-1 is going to do the wrong thing for sure.
> 
> 
> So let's figure out why your data fails to parse when specifying the correct 
> character set encoding, utf-8.
> 
> 
> Your hex bytes as presented are all valid Utf-8 according to this site:
> 
> http://www.endmemo.com/unicode/unicodeconverter.php
> 
> 
> So, maybe there's a utf-8 bug in daffodil?
> 
> 
> 
> 
> 
> 
> --------------------------------------------------------------------------------
> *From:* Costello, Roger L. <[email protected]>
> *Sent:* Wednesday, October 10, 2018 9:59:16 AM
> *To:* [email protected]
> *Subject:* Why does Daffodil change the binary of non-ASCII characters?
> 
> Hello DFDL community,
> 
> I have a binary file that contains, among other things, this text:
> 
> Nova Scotia / Nouvelle-Écosse
> 
> Its corresponding hex binary is this:
> 
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89
> 63 6F 73 73 65 20 …
> 
> I used this element declaration in my DFDL schema to parse that binary:
> 
> <xs:element name="NAME"
>             type="xs:string"
>             dfdl:length="93"
>             dfdl:lengthKind="explicit"
>             dfdl:lengthUnits="characters"
>             dfdl:textTrimKind="padChar"
>             dfdl:textStringPadCharacter="%SP;"
>             dfdl:textStringJustification="center"/>
> 
> Surprisingly, during parsing Daffodil modified the text to this:
> 
> Nova Scotia / Nouvelle-Ã?cosse
> 
> With this corresponding hex binary:
> 
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 3F
> 63 6F 73 73 65 20 …
> 
> The part in yellow changed -- from C3 89 (original) to C3 3F (after parsing).
> 
> Hex C3 89 corresponds to the É symbol whereas C3 3F is not a valid UTF-8
> byte sequence.
> 
> Why did Daffodil change the binary?
> 
> One other piece of the puzzle: in my DFDL schema I specify
> encoding="ISO-8859-1". For a reason I do not understand, when I
> specify encoding="utf-8" I get an error message on parse.
> 
> Please help!
> 
> /Roger
> 
