Your data is definitely UTF-8; otherwise C3 89 would not decode to É (U+00C9, 
LATIN CAPITAL LETTER E WITH ACUTE).


So using ISO-8859-1 is certain to do the wrong thing: it treats C3 and 89 as 
two separate characters instead of one.
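
To make the difference concrete, here is a minimal Python sketch (Python is my choice for illustration only; it is not part of Daffodil) that decodes the two bytes C3 89 from the hex dump under each encoding. The variable names are made up for this example.

```python
# The two bytes that follow 2D in the hex dump from the original message.
raw = bytes.fromhex("C3 89")

# UTF-8 reads C3 89 as a single two-byte sequence.
as_utf8 = raw.decode("utf-8")

# ISO-8859-1 maps every byte to exactly one character, so it yields two.
as_latin1 = raw.decode("iso-8859-1")

# Print the character count and the code points for each interpretation.
print(len(as_utf8), [f"U+{ord(c):04X}" for c in as_utf8])
print(len(as_latin1), [f"U+{ord(c):04X}" for c in as_latin1])
```

Under ISO-8859-1 the first byte becomes Ã (U+00C3) and the second becomes the control character U+0089, which matches the "Ã" seen in the parsed output. How the second character ended up as "?" is not something this sketch explains.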


So let's figure out why your data fails to parse when specifying the correct 
character set encoding, utf-8.


Your hex bytes as presented are all valid UTF-8 according to this site:

http://www.endmemo.com/unicode/unicodeconverter.php
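
The same check can be run locally without the web site. This is a rough sketch in Python that strictly decodes only the bytes shown in the message; the dump is cut off by "…", so the rest of the field is not covered. The names hex_dump, data and text are just for this example.

```python
# Hex bytes exactly as presented in the message, up to the elided "…".
hex_dump = (
    "4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63 "
    "6F 73 73 65 20"
)

data = bytes.fromhex(hex_dump)

# errors="strict" raises UnicodeDecodeError on any malformed UTF-8 sequence.
text = data.decode("utf-8", errors="strict")

print(repr(text))
print(len(data), "bytes,", len(text), "characters")
```

If the decode call returns without raising, the bytes shown are well-formed UTF-8. Note that the byte count and the character count differ by one here, because É takes two bytes.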


So, maybe there's a UTF-8 bug in Daffodil?






________________________________
From: Costello, Roger L. <[email protected]>
Sent: Wednesday, October 10, 2018 9:59:16 AM
To: [email protected]
Subject: Why does Daffodil change the binary of non-ASCII characters?


Hello DFDL community,



I have a binary file that contains, among other things, this text:



Nova Scotia / Nouvelle-Écosse



Its corresponding hex binary is this:



4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63 
6F 73 73 65 20 …



I used this element declaration in my DFDL schema to parse that binary:



<xs:element name="NAME"
            type="xs:string"
            dfdl:length="93"
            dfdl:lengthKind="explicit"
            dfdl:lengthUnits="characters"
            dfdl:textTrimKind="padChar"
            dfdl:textStringPadCharacter="%SP;"
            dfdl:textStringJustification="center"/>



Surprisingly, during parsing Daffodil modified the text to this:



Nova Scotia / Nouvelle-Ã?cosse



With this corresponding hex binary:



4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 3F 63 
6F 73 73 65 20 …



The two bytes after 2D changed -- from C3 89 (original) to C3 3F (after parsing).



Hex C3 89 is the UTF-8 encoding of the É symbol, whereas C3 3F is not a valid 
UTF-8 byte sequence.



Why did Daffodil change the binary?



One other piece of the puzzle: in my DFDL schema I specify 
encoding="ISO-8859-1". For a reason I do not understand, when I specify 
encoding="utf-8" I get an error message on parse.



Please help!



/Roger

