Re: Why does Daffodil change the binary of non-ASCII characters?

Mike Beckerle Wed, 10 Oct 2018 08:25:05 -0700

Interesting,


So that error says it is looking for 80 utf-8 characters, not 80 bytes.


This is a supported behavior, but not typically what people want. Usually in 
legacy formats (like dBase) lengths are in bytes.


With lengthUnits='characters' and encoding iso-8859-1, every character is 
exactly one byte, so the length is identical to a length in bytes. In utf-8 a 
character can occupy more than one byte, so the two are not the same.


Try lengthUnits="bytes".
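
To make the difference concrete, here is a minimal Python sketch (not Daffodil 
code) that counts the NAME bytes quoted further down in this thread under both 
encodings. The helper name char_count is made up for illustration.

```python
# The NAME bytes from the original question (only the first pad byte is shown there).
raw = bytes.fromhex(
    "4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D "
    "C3 89 63 6F 73 73 65 20"
)


def char_count(data: bytes, encoding: str) -> int:
    """Hypothetical helper: number of characters after decoding."""
    return len(data.decode(encoding))


byte_len = len(raw)                            # length in bytes
latin1_chars = char_count(raw, "iso-8859-1")   # one character per byte
utf8_chars = char_count(raw, "utf-8")          # C3 89 collapses into one character

print(byte_len, latin1_chars, utf8_chars)
```

This prints 31 31 30: the same bytes hold one character fewer when read as 
utf-8, so a parser told to find a fixed number of characters has to read past 
the end of a fixed-width byte field whenever a multi-byte character appears.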


________________________________
From: Costello, Roger L. <[email protected]>
Sent: Wednesday, October 10, 2018 11:21:17 AM
To: [email protected]
Subject: RE: Why does Daffodil change the binary of non-ASCII characters?


Hi Mike,



Below is the error message that I get when I change encoding to utf-8 (i.e., 
encoding="utf-8"). Does that help narrow down the possible problem?  /Roger



[error] Parse Error: Failed to populate record[1832]. Cause: Parse Error: 
<SpecifiedLengthExplicitCharactersParser><STATEABB 
parser='StringOfSpecifiedLengthParser' 
/></SpecifiedLengthExplicitCharactersParser> - STATEABB - Parse failed.  Failed 
to find exactly 80 characters.

Schema context: STATEABB Location line 115 column 42 in dBase.dfdl.xsd

Data location was preceding byte 652456.

Schema context: sequence Location line 81 column 26 in dBase.dfdl.xsd

Data location was preceding byte 652456





From: Mike Beckerle <[email protected]>
Sent: Wednesday, October 10, 2018 11:03 AM
To: [email protected]
Subject: Re: Why does Daffodil change the binary of non-ASCII characters?



Your data is definitely UTF-8, or C3 89 would not be the LATIN CAPITAL LETTER E 
WITH ACUTE.



So using iso-8859-1 is going to do the wrong thing for sure.
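
A quick way to check this outside Daffodil is to decode the two bytes both 
ways. This is only a Python sketch using the standard library; the variable 
names are illustrative.

```python
import unicodedata

pair = bytes.fromhex("C3 89")

as_utf8 = pair.decode("utf-8")         # one character
as_latin1 = pair.decode("iso-8859-1")  # two characters, one per byte

print(len(as_utf8), unicodedata.name(as_utf8))
print([f"U+{ord(ch):04X}" for ch in as_latin1])
```

The first line printed is "1 LATIN CAPITAL LETTER E WITH ACUTE"; the second is 
['U+00C3', 'U+0089'], that is, "Ã" followed by a C1 control character.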



So let's figure out why your data fails to parse when specifying the correct 
character set encoding, utf-8.



Your hex bytes as presented are all valid UTF-8 according to this site:

http://www.endmemo.com/unicode/unicodeconverter.php



So, maybe there's a UTF-8 bug in Daffodil?











________________________________

From: Costello, Roger L. <[email protected]>
Sent: Wednesday, October 10, 2018 9:59:16 AM
To: [email protected]
Subject: Why does Daffodil change the binary of non-ASCII characters?



Hello DFDL community,



I have a binary file that contains, among other things, this text:



Nova Scotia / Nouvelle-Écosse



Its corresponding hex binary is this:



4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63 
6F 73 73 65 20 …



I used this element declaration in my DFDL schema to parse that binary:



<xs:element name="NAME"
            type="xs:string"
            dfdl:length="93"
            dfdl:lengthKind="explicit"
            dfdl:lengthUnits="characters"
            dfdl:textTrimKind="padChar"
            dfdl:textStringPadCharacter="%SP;"
            dfdl:textStringJustification="center"/>



Surprisingly, during parsing Daffodil modified the text to this:



Nova Scotia / Nouvelle-Ã?cosse



With this corresponding hex binary:



4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 3F 63 
6F 73 73 65 20 …



The two bytes following 2D changed -- from C3 89 (original) to C3 3F (after parsing).



In UTF-8, hex C3 89 encodes the É symbol (U+00C9), whereas C3 3F is not a 
valid UTF-8 byte sequence.
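
The claim about C3 3F can be checked with a short Python sketch. The second 
half is an assumption made purely for illustration: the thread does not say 
how the output was written, so the sketch supposes the text was decoded as 
ISO-8859-1 and then re-encoded through Windows-1252 with "?" substituted for 
unmappable characters. It shows one way such a byte pair can arise, not what 
Daffodil actually did.

```python
changed = bytes.fromhex("C3 3F")

# 0xC3 starts a two-byte UTF-8 sequence, but 0x3F ('?') is not a continuation byte.
try:
    changed.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Assumption for illustration only: decode as ISO-8859-1, then write through a
# Windows-1252 encoder that replaces unmappable characters with '?'.
misread = bytes.fromhex("C3 89").decode("iso-8859-1")
rewritten = misread.encode("cp1252", errors="replace")

print(valid_utf8, rewritten.hex().upper())
```

This prints "False C33F": U+0089 has no Windows-1252 mapping, so the encoder 
emits 3F in its place.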



Why did Daffodil change the binary?



One other piece of the puzzle: in my DFDL schema I specify 
encoding="ISO-8859-1". For a reason I do not understand, when I specify 
encoding="utf-8" I get an error message on parse.



Please help!



/Roger
