RE: Why does Daffodil change the binary of non-ASCII characters?

Costello, Roger L. Wed, 10 Oct 2018 09:31:52 -0700

Hi Mike,

Okay, per your suggestion I set encoding="utf-8" and in the element declaration 
for NAME, I changed dfdl:lengthUnits="characters" to dfdl:lengthUnits="bytes". 
Here's the element declaration:


<xs:element name="NAME"
            type="xs:string"
            dfdl:length="93"
            dfdl:lengthKind="explicit"
            dfdl:lengthUnits="bytes"
            dfdl:textTrimKind="padChar"
            dfdl:textStringPadCharacter="%SP;"
            dfdl:textStringJustification="center"/>

Here are the bytes before parsing:

4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63 
6F 73 73 65 20 20 ...

Here are the bytes after parsing:

4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C9 63 6F 
73 73 65

The change (highlighted in yellow in the original message) is that the two 
bytes C3 89 became the single byte C9.
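
The byte change above is what one would see if the parsed character were written back out in a single-byte encoding. The sketch below only illustrates that arithmetic. It assumes the output was serialized as ISO-8859-1, which this thread does not confirm.

```python
# Bytes for the accented letter as they appear in the input file.
original = bytes.fromhex("C3 89")

# Decoding as UTF-8 yields one character, U+00C9 (LATIN CAPITAL LETTER E WITH ACUTE).
text = original.decode("utf-8")

# ASSUMPTION: the output is written in ISO-8859-1, where U+00C9 is the single byte C9.
latin1 = text.encode("iso-8859-1")

# Writing the same character as UTF-8 reproduces the original two bytes.
utf8 = text.encode("utf-8")

print(f"U+{ord(text):04X}", latin1.hex().upper(), utf8.hex(" ").upper())
```

If that assumption holds, the character was parsed correctly and only the encoding of the output differs from that of the input.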

That change to the element declaration has triggered other problems.

Here is text in the original binary file:

Nuevo León

Here is its binary:

4E 75 65 76 6F 20 4C 65 C3 B3 6E

Here is the XML that parsing generates:

<NAME>Nuevo Leó®¼¯NAME>

Here is the binary:

3C 4E 41 4D 45 3E 4E 75 65 76 6F 20 4C 65 F3 AE BC AF 4E 41 4D 45 3E

The part highlighted in grey in the original message corresponds to the data. 
The output data is the same as the input data up to hex 65, and then something 
strange happens. You can see that the end tag </NAME> got mangled.
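
One hypothesis that matches the mangled bytes exactly: the output contained ISO-8859-1 bytes (ó as F3), and some later step treated F3 as the lead byte of a four-byte UTF-8 sequence, absorbing the next three bytes ("n", "<", "/") as continuation bytes. The sketch below reproduces that arithmetic. The lenient decoding step is an assumption, not something this thread confirms about Daffodil or any particular tool.

```python
# ASSUMPTION: the XML was first written as ISO-8859-1.
latin1_bytes = "Nuevo Le\u00f3n</NAME>".encode("iso-8859-1")

# Locate the F3 byte and the three bytes after it: F3 6E 3C 2F.
i = latin1_bytes.index(0xF3)
lead, *followers = latin1_bytes[i:i + 4]

# HYPOTHETICAL lenient decoder: trust the lead byte (11110xxx means a
# four-byte sequence) and keep the low 6 bits of each following byte
# without checking that it really is a continuation byte.
cp = lead & 0x07
for f in followers:
    cp = (cp << 6) | (f & 0x3F)

# Re-encoding the resulting code point as strict UTF-8.
reencoded = chr(cp).encode("utf-8")

print(f"U+{cp:05X}", reencoded.hex(" ").upper())
```

The result, F3 AE BC AF, is the byte run that appears in the output where "ón</" was expected, which would also explain why the "</" of the end tag disappeared.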

Thoughts?

/Roger

From: Mike Beckerle <[email protected]>
Sent: Wednesday, October 10, 2018 11:24 AM
To: [email protected]
Subject: Re: Why does Daffodil change the binary of non-ASCII characters?


Interesting,



So that error says it is looking for 80 utf-8 characters, not 80 bytes.



This is supported behavior, but not typically what people want. In legacy 
formats such as dBase, lengths are usually in bytes.



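With lengthUnits='characters', a length in iso-8859-1 is identical to a length 
in bytes, because every character occupies exactly one byte. In UTF-8 it is 
clearly not the same, because each non-ASCII character takes two or more bytes.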



Try lengthUnits="bytes".
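
A small sketch of why the two length units diverge. The 80-byte field width is taken from the error message. The field contents are an illustrative assumption, not the actual record.

```python
name = "Nouvelle-\u00c9cosse"  # contains one non-ASCII character

for enc in ("iso-8859-1", "utf-8"):
    encoded = name.encode(enc)
    print(f"{enc}: {len(name)} characters, {len(encoded)} bytes")

# A fixed-width field of exactly 80 BYTES, space padded, as a legacy
# format would store it (ILLUSTRATIVE data, UTF-8 encoded).
field = name.encode("utf-8").ljust(80, b" ")
decoded = field.decode("utf-8")

# The 80 bytes hold fewer than 80 characters, so a parser asked for
# exactly 80 characters would run past the end of the field.
print(len(field), "bytes ->", len(decoded), "characters")
```

With lengthUnits="characters" and a UTF-8 encoding, a parser has to consume bytes from the following field to reach 80 characters, which is consistent with the "Failed to find exactly 80 characters" error.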



________________________________
From: Costello, Roger L. <[email protected]>
Sent: Wednesday, October 10, 2018 11:21:17 AM
To: [email protected]
Subject: RE: Why does Daffodil change the binary of non-ASCII characters?


Hi Mike,



Below is the error message that I get when I change encoding to utf-8 (i.e., 
encoding="utf-8"). Does that help narrow down the possible problem?  /Roger



[error] Parse Error: Failed to populate record[1832]. Cause: Parse Error: 
<SpecifiedLengthExplicitCharactersParser><STATEABB 
parser='StringOfSpecifiedLengthParser' 
/></SpecifiedLengthExplicitCharactersParser> - STATEABB - Parse failed.  Failed 
to find exactly 80 characters.

Schema context: STATEABB Location line 115 column 42 in dBase.dfdl.xsd

Data location was preceding byte 652456.

Schema context: sequence Location line 81 column 26 in dBase.dfdl.xsd

Data location was preceding byte 652456





From: Mike Beckerle <[email protected]>
Sent: Wednesday, October 10, 2018 11:03 AM
To: [email protected]
Subject: Re: Why does Daffodil change the binary of non-ASCII characters?



Your data is definitely UTF-8, or C3 89 would not be the LATIN CAPITAL LETTER E 
WITH ACUTE.



So using iso-8859-1 is going to do the wrong thing for sure.
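
To make this concrete, the sketch below decodes the same two bytes both ways. The final step, which reproduces the C3 3F seen in the original report, assumes the output was written in windows-1252 with unmappable characters replaced by "?". That is a guess consistent with the bytes, not something established in this thread.

```python
raw = bytes.fromhex("C3 89")

# Correct interpretation: one character, U+00C9.
as_utf8 = raw.decode("utf-8")

# Wrong interpretation: two characters, U+00C3 followed by the
# C1 control character U+0089.
as_latin1 = raw.decode("iso-8859-1")

# ASSUMPTION: the result is then written as windows-1252, which has no
# mapping for U+0089, so the encoder substitutes "?" (hex 3F).
written = as_latin1.encode("cp1252", errors="replace")

print([f"U+{ord(c):04X}" for c in as_utf8])
print([f"U+{ord(c):04X}" for c in as_latin1])
print(written.hex(" ").upper())
```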



So let's figure out why your data fails to parse when specifying the correct 
character set encoding, utf-8.



Your hex bytes as presented are all valid UTF-8 according to this site:

http://www.endmemo.com/unicode/unicodeconverter.php



So, maybe there's a UTF-8 bug in Daffodil?











________________________________

From: Costello, Roger L. <[email protected]>
Sent: Wednesday, October 10, 2018 9:59:16 AM
To: [email protected]
Subject: Why does Daffodil change the binary of non-ASCII characters?



Hello DFDL community,



I have a binary file that contains, among other things, this text:



Nova Scotia / Nouvelle-Écosse



Its corresponding hex binary is this:



4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63 
6F 73 73 65 20 ...



I used this element declaration in my DFDL schema to parse that binary:



<xs:element name="NAME"
            type="xs:string"
            dfdl:length="93"
            dfdl:lengthKind="explicit"
            dfdl:lengthUnits="characters"
            dfdl:textTrimKind="padChar"
            dfdl:textStringPadCharacter="%SP;"
            dfdl:textStringJustification="center"/>



Surprisingly, during parsing Daffodil modified the text to this:



Nova Scotia / Nouvelle-Ã?cosse



With this corresponding hex binary:



4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 3F 63 
6F 73 73 65 20 ...



The part highlighted in yellow in the original message changed from C3 89 
(original) to C3 3F (after parsing).



Hex C3 89 is the UTF-8 encoding of the É symbol, whereas C3 3F is not a valid 
UTF-8 byte sequence.



Why did Daffodil change the binary?



One other piece of the puzzle: in my DFDL schema I specify 
encoding="ISO-8859-1". For a reason I do not understand, when I specify 
encoding="utf-8" I get an error message on parse.



Please help!



/Roger
