Hi Mike,

Okay, per your suggestion I set encoding="utf-8" and in the element declaration for NAME, I changed dfdl:lengthUnits="characters" to dfdl:lengthUnits="bytes". Here's the element declaration:

<xs:element name="NAME" type="xs:string" dfdl:length="93" dfdl:lengthKind="explicit" dfdl:lengthUnits="bytes" dfdl:textTrimKind="padChar" dfdl:textStringPadCharacter="%SP;" dfdl:textStringJustification="center"/>

Here is the set of bytes before parsing:

4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63 6F 73 73 65 20 20 ...

Here is the set of bytes after parsing:

4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C9 63 6F 73 73 65

The changes are shown in yellow: C3 89 became C9.

That change to the element declaration has triggered other problems. Here is text in the original binary file:

Nuevo León

Here is its binary:

4E 75 65 76 6F 20 4C 65 C3 B3 6E

Here is the XML that parsing generates:

<NAME>Nuevo Le󮼯NAME>

Here is the binary:

3C 4E 41 4D 45 3E 4E 75 65 76 6F 20 4C 65 F3 AE BC AF 4E 41 4D 45 3E

The part in grey corresponds to the data. The output data is the same as the input data up to hex 65, and then something strange happens. You can see that the end tag </NAME> got mangled.

Thoughts?

/Roger

From: Mike Beckerle <[email protected]>
Sent: Wednesday, October 10, 2018 11:24 AM
To: [email protected]
Subject: Re: Why does Daffodil change the binary of non-ASCII characters?

Interesting. So that error says it is looking for 80 UTF-8 characters, not 80 bytes. This is a supported behavior, but not typically what people want. Usually in legacy formats (like dBase) lengths are in bytes. If you have lengthUnits='characters' in iso-8859-1 that's identical to bytes, but in UTF-8 it is clearly not the same as bytes.

Try lengthUnits="bytes".

________________________________
From: Costello, Roger L. <[email protected]>
Sent: Wednesday, October 10, 2018 11:21:17 AM
To: [email protected]
Subject: RE: Why does Daffodil change the binary of non-ASCII characters?

Hi Mike,

Below is the error message that I get when I change encoding to utf-8 (i.e., encoding="utf-8").
Does that help narrow down the possible problem?

/Roger

[error] Parse Error: Failed to populate record[1832]. Cause: Parse Error: <SpecifiedLengthExplicitCharactersParser><STATEABB parser='StringOfSpecifiedLengthParser' /></SpecifiedLengthExplicitCharactersParser> - STATEABB - Parse failed. Failed to find exactly 80 characters.
Schema context: STATEABB Location line 115 column 42 in dBase.dfdl.xsd
Data location was preceding byte 652456.
Schema context: sequence Location line 81 column 26 in dBase.dfdl.xsd
Data location was preceding byte 652456

From: Mike Beckerle <[email protected]>
Sent: Wednesday, October 10, 2018 11:03 AM
To: [email protected]
Subject: Re: Why does Daffodil change the binary of non-ASCII characters?

Your data is definitely UTF-8, or C3 89 would not be the LATIN CAPITAL LETTER E WITH ACUTE. So using iso-8859-1 is going to do the wrong thing for sure.

So let's figure out why your data fails to parse when specifying the correct character set encoding, utf-8.

Your hex bytes as presented are all valid UTF-8 according to this site: http://www.endmemo.com/unicode/unicodeconverter.php

So, maybe there's a UTF-8 bug in Daffodil?

________________________________
From: Costello, Roger L. <[email protected]>
Sent: Wednesday, October 10, 2018 9:59:16 AM
To: [email protected]
Subject: Why does Daffodil change the binary of non-ASCII characters?

Hello DFDL community,

I have a binary file that contains, among other things, this text:

Nova Scotia / Nouvelle-Écosse

Its corresponding hex binary is this:

4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63 6F 73 73 65 20 ...

I used this element declaration in my DFDL schema to parse that binary:

<xs:element name="NAME" type="xs:string" dfdl:length="93" dfdl:lengthKind="explicit" dfdl:lengthUnits="characters" dfdl:textTrimKind="padChar" dfdl:textStringPadCharacter="%SP;" dfdl:textStringJustification="center"/>

Surprisingly, during parsing Daffodil modified the text to this:

Nova Scotia / Nouvelle-Ã?cosse

With this corresponding hex binary:

4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 3F 63 6F 73 73 65 20 ...

The part in yellow changed: from C3 89 (original) to C3 3F (after parsing). Hex C3 89 is the UTF-8 encoding of the É symbol, whereas C3 3F is not a valid UTF-8 byte sequence. Why did Daffodil change the binary?

One other piece of the puzzle: in my DFDL schema I specify encoding="ISO-8859-1". For a reason I do not understand, when I specify encoding="utf-8" I get an error message on parse.

Please help!

/Roger
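
For readers following the thread, the byte-level effects discussed above can be reproduced outside Daffodil. The following is only a minimal Python sketch, not Daffodil's implementation: it decodes the hex bytes quoted in the messages under both encodings and compares byte counts with character counts. The helper names (hex_bytes, codepoints) are made up for illustration, and the final step, which uses the cp1252 codec with replacement, is an assumption about one way C3 3F could arise; the thread does not confirm how the output was actually written.

```python
# Bytes quoted in the thread.
e_acute = bytes.fromhex("C3 89")  # the "É" in "Nouvelle-Écosse"
nuevo_leon = bytes.fromhex("4E 75 65 76 6F 20 4C 65 C3 B3 6E")  # "Nuevo León"


def hex_bytes(data):
    """Hypothetical helper: format bytes as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in data)


def codepoints(text):
    """Hypothetical helper: list the Unicode code points of a string."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# The same two bytes are one character in UTF-8 but two in ISO-8859-1.
as_utf8 = e_acute.decode("utf-8")
as_latin1 = e_acute.decode("iso-8859-1")
print("utf-8      :", codepoints(as_utf8))
print("iso-8859-1 :", codepoints(as_latin1))

# Byte length equals character length in ISO-8859-1, but not in UTF-8,
# which is why lengthUnits="characters" and lengthUnits="bytes" differ there.
n_bytes = len(nuevo_leon)
n_chars_latin1 = len(nuevo_leon.decode("iso-8859-1"))
n_chars_utf8 = len(nuevo_leon.decode("utf-8"))
print("bytes:", n_bytes, "latin-1 chars:", n_chars_latin1, "utf-8 chars:", n_chars_utf8)

# A correctly decoded "É" written back out as ISO-8859-1 is the single byte C9.
reencoded_latin1 = as_utf8.encode("iso-8859-1")
print("É as iso-8859-1:", hex_bytes(reencoded_latin1))

# ASSUMPTION (not confirmed in the thread): if the mis-decoded pair
# U+00C3 U+0089 is later written with a codec that cannot represent U+0089,
# such as Python's cp1252 with errors="replace", the second byte becomes "?".
replaced = as_latin1.encode("cp1252", errors="replace")
print("mis-decoded pair as cp1252:", hex_bytes(replaced))
```

Under these assumptions the sketch is consistent with the byte sequences reported in the messages: C9 when the data is decoded as UTF-8 and serialized as ISO-8859-1, and C3 3F when it is decoded as ISO-8859-1 and an unmappable character is replaced on output.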
