I'm unable to reproduce this issue. Would it be possible to provide your schema, test data, and the command you're using that returns the incorrect output?
On 10/10/18 12:30 PM, Costello, Roger L. wrote:
> Hi Mike,
>
> Okay, per your suggestion I set encoding="utf-8" and in the element
> declaration for NAME, I changed dfdl:lengthUnits="characters" to
> dfdl:lengthUnits="bytes".
>
> Here’s the element declaration:
>
> <xs:element name="NAME"
>     type="xs:string"
>     dfdl:length="93"
>     dfdl:lengthKind="explicit"
>     dfdl:lengthUnits="bytes"
>     dfdl:textTrimKind="padChar"
>     dfdl:textStringPadCharacter="%SP;"
>     dfdl:textStringJustification="center"/>
>
> Here are the set of bytes before parsing:
>
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63 6F 73 73 65 20 20 …
>
> Here are the set of bytes after parsing:
>
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C9 63 6F 73 73 65
>
> The changes are shown in yellow.
>
> That change to the element declaration has triggered other problems.
>
> Here is text in the original binary file:
>
> Nuevo León
>
> Here is its binary:
>
> 4E 75 65 76 6F 20 4C 65 C3 B3 6E
>
> Here is the XML that parsing generates:
>
> <NAME>Nuevo Le󮼯NAME>
>
> Here is the binary:
>
> 3C 4E 41 4D 45 3E 4E 75 65 76 6F 20 4C 65 F3 AE BC AF 4E 41 4D 45 3E
>
> The part in grey corresponds to the data. The output data is the same
> as the input data up to hex 65 and then something strange happens.
>
> You can see that the end tag </NAME> got mangled.
>
> Thoughts?
>
> /Roger
>
> *From:* Mike Beckerle <[email protected]>
> *Sent:* Wednesday, October 10, 2018 11:24 AM
> *To:* [email protected]
> *Subject:* Re: Why does Daffodil change the binary of non-ASCII characters?
>
> Interesting,
>
> So that error says it is looking for 80 utf-8 characters, not 80 bytes.
>
> This is a supported behavior, but not typically what people want. Usually in
> legacy formats (like dbase) lengths are in bytes.
>
> If you have lengthUnits='characters' in iso-8859-1 that's identical to bytes,
> but in utf8 it is clearly not the same as bytes.
>
> Try lengthUnits="bytes".
>
> --------------------------------------------------------------------------------
>
> *From:* Costello, Roger L. <[email protected]>
> *Sent:* Wednesday, October 10, 2018 11:21:17 AM
> *To:* [email protected]
> *Subject:* RE: Why does Daffodil change the binary of non-ASCII characters?
>
> Hi Mike,
>
> Below is the error message that I get when I change encoding to utf-8 (i.e.,
> encoding="utf-8"). Does that help narrow down the possible problem? /Roger
>
> [error] Parse Error: Failed to populate record[1832]. Cause: Parse Error:
> <SpecifiedLengthExplicitCharactersParser><STATEABB
> parser='StringOfSpecifiedLengthParser'
> /></SpecifiedLengthExplicitCharactersParser> - STATEABB - Parse failed.
> Failed to find exactly 80 characters.
>
> Schema context: STATEABB Location line 115 column 42 in dBase.dfdl.xsd
>
> Data location was preceding byte 652456.
>
> Schema context: sequence Location line 81 column 26 in dBase.dfdl.xsd
>
> Data location was preceding byte 652456
>
> *From:* Mike Beckerle <[email protected]>
> *Sent:* Wednesday, October 10, 2018 11:03 AM
> *To:* [email protected]
> *Subject:* Re: Why does Daffodil change the binary of non-ASCII characters?
>
> Your data is definitely UTF-8, or C3 89 would not be the LATIN CAPITAL LETTER
> E WITH ACUTE.
>
> So using iso-8859-1 is going to do the wrong thing for sure.
>
> So let's figure out why your data fails to parse when specifying the correct
> character set encoding, utf-8.
>
> Your hex bytes as presented are all valid Utf-8 according to this site:
>
> http://www.endmemo.com/unicode/unicodeconverter.php
>
> So, maybe there's a utf-8 bug in daffodil?
>
> --------------------------------------------------------------------------------
>
> *From:* Costello, Roger L. <[email protected]>
> *Sent:* Wednesday, October 10, 2018 9:59:16 AM
> *To:* [email protected]
> *Subject:* Why does Daffodil change the binary of non-ASCII characters?
>
> Hello DFDL community,
>
> I have a binary file that contains, among other things, this text:
>
> Nova Scotia / Nouvelle-Écosse
>
> Its corresponding hex binary is this:
>
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63 6F 73 73 65 20 …
>
> I used this element declaration in my DFDL schema to parse that binary:
>
> <xs:element name="NAME"
>     type="xs:string"
>     dfdl:length="93"
>     dfdl:lengthKind="explicit"
>     dfdl:lengthUnits="characters"
>     dfdl:textTrimKind="padChar"
>     dfdl:textStringPadCharacter="%SP;"
>     dfdl:textStringJustification="center"/>
>
> Surprisingly, during parsing Daffodil modified the text to this:
>
> Nova Scotia / Nouvelle-Ã?cosse
>
> With this corresponding hex binary:
>
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 3F 63 6F 73 73 65 20 …
>
> The part in yellow changed -- from C3 89 (original) to C3 3F (after parsing).
>
> Hex C3 89 corresponds to the É symbol whereas C3 3F is not a valid unicode
> codepoint.
>
> Why did Daffodil change the binary?
>
> One other piece of the puzzle: in my DFDL schema I specify
> encoding="ISO-8859-1". For a reason I do not understand, when I
> specify encoding="utf-8" I get an error message on parse.
>
> Please help!
>
> /Roger
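A small sketch for checking the byte dumps quoted in this thread. It is plain Python, not Daffodil, and the helper names `to_hex` and `describe` are made up for illustration. It decodes the quoted bytes as UTF-8 and as ISO-8859-1 and compares the lengths, which shows why a length given in characters and a length given in bytes differ for UTF-8 data but not for ISO-8859-1.

```python
# Bytes as quoted in the thread (surrounding text and pad spaces omitted).
ECOSSE = bytes.fromhex("4E 6F 75 76 65 6C 6C 65 2D C3 89 63 6F 73 73 65")  # Nouvelle-Écosse
LEON = bytes.fromhex("4E 75 65 76 6F 20 4C 65 C3 B3 6E")  # Nuevo León


def to_hex(raw: bytes) -> str:
    """Format bytes the way the thread prints them, e.g. 'C3 89'."""
    return " ".join(f"{b:02X}" for b in raw)


def describe(raw: bytes) -> dict:
    """Hypothetical helper: compare byte and character counts for one field."""
    text = raw.decode("utf-8")
    return {
        # Length of the field when measured in bytes.
        "bytes": len(raw),
        # Length in characters if the data is read as UTF-8.
        "utf8_chars": len(text),
        # ISO-8859-1 maps every byte to one character, so this equals "bytes".
        "latin1_chars": len(raw.decode("iso-8859-1")),
        "text": text,
        # The same text written back out as ISO-8859-1.
        "latin1_hex": to_hex(text.encode("iso-8859-1")),
    }


if __name__ == "__main__":
    for raw in (ECOSSE, LEON):
        print(describe(raw))
```

For the Nouvelle-Écosse bytes, re-encoding the decoded text as ISO-8859-1 gives `... 2D C9 63 6F 73 73 65`, the same bytes reported after parsing with dfdl:lengthUnits="bytes". That would be consistent with the text being decoded correctly and then written out in a single-byte encoding, but the thread does not confirm how the output was produced.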
