[
https://issues.apache.org/jira/browse/DAFFODIL-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530098#comment-16530098
]
Michael Beckerle commented on DAFFODIL-931:
-------------------------------------------
Assigning to Steve and setting the fix version to 2.2.0, as he is revising
the charset encoder/decoder architecture. That revision should fix this issue.
> Variable-width charset with 'replace' can result in wrong length calculations
> -----------------------------------------------------------------------------
>
> Key: DAFFODIL-931
> URL: https://issues.apache.org/jira/browse/DAFFODIL-931
> Project: Daffodil
> Issue Type: Bug
> Components: Back End, General
> Affects Versions: s12
> Reporter: Michael Beckerle
> Assignee: Steve Lawrence
> Priority: Major
> Fix For: 2.2.0
>
>
> Consider a UTF-8 string with a single non-decodable byte in the middle.
> When we parse it, the non-decodable byte contributes a Unicode replacement
> character (U+FFFD) to the string.
> If you then call getBytes("utf-8") on this string, you will not get the
> right length. The error will count as 3 bytes instead of 1, because U+FFFD
> takes 3 bytes in UTF-8.
> Right now, for a variable-width encoding like UTF-8, we measure how far to
> advance in bytes by doing exactly that: calling getBytes to find how long
> the string was.
> This causes us to advance too far into the data.
> A test case to illustrate this is TBD, but it isn't hard to put together.
> Take a string as described above, with its length coming from an expression,
> and put it between two binary int fields. The binary int field after the
> string will not be parsed properly, because we will advance too far past
> the string.
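
A minimal Java sketch of the length mismatch described above. It is not
Daffodil code and not the eventual test case. The class name, the helper
methods, and the data layout (4-byte int, 3-byte string, 4-byte int, plus one
extra trailing int so the mis-aligned read stays in bounds) are all
assumptions made for illustration. It relies only on the JDK's default
behavior of substituting U+FFFD for malformed input when decoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class ReplacementLengthDemo {

    // Decode with the JDK default (malformed input becomes U+FFFD), then
    // re-encode and count bytes, which is the measurement the report describes.
    static int reencodedLength(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical layout: int 1, string field, int 2, int 3 (big-endian).
    // The final int exists only so a mis-aligned read does not run off the end.
    static ByteBuffer buildData(byte[] field) {
        ByteBuffer buf = ByteBuffer.allocate(4 + field.length + 8);
        buf.putInt(1).put(field).putInt(2).putInt(3);
        buf.flip();
        return buf;
    }

    // Read the int that follows the string, given how many bytes we
    // believe the string occupied.
    static int intAfterString(ByteBuffer buf, int advance) {
        return buf.getInt(4 + advance);
    }

    public static void main(String[] args) {
        // 'a', a lone 0xFF byte (never valid in UTF-8), 'b'
        byte[] field = { 'a', (byte) 0xFF, 'b' };
        ByteBuffer buf = buildData(field);

        int actual = field.length;             // 3 bytes in the data
        int measured = reencodedLength(field); // 5 bytes: 1 + 3 (U+FFFD) + 1

        System.out.println("actual=" + actual + " measured=" + measured);
        System.out.println("correct read: " + intAfterString(buf, actual));   // 2
        System.out.println("wrong read:   " + intAfterString(buf, measured)); // 131072
    }
}
```

Advancing by the re-encoded length lands 2 bytes into the second int, so the
read straddles two fields and returns 0x00020000 instead of 2. One possible
way to avoid this is to track the number of bytes the decoder actually
consumed rather than re-encoding the decoded string.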