[ 
https://issues.apache.org/jira/browse/DAFFODIL-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Beckerle updated DAFFODIL-931:
--------------------------------------
    Fix Version/s:     (was: deferred)
                   2.2.0

> Variable-width charset with 'replace' can result in wrong length calculations
> -----------------------------------------------------------------------------
>
>                 Key: DAFFODIL-931
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-931
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Back End, General
>    Affects Versions: s12
>            Reporter: Michael Beckerle
>            Assignee: Steve Lawrence
>            Priority: Major
>             Fix For: 2.2.0
>
>
> Consider a UTF-8 string with a single non-decodable byte in the middle.
> When we parse it with 'replace', the non-decodable byte contributes one
> Unicode replacement character (U+FFFD) to the string.
> If you then take this string and call getBytes("utf-8") on it, you will not 
> get the original length. The bad byte is counted as 3 bytes instead of 1, 
> because U+FFFD takes 3 bytes in UTF-8.
> For a variable-width encoding like UTF-8, this is exactly how we currently 
> measure how far to move ahead in bytes: we call getBytes on the parsed 
> string to find out how long it was.
> As a result, we move too far ahead in the data (2 bytes too far for each 
> replaced byte).
> A test case to illustrate this is TBD, but it isn't hard to put together. 
> Take a string as described above, with its length coming from an expression, 
> and put it between two binary int fields. The binary int field after the 
> string will not be parsed properly, because we will advance too far on the 
> string.
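
The length mismatch described above can be reproduced outside Daffodil with
plain JDK calls. The following is a minimal sketch, not Daffodil code: the
class name ReplacementLengthDemo, its helper methods, and the 3-byte sample
input are made up for illustration. It decodes a byte sequence containing one
invalid byte, then compares the re-encoded length with the number of bytes
the decoder actually consumed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from Daffodil.
class ReplacementLengthDemo {
    // 'a', then 0xFF (never valid in UTF-8), then 'b': 3 bytes of input.
    static final byte[] INPUT = { 0x61, (byte) 0xFF, 0x62 };

    // new String(bytes, charset) replaces malformed input with U+FFFD.
    static String decodeWithReplace(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // The measurement the issue describes: re-encode and count the bytes.
    static int reencodedLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // One alternative: ask the decoder how many input bytes it consumed.
    static int consumedBytes(byte[] data) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never yields more chars than input bytes, so this is enough.
        CharBuffer out = CharBuffer.allocate(data.length);
        dec.decode(in, out, true);
        dec.flush(out);
        return in.position();
    }

    public static void main(String[] args) {
        String s = decodeWithReplace(INPUT);
        System.out.println("input bytes:      " + INPUT.length);
        System.out.println("re-encoded bytes: " + reencodedLength(s));
        System.out.println("consumed bytes:   " + consumedBytes(INPUT));
    }
}
```

With this input the re-encoded length is 2 bytes larger than the input,
which is the over-advance described in the issue. Reading the decoder's
input position is shown only as one possible way to get the true byte
count; it is not necessarily how the fix was implemented.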



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
