[ https://issues.apache.org/jira/browse/AVRO-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Karp updated AVRO-1517:
----------------------------
    Release Note: AVRO-1517 - Perl: Raise error when attempting to serialize unencoded Unicode string as 'bytes' or 'fixed' types.
          Status: Patch Available  (was: Open)

> Unicode strings are accepted as bytes type by perl API
> ------------------------------------------------------
>
>                 Key: AVRO-1517
>                 URL: https://issues.apache.org/jira/browse/AVRO-1517
>             Project: Avro
>          Issue Type: Bug
>          Components: perl
>            Reporter: John Karp
>            Assignee: John Karp
>         Attachments: AVRO-1517-0.patch
>
>
> By default in perl, a string is a sequence of bytes, values 0-255. However,
> if a Unicode character is included that cannot be represented with a single
> byte, the string gets 'upgraded' to a non-byte-based Unicode string allowing
> ordinals outside that range. When a string operation combines byte and
> non-byte Unicode strings, the result is always non-byte, with the byte string
> 'upgraded' first. Upgrading consists of utf8 encoding the string and setting
> a utf8 flag on it. ('utf8' is a variant of UTF-8 used by perl.)
> The perl Avro API is accepting these Unicode strings as-is for the 'bytes'
> type. This is a problem because 1) bytes and Unicode characters are not
> interchangeable, and if the user declares they are going to provide bytes
> they should provide bytes; any encoding is their job. 2) As Avro assembles
> the serialized data, perl 'upgrades' all the data, which has the effect of
> utf8 encoding our serialized binary data.
> The correct behavior is for the Avro perl API to raise an error when a
> Unicode string is provided for a 'bytes' value. (The behavior of 'string'
> won't change; it will still take Unicode strings as expected.)
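
To illustrate the check described above, here is a minimal sketch in Perl. It is not taken from AVRO-1517-0.patch; the helper name require_bytes and its error message are hypothetical. It only assumes the core utf8::downgrade function and the Encode module, and shows one way an encoder could reject strings that cannot be represented as bytes while leaving explicit encoding to the caller.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, not part of the Avro Perl API: returns the value as a
# byte string, or dies if it contains any code point above 255.
sub require_bytes {
    my ($data) = @_;    # work on a copy; utf8::downgrade modifies in place
    # With a true second argument, utf8::downgrade returns false instead of
    # dying when the string cannot be represented as bytes.
    utf8::downgrade($data, 1)
        or die "Invalid 'bytes' value: contains code points > 255\n";
    return $data;
}

# A string holding U+263A cannot be treated as bytes, so it is rejected.
my $text = "caf\x{e9} \x{263A}";
my $ok   = eval { require_bytes($text); 1 };
print $ok ? "accepted\n" : "rejected: $@";

# The caller encodes explicitly; the resulting byte string is accepted.
my $bytes = encode('UTF-8', $text);
print "encoded value accepted\n" if defined require_bytes($bytes);

# Concatenating a byte string with a Unicode string upgrades the result,
# which is how serialized binary data can end up utf8 encoded.
my $mixed = "\xff" . "\x{263A}";
print "mixed string is upgraded\n" if utf8::is_utf8($mixed);
```

Checking with utf8::downgrade rather than testing the utf8 flag alone means an upgraded string whose characters all fit in 0-255 is still accepted, since it carries the same byte values.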