[ https://issues.apache.org/jira/browse/AVRO-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863030#comment-17863030 ]
ASF subversion and git services commented on AVRO-1517:
-------------------------------------------------------

Commit 677e9829bae30cc76527c6f5702f8c2384be61c5 in avro's branch refs/heads/dependabot/cargo/lang/rust/env_logger-0.11.3 from José Joaquín Atria
[ https://gitbox.apache.org/repos/asf?p=avro.git;h=677e9829b ]

AVRO-1517: [Perl] Encode UTF-8 strings as bytes (#2979)

From John Karp's original description of [the issue]:

> By default in Perl, a string is a sequence of bytes, values 0-255.
> However, if a Unicode character is included that cannot be represented
> with a single byte, the string gets 'upgraded' to a non-byte-based
> Unicode string allowing ordinals outside that range. When string
> operations are done with byte and non-byte Unicode strings, the result
> is always non-byte, with the byte string first 'upgraded'. Upgrading
> consists of utf8 encoding and setting a utf8 flag on the string. ('utf8'
> is a variant of UTF-8 used by Perl)
>
> The Perl Avro API is accepting these Unicode strings as-is for the
> 'bytes' type. This is a problem because
>
> 1. values >255 are not valid as bytes, and any encoding is their job
>
> 2. As Avro assembles the serialized data, Perl 'upgrades' all the data,
>    having the effect of utf8 encoding our serialized binary data.
>
> The correct behavior is for the Avro Perl API to attempt to downgrade
> the string, and if this fails because it contained values >255 then to
> raise an error. (The behavior of 'string' won't change, it will still
> take Unicode strings as expected.)

This change, based on the one submitted for that ticket, adds these
behaviours and tests to exercise them.
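For readers unfamiliar with the downgrade-or-fail behaviour described above, here is a minimal sketch in plain Perl. It is not the code from the commit: `to_avro_bytes` is a hypothetical helper name used only for illustration, and the only real API it relies on is Perl's built-in `utf8::downgrade`, `utf8::upgrade` and `utf8::is_utf8`.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Avro Perl API): return the value as a
# plain byte string, or die if it contains a code point above 255.
sub to_avro_bytes {
    my ($value) = @_;
    my $copy = $value;    # work on a copy so the caller's scalar is untouched

    # With a true second argument, utf8::downgrade returns false on failure
    # instead of dying, so we can raise our own error.
    utf8::downgrade($copy, 1)
        or die "bytes value contains a character above 255\n";

    return $copy;
}

# A string whose code points all fit in one byte, forced into the
# 'upgraded' internal representation.
my $latin = "caf\x{e9}";
utf8::upgrade($latin);

my $bytes = to_avro_bytes($latin);
print "length: ", length($bytes), "\n";
print "utf8 flag: ", (utf8::is_utf8($bytes) ? "on" : "off"), "\n";

# A string with a code point above 255 cannot be downgraded.
my $ok = eval { to_avro_bytes("\x{263A}"); 1 };
print $ok ? "accepted\n" : "rejected: $@";
```

The helper downgrades a copy rather than the argument itself so that the caller's string keeps its original representation; whether the actual change does the same is not stated in this message.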
[the issue]: https://issues.apache.org/jira/browse/AVRO-1517

> Unicode strings are accepted as bytes and fixed type by perl API
> ----------------------------------------------------------------
>
>                 Key: AVRO-1517
>                 URL: https://issues.apache.org/jira/browse/AVRO-1517
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: perl
>            Reporter: John Karp
>            Assignee: José Joaquín Atria
>            Priority: Major
>             Fix For: 1.12.0
>
>         Attachments: AVRO-1517.patch
>
>
> By default in perl, a string is a sequence of bytes, values 0-255. However,
> if a Unicode character is included that cannot be represented with a single
> byte, the string gets 'upgraded' to a non-byte-based Unicode string allowing
> ordinals outside that range. When string operations are done with byte and
> non-byte Unicode strings, the result is always non-byte, with the byte string
> first 'upgraded'. Upgrading consists of utf8 encoding and setting a utf8 flag
> on the string. ('utf8' is a variant of UTF-8 used by perl)
> The perl Avro API is accepting these Unicode strings as-is for the 'bytes'
> type. This is a problem because 1) values >255 are not valid as bytes, and
> any encoding is their job. 2) As Avro assembles the serialized data, perl
> 'upgrades' all the data, having the effect of utf8 encoding our serialized
> binary data.
> The correct behavior is for the Avro perl API to attempt to downgrade the
> string, and if this fails because of contained values >255 then to raise an
> error. (The behavior of 'string' won't change, it will still take Unicode
> strings as expected.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)