[ 
https://issues.apache.org/jira/browse/AVRO-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863030#comment-17863030
 ] 

ASF subversion and git services commented on AVRO-1517:
-------------------------------------------------------

Commit 677e9829bae30cc76527c6f5702f8c2384be61c5 in avro's branch 
refs/heads/dependabot/cargo/lang/rust/env_logger-0.11.3 from José Joaquín Atria
[ https://gitbox.apache.org/repos/asf?p=avro.git;h=677e9829b ]

AVRO-1517: [Perl] Encode UTF-8 strings as bytes (#2979)

From John Karp's original description of [the issue]:

> By default in Perl, a string is a sequence of bytes, values 0-255.
> However, if a Unicode character is included that cannot be represented
> with a single byte, the string gets 'upgraded' to a non-byte-based
> Unicode string allowing ordinals outside that range. When string
> operations are done with byte and non-byte Unicode strings, the result
> is always non-byte, with the byte string first 'upgraded'. Upgrading
> consists of utf8 encoding and setting a utf8 flag on the string. ('utf8'
> is a variant of UTF-8 used by Perl)
>
> The Perl Avro API is accepting these Unicode strings as-is for the
> 'bytes' type. This is a problem because
>
>   1. values >255 are not valid as bytes, and any encoding is their job
>
>   2. As Avro assembles the serialized data, Perl 'upgrades' all the data,
>      having the effect of utf8 encoding our serialized binary data.
>
> The correct behavior is for the Avro Perl API to attempt to downgrade
> the string, and if this fails because it contained values >255 then to
> raise an error. (The behavior of 'string' won't change, it will still
> take Unicode strings as expected.)

This change, based on the one submitted for that ticket, adds these
behaviours and tests to exercise them.

[the issue]: https://issues.apache.org/jira/browse/AVRO-1517
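
As an illustration of the downgrade-or-fail behaviour described above, here is a minimal sketch in plain Perl. It is not the code from the commit: the helper name `to_bytes` and its error message are hypothetical, and only the core `utf8::upgrade`, `utf8::downgrade` and `utf8::is_utf8` functions are used.

```perl
use strict;
use warnings;

# Only code points 0-255, but stored in Perl's upgraded (utf8-flagged) form.
my $latin = "caf\x{e9}";
utf8::upgrade($latin);    # changes the internal representation, not the characters

# Contains a code point above 255, so it has no byte representation.
my $wide = "snow\x{2603}";

# Hypothetical helper (not the Avro API): return a byte string or die.
sub to_bytes {
    my ($value) = @_;
    my $copy = $value;    # work on a copy so the caller's string is untouched
    # With a true second argument, utf8::downgrade returns false on failure
    # instead of dying, which lets us raise our own error.
    utf8::downgrade($copy, 1)
        or die "Expecting byte values, but got characters above 255\n";
    return $copy;
}

# Downgrading succeeds when every character fits in one byte.
my $bytes = to_bytes($latin);
print "downgraded: ", (utf8::is_utf8($bytes) ? "still flagged" : "plain bytes"), "\n";

# Downgrading fails for the wide string, so the helper raises an error.
my $error = eval { to_bytes($wide); 1 } ? '' : $@;
print "wide string rejected: $error" if $error;

# Concatenating a byte string with a flagged string upgrades the result,
# which is how binary data can end up utf8-encoded on output.
my $mixed = "\xC8" . $wide;
print "mixed is flagged: ", (utf8::is_utf8($mixed) ? "yes" : "no"), "\n";
```

The 'string' type would skip this check, since it is expected to accept Unicode text.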

> Unicode strings are accepted as bytes and fixed type by perl API
> ----------------------------------------------------------------
>
>                 Key: AVRO-1517
>                 URL: https://issues.apache.org/jira/browse/AVRO-1517
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: perl
>            Reporter: John Karp
>            Assignee: José Joaquín Atria
>            Priority: Major
>             Fix For: 1.12.0
>
>         Attachments: AVRO-1517.patch
>
>
> By default in perl, a string is a sequence of bytes, values 0-255. However, 
> if a Unicode character is included that cannot be represented with a single 
> byte, the string gets 'upgraded' to a non-byte-based Unicode string allowing 
> ordinals outside that range. When string operations are done with byte and 
> non-byte Unicode strings, the result is always non-byte, with the byte string 
> first 'upgraded'. Upgrading consists of utf8 encoding and setting a utf8 flag 
> on the string. ('utf8' is a variant of UTF-8 used by perl)
> The perl Avro API is accepting these Unicode strings as-is for the 'bytes' 
> type. This is a problem because 1) values >255 are not valid as bytes, and 
> any encoding is their job. 2) As Avro assembles the serialized data, perl 
> 'upgrades' all the data, having the effect of utf8 encoding our serialized 
> binary data.
> The correct behavior is for the Avro perl API to attempt to downgrade the 
> string, and if this fails because of contained values >255 then to raise an 
> error. (The behavior of 'string' won't change, it will still take Unicode 
> strings as expected.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
