[jira] [Updated] (AVRO-1637) Handling multibyte UTF-8 characters in Ruby

Jackie Murphy (JIRA) Mon, 02 Feb 2015 09:17:02 -0800

     [ 
https://issues.apache.org/jira/browse/AVRO-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jackie Murphy updated AVRO-1637:
--------------------------------
    Description: 
It looks like the Ruby implementation of Avro doesn't successfully round-trip 
UTF-8 encoded strings containing multibyte characters.

Example:

{code}
require 'avro'
require 'stringio'

# Serialize obj to Avro binary using the given schema; returns the raw bytes.
def serialize(obj, schema)
  buffer = StringIO.new
  encoder = Avro::IO::BinaryEncoder.new(buffer)
  datum_writer = Avro::IO::DatumWriter.new(schema)
  datum_writer.write(obj, encoder)
  buffer.seek(0)
  buffer.read
end

# Decode Avro binary back into a Ruby object using the same schema.
def deserialize(avro_obj, schema)
  reader = StringIO.new(avro_obj)
  decoder = Avro::IO::BinaryDecoder.new(reader)
  datum_reader = Avro::IO::DatumReader.new(schema)
  datum_reader.read(decoder)
end
{code}

{code}
> schema = Avro::Schema.parse("{\"type\":\"record\",\"name\":\"Example\",\"fields\":[{\"name\":\"example_field\",\"type\":\"string\"}, {\"name\":\"other_field\",\"type\":\"string\"}]}")

> deserialize(serialize({'example_field'=> 'héllö world', 'other_field'=>'goodbye world'}, schema), schema)

{"example_field"=>"h\xC3\xA9ll\xC3\xB6 wor", "other_field"=>"d\x1Agoodbye world"}
{code}

Note that the length of the first field appears to be computed incorrectly 
(length of the string in characters rather than in bytes?), so the end of the 
first field spills into the second field.
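
To make the suspected mismatch concrete, here is a minimal sketch in plain Ruby (it does not call into the Avro library, and the variable names are only illustrative). It compares the character count and the UTF-8 byte count of the first field's value, and shows what is left over if only "character count" bytes are consumed:

```ruby
# 'héllö world', written with escapes so the literal is always UTF-8.
s = "h\u00E9ll\u00F6 world"

char_len = s.length    # number of characters
byte_len = s.bytesize  # number of bytes in the UTF-8 encoding

# If a reader consumes only char_len bytes, the tail of the string
# is left in the stream for the next field to pick up.
truncated = s.byteslice(0, char_len)
leftover  = s.byteslice(char_len, byte_len - char_len)

puts "characters: #{char_len}, bytes: #{byte_len}"
puts "consumed:   #{truncated.inspect}"
puts "left over:  #{leftover.inspect}"
```

The two-byte difference is consistent with the output above, where the first field stops at "wor" and the second field begins with "d".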

Also, if the bytes line up in an especially unlucky way, we can get an 
{{ArgumentError}}:

{code}
> deserialize(serialize({'example_field'=> '‘hello’ world', 'other_field'=>'goodbye world'}, schema), schema)
ArgumentError: negative length -56 given
{code}
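
A possible explanation for the negative length, assuming the first field's length prefix is too short: the decoder then reads the second field's length from a leftover byte of the first string. Avro encodes lengths as variable-length zig-zag longs, so an ASCII letter with an odd byte value decodes to a negative number. The sketch below is a standalone decoder written from the Avro specification; {{decode_zigzag_long}} is a hypothetical helper, not the library's own code.

```ruby
# Decode one Avro long (variable-length, zig-zag encoded) from an
# array of byte values. Hypothetical helper for illustration only.
def decode_zigzag_long(bytes)
  b = bytes.shift
  n = b & 0x7F
  shift = 7
  # The high bit of each byte signals that another byte follows.
  while (b & 0x80) != 0
    b = bytes.shift
    n |= (b & 0x7F) << shift
    shift += 7
  end
  # Undo the zig-zag mapping: even values are >= 0, odd values are < 0.
  (n >> 1) ^ -(n & 1)
end

# 'héllö world' is 2 bytes longer than its character count, leaving "ld".
after_first_example = decode_zigzag_long("ld".bytes)

# '‘hello’ world' is 4 bytes longer than its character count, leaving "orld".
after_second_example = decode_zigzag_long("orld".bytes)

# The real length prefix of 'goodbye world' is the single byte 0x1A.
real_prefix = decode_zigzag_long([0x1A])

puts after_first_example   # length read from 'l' (0x6C)
puts after_second_example  # length read from 'o' (0x6F)
puts real_prefix
```

Under that assumption, 'l' decodes to a positive length larger than the remaining data, so the read just returns whatever is left, while 'o' decodes to -56, which matches the reported error message.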

This looks similar to a previous issue with the Perl implementation, AVRO-1517.


> Handling multibyte UTF-8 characters in Ruby
> -------------------------------------------
>
>                 Key: AVRO-1637
>                 URL: https://issues.apache.org/jira/browse/AVRO-1637
>             Project: Avro
>          Issue Type: Bug
>            Reporter: Jackie Murphy
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
