[ 
https://issues.apache.org/jira/browse/THRIFT-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Beyer updated THRIFT-1023:
---------------------------------

    Attachment: THRIFT-1023-refactor-transport-protocol-for-ruby19-v6.patch

Attachment: [^THRIFT-1023-refactor-transport-protocol-for-ruby19-v6.patch]

This patch includes all changes to support UTF-8 encoding of Thrift strings in 
the binary protocols when running on Ruby 1.9+, including test cases.

This patch also includes a few tweaks to the JSON protocol to fully support 
reading of unicode escapes sequences in the BMP (U+0000 to U+FFFF), as well as 
some test cases. I would appreciate any review of this code, as I wasn't clear 
why the original code asserted that all unicode escape sequences must be 
between U+0000 and U+00FF. I did not update the JSON protocol to fully handle 
unicode character above the BMP, as JSON uses a surrogate pair, double escape 
sequence to represent those code points and this would require a deeper 
refactoring that I don't have time for at the moment. Check out [RFC 4627 
Section 2.5 for details|http://www.ietf.org/rfc/rfc4627.txt].
                
> Thrift encoding  (UTF-8) issue with Ruby 1.9.2
> ----------------------------------------------
>
>                 Key: THRIFT-1023
>                 URL: https://issues.apache.org/jira/browse/THRIFT-1023
>             Project: Thrift
>          Issue Type: Bug
>          Components: Ruby - Library
>    Affects Versions: 0.5
>         Environment: OSX, Ruby 1.9.2, Thrift Gem version 0.5.0
>            Reporter: Vincent Peres
>            Assignee: Jake Farrell
>             Fix For: 0.9
>
>         Attachments: THRIFT-1023-build-ruby19.patch, 
> THRIFT-1023-refactor-transport-protocol-for-ruby19.patch, 
> THRIFT-1023-refactor-transport-protocol-for-ruby19-v2.patch, 
> THRIFT-1023-refactor-transport-protocol-for-ruby19-v3.patch, 
> THRIFT-1023-refactor-transport-protocol-for-ruby19-v4.patch, 
> THRIFT-1023-refactor-transport-protocol-for-ruby19-v5.patch, 
> THRIFT-1023-refactor-transport-protocol-for-ruby19-v6.patch, 
> thrift-1023-utf8-encoding-issue.path
>
>
> I came up with an encoding issue coming from the Thrift library, and 
> especially the BufferedTransport class.
> I've decided to write down few tests to give you a concrete example :
> # encoding: utf-8
> require 'spec_helper'
> describe "encoding" do
>  before do
>    transport = 
> Thrift::BufferedTransport.new(Thrift::Socket.new(MR_CONFIG['host'], 9090))
>    protocol  = Thrift::BinaryProtocol.new(transport)
>    @client   = Apache::Hadoop::Hbase::Thrift::Hbase::Client.new(protocol)
>    transport.open()
>    @table_name = "encoding_test"
>    @column_family = "info:"
>  end
>  it "should create a new table" do
>    column = Apache::Hadoop::Hbase::Thrift::ColumnDescriptor.new{|c| c.name= 
> @column_family}
>    @client.createTable(@table_name, [column]).should be_nil
>  end
>  it "should save standard caracteres" do
>    m        = Apache::Hadoop::Hbase::Thrift::Mutation.new
>    m.column = "info:first_name"
>    m.value  = "Vincent"
>    m.value.encoding.should == Encoding::UTF_8
>    @client.mutateRow(@table_name, "ID1", [m]).should be_nil
>  end
>  it "should save UTF8 caracteres" do
>    m        = Apache::Hadoop::Hbase::Thrift::Mutation.new
>    m.column = "info:first_name"
>    m.value  = "Thorbjørn"
>    m.value.encoding.should == Encoding::UTF_8
>    @client.mutateRow(@table_name, "ID1", [m]).should be_nil
>  end
>  it "should destroy the table" do
>    @client.disableTable(@table_name).should be_nil
>    @client.deleteTable(@table_name).should be_nil
>  end
> end
> It fails when it tries to save the UTF8 string including the caractere 'ø'.
> Here is the output :
>  1) encoding should save UTF8 caracteres
>     Failure/Error: @client.mutateRow(@table_name, "ID1", [m]).should be_nil
>     incompatible character encodings: ASCII-8BIT and UTF-8
>     
> #/Users/vincentp/.rvm/gems/ruby-1.9.2-p0/gems/thrift-0.5.0/lib/thrift/transport/buffered_transport.rb:59:in
> `write'
>     
> #/Users/vincentp/.rvm/gems/ruby-1.9.2-p0/gems/thrift-0.5.0/lib/thrift/protocol/binary_protocol.rb:107:in
> `write_string'
>     
> #/Users/vincentp/.rvm/gems/ruby-1.9.2-p0/gems/thrift-0.5.0/lib/thrift/client.rb:35:in
> `write'
>     
> #/Users/vincentp/.rvm/gems/ruby-1.9.2-p0/gems/thrift-0.5.0/lib/thrift/client.rb:35:in
> `send_message'
>     # ./lib/thrift/hbase.rb:289:in `send_mutateRow'
>     # ./lib/thrift/hbase.rb:284:in `mutateRow'
>     # ./spec/thrift/cases/encoding_spec.rb:37:in `block (2 levels) in <top
> (required)>'
> Let me know if you need any other details, thank you !

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to