[ https://issues.apache.org/jira/browse/THRIFT-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nathan Beyer updated THRIFT-1023:
---------------------------------
Attachment: THRIFT-1023-refactor-transport-protocol-for-ruby19-v6.patch
This patch includes all changes to support UTF-8 encoding of Thrift strings in
the binary protocols when running on Ruby 1.9+, including test cases.
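For readers who want to see the general idea, here is a minimal sketch of writing and reading a length-prefixed UTF-8 string through a binary buffer on Ruby 1.9+. It is an illustration only, not the code in the patch: the helper names `write_binary_string` and `read_binary_string` are hypothetical, and the 4-byte big-endian length prefix is assumed to match the binary protocol's string framing.

```ruby
# Hypothetical helpers for illustration; not the Thrift library's implementation.

# Append a string to a binary buffer as a 4-byte big-endian byte length
# followed by the UTF-8 bytes of the string.
def write_binary_string(buffer, str)
  # encode returns a copy, so relabelling it as binary leaves the caller's string alone.
  bytes = str.encode(Encoding::UTF_8).force_encoding(Encoding::BINARY)
  buffer << [bytes.bytesize].pack('N') << bytes
end

# Read one length-prefixed string from the start of a binary buffer and
# label the payload as UTF-8 again.
def read_binary_string(buffer)
  size = buffer[0, 4].unpack('N').first   # indexing a binary string is byte-based
  buffer[4, size].force_encoding(Encoding::UTF_8)
end

buffer = String.new                       # an empty String.new is ASCII-8BIT (binary)
write_binary_string(buffer, "Thorbj\u00F8rn")
decoded = read_binary_string(buffer)
```

The point of the sketch is that the length prefix counts bytes, not characters, and that the buffer only ever sees binary-labelled strings, so Ruby never has to reconcile two encodings during concatenation.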
This patch also includes a few tweaks to the JSON protocol to fully support
reading Unicode escape sequences in the BMP (U+0000 to U+FFFF), along with
some test cases. I would appreciate a review of this code, as it wasn't clear
to me why the original code asserted that all Unicode escape sequences must be
between U+0000 and U+00FF. I did not update the JSON protocol to handle
Unicode characters above the BMP: JSON represents those code points as a
surrogate pair of two escape sequences, and supporting that would require a
deeper refactoring that I don't have time for at the moment. See [RFC 4627
Section 2.5 for details|http://www.ietf.org/rfc/rfc4627.txt].
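To make the surrogate pair point concrete, here is a rough sketch of how two JSON \uXXXX escapes could be combined into one code point above the BMP. It is not part of the patch and not how the JSON protocol currently works; `combine_surrogates` is a hypothetical helper, and the sample escape is the G clef example (U+1D11E) from RFC 4627.

```ruby
# Hypothetical helper for illustration; not part of the Thrift JSON protocol.
# Combines a high and a low UTF-16 surrogate into a single code point.
def combine_surrogates(high, low)
  unless (0xD800..0xDBFF).cover?(high) && (0xDC00..0xDFFF).cover?(low)
    raise ArgumentError, 'not a valid surrogate pair'
  end
  0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
end

# The JSON text for U+1D11E, held in a single-quoted string so the
# backslashes stay literal.
escaped = '\ud834\udd1e'

# Pull out the two 4-digit hex escapes and convert them to integers.
high, low = escaped.scan(/\\u([0-9a-fA-F]{4})/).flatten.map { |hex| hex.to_i(16) }

code_point = combine_surrogates(high, low)
clef = [code_point].pack('U')             # one character, encoded as UTF-8
```

A reader that handles only one escape at a time cannot do this, because the first escape alone is not a valid character; it has to look ahead for the second one, which is the deeper refactoring mentioned above.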
> Thrift encoding (UTF-8) issue with Ruby 1.9.2
> ----------------------------------------------
>
> Key: THRIFT-1023
> URL: https://issues.apache.org/jira/browse/THRIFT-1023
> Project: Thrift
> Issue Type: Bug
> Components: Ruby - Library
> Affects Versions: 0.5
> Environment: OSX, Ruby 1.9.2, Thrift Gem version 0.5.0
> Reporter: Vincent Peres
> Assignee: Jake Farrell
> Fix For: 0.9
>
> Attachments: THRIFT-1023-build-ruby19.patch,
> THRIFT-1023-refactor-transport-protocol-for-ruby19.patch,
> THRIFT-1023-refactor-transport-protocol-for-ruby19-v2.patch,
> THRIFT-1023-refactor-transport-protocol-for-ruby19-v3.patch,
> THRIFT-1023-refactor-transport-protocol-for-ruby19-v4.patch,
> THRIFT-1023-refactor-transport-protocol-for-ruby19-v5.patch,
> THRIFT-1023-refactor-transport-protocol-for-ruby19-v6.patch,
> thrift-1023-utf8-encoding-issue.path
>
>
> I came across an encoding issue in the Thrift library, specifically in
> the BufferedTransport class.
> I've written a few tests to give you a concrete example:
> # encoding: utf-8
> require 'spec_helper'
>
> describe "encoding" do
>   before do
>     transport = Thrift::BufferedTransport.new(Thrift::Socket.new(MR_CONFIG['host'], 9090))
>     protocol = Thrift::BinaryProtocol.new(transport)
>     @client = Apache::Hadoop::Hbase::Thrift::Hbase::Client.new(protocol)
>     transport.open()
>     @table_name = "encoding_test"
>     @column_family = "info:"
>   end
>
>   it "should create a new table" do
>     column = Apache::Hadoop::Hbase::Thrift::ColumnDescriptor.new { |c| c.name = @column_family }
>     @client.createTable(@table_name, [column]).should be_nil
>   end
>
>   it "should save standard caracteres" do
>     m = Apache::Hadoop::Hbase::Thrift::Mutation.new
>     m.column = "info:first_name"
>     m.value = "Vincent"
>     m.value.encoding.should == Encoding::UTF_8
>     @client.mutateRow(@table_name, "ID1", [m]).should be_nil
>   end
>
>   it "should save UTF8 caracteres" do
>     m = Apache::Hadoop::Hbase::Thrift::Mutation.new
>     m.column = "info:first_name"
>     m.value = "Thorbjørn"
>     m.value.encoding.should == Encoding::UTF_8
>     @client.mutateRow(@table_name, "ID1", [m]).should be_nil
>   end
>
>   it "should destroy the table" do
>     @client.disableTable(@table_name).should be_nil
>     @client.deleteTable(@table_name).should be_nil
>   end
> end
> It fails when it tries to save the UTF-8 string containing the character 'ø'.
> Here is the output:
> 1) encoding should save UTF8 caracteres
>    Failure/Error: @client.mutateRow(@table_name, "ID1", [m]).should be_nil
>    incompatible character encodings: ASCII-8BIT and UTF-8
>    # /Users/vincentp/.rvm/gems/ruby-1.9.2-p0/gems/thrift-0.5.0/lib/thrift/transport/buffered_transport.rb:59:in `write'
>    # /Users/vincentp/.rvm/gems/ruby-1.9.2-p0/gems/thrift-0.5.0/lib/thrift/protocol/binary_protocol.rb:107:in `write_string'
>    # /Users/vincentp/.rvm/gems/ruby-1.9.2-p0/gems/thrift-0.5.0/lib/thrift/client.rb:35:in `write'
>    # /Users/vincentp/.rvm/gems/ruby-1.9.2-p0/gems/thrift-0.5.0/lib/thrift/client.rb:35:in `send_message'
>    # ./lib/thrift/hbase.rb:289:in `send_mutateRow'
>    # ./lib/thrift/hbase.rb:284:in `mutateRow'
>    # ./spec/thrift/cases/encoding_spec.rb:37:in `block (2 levels) in <top (required)>'
> Let me know if you need any other details, thank you!
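The same class of error can be triggered without HBase or a running Thrift server. The sketch below is an assumption about the kind of concatenation that fails, not a trace of what buffered_transport.rb does at line 59: it joins a binary string that already holds a non-ASCII byte with a UTF-8 string that also holds non-ASCII characters, which Ruby 1.9+ refuses to do.

```ruby
# A binary (ASCII-8BIT) string that already contains a non-ASCII byte,
# standing in for a write buffer with previously packed data.
buffer = [0xFF].pack('C')

# A UTF-8 string with a non-ASCII character, like the value in the failing spec.
utf8 = "Thorbj\u00F8rn"

# Joining the two raises Encoding::CompatibilityError.
error = begin
  buffer + utf8
  nil
rescue Encoding::CompatibilityError => e
  e
end

# An ASCII-only UTF-8 string is compatible, which is why "Vincent" passes.
ascii_ok = buffer + 'Vincent'

# Relabelling a copy of the string as binary lets the concatenation succeed.
fixed = buffer + utf8.dup.force_encoding(Encoding::BINARY)
```

This matches the behaviour in the report, where the ASCII-only value is saved and only the value containing 'ø' fails.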