[ https://issues.apache.org/jira/browse/AVRO-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thiruvalluvan M. G. resolved AVRO-1190.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.9.0

Merged the Pull Request

> C++ json parser fails to decode multibyte unicode code points
> -------------------------------------------------------------
>
>                 Key: AVRO-1190
>                 URL: https://issues.apache.org/jira/browse/AVRO-1190
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: c++
>    Affects Versions: 1.7.0
>            Reporter: Keh-Li Sheng
>            Priority: Major
>             Fix For: 1.9.0
>
>
> The parser in JsonIO.cc does not handle decoding a multibyte unicode
> character into any kind of valid character encoding for a std::string in c++.
> The following snippet from JsonParser::tryString() has several flaws:
> 1. sv is a std::string used as a vector, where each unit is a char
> 2. a single unicode hex quad encoded in JSON can represent a 16-bit value
> 3. a unicode hex quad can represent a "high surrogate" character meaning that
>    it must be combined with the following quad to derive the full unicode code
>    point
> 4. \U is not a valid unicode escape for JSON (see
>    http://www.ietf.org/rfc/rfc4627.txt)
> {code:title=JsonIO.cc}
>     case 'u':
>     case 'U':
>         {
>             unsigned int n = 0;
>             char e[4];
>             in_.readBytes(reinterpret_cast<uint8_t*>(e), 4);
>             for (int i = 0; i < 4; i++) {
>                 n *= 16;
>                 char c = e[i];
>                 if (isdigit(c)) {
>                     n += c - '0';
>                 } else if (c >= 'a' && c <= 'f') {
>                     n += c - 'a' + 10;
>                 } else if (c >= 'A' && c <= 'F') {
>                     n += c - 'A' + 10;
>                 } else {
>                     throw unexpected(c);
>                 }
>             }
>             sv.push_back(n);
>         }
> {code}
> This code loop creates a temporary int then decodes the quad into it and then
> simply pushes the int (which may be a 16-bit value) onto the std::string.
> This essentially means that the JSON parser does not decode any unicode
> characters. For example, this JSON string:
> {noformat}
> "Dress up if you dare! Free cover all night! \uD83C\uDF83\uD83D\uDC7B"
> {noformat}
> results in a decoded byte sequence for the last 4 characters:
> {noformat}
> 3C 83 3D 7B 00
> {noformat}
> where you can see that it simply drops the high order bytes. In this
> particular example, \uD83C is a high-surrogate character which requires some
> additional handling. I am not sure what users of the c++ library expect the
> encoding to be, but given that we are working with json and given that avro
> c++ uses char instead of wchar, I would assume users would expect a UTF-8
> encoded string. However, I could be wrong. There are many examples of
> decoders that handle this string properly - I found this one helpful while
> implementing a fix: http://rishida.net/tools/conversion/
> For basics on UTF-8 http://www.utf-8.com/

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
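
For readers following the report, the handling it asks for (combine a high surrogate with the following low surrogate, then emit UTF-8) can be sketched roughly as below. This is an illustrative sketch only, not the patch that was merged for AVRO-1190. The function names `parseHexQuad`, `appendUtf8`, and `decodeUnicodeEscapes` are hypothetical, the input is assumed to be an in-memory string rather than Avro's stream reader, and UTF-8 output is assumed because the reporter suggests it is what users would expect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse exactly four hex digits starting at s[pos].
uint32_t parseHexQuad(const std::string& s, std::size_t pos) {
    if (pos + 4 > s.size()) throw std::runtime_error("truncated \\u escape");
    uint32_t n = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const char c = s[pos + i];
        n *= 16;
        if (c >= '0' && c <= '9')      n += c - '0';
        else if (c >= 'a' && c <= 'f') n += c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') n += c - 'A' + 10;
        else throw std::runtime_error("invalid hex digit in \\u escape");
    }
    return n;
}

// Hypothetical helper: append one Unicode code point to out as UTF-8 bytes.
void appendUtf8(std::string& out, uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Decodes only lowercase \uXXXX escapes; every other byte is copied through.
// A high surrogate must be followed by a \uXXXX low surrogate.
std::string decodeUnicodeEscapes(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] == 'u') {
            uint32_t cp = parseHexQuad(in, i + 2);
            i += 6;
            if (cp >= 0xD800 && cp <= 0xDBFF) {
                if (i + 1 >= in.size() || in[i] != '\\' || in[i + 1] != 'u')
                    throw std::runtime_error("high surrogate without low surrogate");
                const uint32_t lo = parseHexQuad(in, i + 2);
                if (lo < 0xDC00 || lo > 0xDFFF)
                    throw std::runtime_error("invalid low surrogate");
                i += 6;
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
                throw std::runtime_error("unpaired low surrogate");
            }
            appendUtf8(out, cp);
        } else {
            out.push_back(in[i++]);
        }
    }
    return out;
}
```

With this sketch, the two escaped pairs from the report's example, `\uD83C\uDF83` and `\uD83D\uDC7B`, combine to U+1F383 and U+1F47B and would each be emitted as four UTF-8 bytes rather than being truncated to a single char.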