[ https://issues.apache.org/jira/browse/AVRO-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254446#comment-14254446 ]
Thiruvalluvan M. G. commented on AVRO-1593:
-------------------------------------------

Looks simple enough. But constructing a locale object for every string is very expensive. On my machine, a quick microbenchmark shows that locale construction takes about 200 times longer than calling iscntrl() on a 20-character string.

Locale constructed inside the outer loop (once per simulated string):

{noformat}
$ cat l.cpp
#include <iostream>
#include <locale>

int main()
{
    for (size_t i = 0; i < 100000; ++i) {
        std::locale cl("C");
        for (size_t j = 0; j < 20; ++j) {
            std::iscntrl(j % 127 + 1);
        }
    }
}
$ make l
c++ l.cpp -o l
$ time ./l

real 0m6.272s
user 0m6.244s
sys 0m0.015s
{noformat}

Locale constructed once, outside the loops:

{noformat}
$ cat l.cpp
#include <iostream>
#include <locale>

int main()
{
    std::locale cl("C");
    for (size_t i = 0; i < 100000; ++i) {
        for (size_t j = 0; j < 20; ++j) {
            std::iscntrl(j % 127 + 1);
        }
    }
}
$ make l
c++ l.cpp -o l
$ time ./l

real 0m0.033s
user 0m0.027s
sys 0m0.003s
{noformat}

The current implementation will have a considerable impact on performance. But if we make the locale a member of the class (non-static or, even better, static), the impact on performance will be insignificant.

> C++ json encoder assumes "C" locale and generates invalid UTF-8 sequence
> -------------------------------------------------------------------------
>
>                 Key: AVRO-1593
>                 URL: https://issues.apache.org/jira/browse/AVRO-1593
>             Project: Avro
>          Issue Type: Bug
>          Components: c++
>    Affects Versions: 1.7.7
>         Environment: windows-1252 encoding
>            Reporter: Hatem Helal
>            Priority: Critical
>             Fix For: 1.7.8
>
>
> encoding a multibyte UTF-8 code point such as:
> "\xEF\xBD\x81"
> Incorrectly becomes:
> "\xEF\xBD\U0081"
> When encoded in the service running in the windows-1252 locale. This isn't a
> valid UTF-8 sequence so we end up with Mojibake when reading back the JSON
> encoded string.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
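
For illustration only, the suggestion in the comment above (construct the "C" locale once and reuse it for every string) might look roughly like the sketch below. The class name `JsonEscaper` and its `escape` function are hypothetical and are not taken from the Avro sources or from the attached patch; the sketch also escapes every control character as `\u00XX`, which is a simplification. It assumes a C++11 compiler for `std::snprintf`.

```cpp
#include <cstdio>
#include <locale>
#include <string>

// Hypothetical helper for illustration; not the actual Avro JSON generator.
class JsonEscaper {
public:
    // Escapes control characters as \u00XX and copies all other bytes
    // unchanged, so multibyte UTF-8 sequences pass through intact.
    static std::string escape(const std::string& in) {
        std::string out;
        out.reserve(in.size());
        for (std::string::size_type i = 0; i < in.size(); ++i) {
            const char c = in[i];
            // Classify against the shared "C" locale, not the global locale.
            if (std::iscntrl(c, cLocale())) {
                char buf[7];  // "\uXXXX" plus terminating NUL
                std::snprintf(buf, sizeof buf, "\\u%04x",
                              static_cast<unsigned>(static_cast<unsigned char>(c)));
                out += buf;
            } else {
                out += c;
            }
        }
        return out;
    }

private:
    // The locale is constructed once, on first use, and shared by all calls.
    static const std::locale& cLocale() {
        static const std::locale loc("C");
        return loc;
    }
};
```

With this arrangement, a call such as `JsonEscaper::escape("\xEF\xBD\x81")` pays only for the per-character classification, not for locale construction. `std::locale::classic()` would be another way to obtain the "C" locale without constructing a new object.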