Hmm, if cStringIO does choke on certain Unicode inputs, then we may want to change this for complete correctness. I believe cStringIO was chosen for performance reasons, so if we do make a change, we should take care that it doesn't introduce a significant regression in serialization performance.
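
A minimal sketch of one way to sidestep the problem, assuming the trouble comes from unicode objects reaching a byte-oriented buffer without being encoded first. Here io.BytesIO stands in for cStringIO so the snippet runs on both Python 2 and 3, and write_string / read_string are hypothetical helpers for illustration, not Thrift's actual API. The length-prefixed layout is only loosely modeled on a binary protocol.

```python
import io
import struct


def write_string(buf, value):
    """Hypothetical helper: write a length-prefixed UTF-8 string to a byte buffer."""
    # Encode text to UTF-8 bytes first, so the buffer only ever sees bytes.
    if not isinstance(value, bytes):
        value = value.encode("utf-8")
    # 4-byte big-endian length, followed by the encoded payload.
    buf.write(struct.pack("!i", len(value)))
    buf.write(value)


def read_string(buf):
    """Hypothetical helper: read back a string written by write_string."""
    (length,) = struct.unpack("!i", buf.read(4))
    return buf.read(length).decode("utf-8")


buf = io.BytesIO()
text = u"\u0415\u043c\u0438\u043b"  # four Cyrillic letters, well outside ASCII
write_string(buf, text)
buf.seek(0)
roundtrip = read_string(buf)
```

Note that the length prefix counts encoded bytes rather than characters, so the two differ for any non-ASCII input. Whether encoding up front costs anything measurable in serialization would still need to be benchmarked.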
My personal experience with Python string handling has involved a lot of headaches. The distinction between the primitive unicode and str types is subtle, but it can cause a lot of weird foibles like this. Does anyone know if there's a JIRA issue open on this yet? It might be worth bumping this over to the -dev list to have some Python experts weigh in.

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Wednesday, January 21, 2009 11:00 AM
To: [email protected]
Subject: Re: UTF-8 with thrift python

I have moved fair bit of unicode data across thrift without problems. Python has historical issues with character/byte issues that it inherited from C. From what I hear, this is being aggressively addressed in P3, but moving to 3 will be a sllooow process as all major changes are.

On Wed, Jan 21, 2009 at 2:32 AM, Emil Kirichev <[email protected]> wrote:

> So my question is, does thrift really supports utf-8 (like the wiki
> says), that means all chars that can be represented, not just the
> ascii subset, or I am I missing something? Any user with that kind of
> a problem? I did not find anything on the subject on the internet, may
> be other languages (java, php) does not have that problem?
>

--
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
