On Wed, 8 Jun 2011 12:35:57 -0400 Paul Davis <[email protected]> wrote:

> On Wed, Jun 8, 2011 at 12:32 PM, MK <[email protected]> wrote:
> > Is there any intention to fix couch's handling of "unusual" unicode
> > characters? One of the "unusual" characters is the right single
> > quote (226,128,153), which is a valid utf8 character and also not
> > very "unusual" IMO.
> What version of CouchDB are you using and what does an actual request
> look like?

1.0.2, built a few weeks ago. I tried to replicate this simply, using
curl to PUT a copy of the request dumped from node, and that works okay.
I.e., yep, couch deals with the multi-byte character, and it is in the
stdout CSV decimal dump.

So I took the CSV decimal dump from couch in debug mode, turned it back
into bytes, and diff'd it against the request. The difference: the last
couple of bytes, such as the closing }, are not in the couch CSV dump,
which would make the JSON invalid. Otherwise it is identical to the curl
request, which goes through. Watching the transfer in wireshark,
however, couch does receive those last few bytes, so *it was not
truncated by me or node*. Go figure.

> A recent check on trunk shows both decoders handle your case fine:

I have no idea which decoders you are referring to. Anyway, for
posterity, here's the issue:

- Client sends utf8 data to node.
- Node passes the data on to couch via http (Content-Type is
  application/x-www-form-urlencoded, identical to that used by curl).
- Couch rejects the data with the multi-byte character; its CSV decimal
  dump is missing bytes that were in the transmission.

But even to me this sounds dubious, considering an identical request
from curl is fine. All I can say is that what makes a difference is a
switch in node containing:

    case "\u2019": rv += "'";

That's the last thing I do before the PUT. If I leave the multi-byte
character in, there's an issue.

MK
-- 
"Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
"The angel of history[...]is turned toward the past." (Walter Benjamin)
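The dump-and-diff step MK describes above can be sketched in node. This is a minimal illustration, not code from the thread: the function names, the sample JSON body, and the sample dump string are all mine. It turns a comma-separated decimal dump back into bytes and reports where it diverges from the request body that was actually sent.

```javascript
// Parse a decimal CSV dump such as "123,34,116" back into a Buffer.
function csvDumpToBuffer(dump) {
  return Buffer.from(dump.split(",").map(Number));
}

// Find the first byte position where two buffers diverge.
// Returns null if they are identical, otherwise { index, a, b }.
function firstDifference(a, b) {
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    if (a[i] !== b[i]) return { index: i, a: a[i], b: b[i] };
  }
  if (a.length === b.length) return null;
  return {
    index: len,
    a: len < a.length ? a[len] : null,
    b: len < b.length ? b[len] : null,
  };
}

// U+2019 (right single quote) encodes to the three bytes 226,128,153
// in UTF-8, matching the decimal values quoted in the thread.
const sent = Buffer.from('{"text":"it\u2019s"}', "utf8");

// Hypothetical dump in which the trailing bytes (here the closing })
// are missing, as described for the couch debug output.
const dumped = csvDumpToBuffer(
  "123,34,116,101,120,116,34,58,34,105,116,226,128,153,115,34"
);

console.log(firstDifference(sent, dumped)); // → { index: 16, a: 125, b: null }
```

Diffing at the byte level rather than as strings is the point here: a missing trailing byte is invisible in most editors but shows up immediately as a length mismatch.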
