For what it's worth, the underlying transports don't (read: shouldn't) care about the encoding of the payload. They just want a chunk of bytes. Is there an equivalent to "hey, I know this is probably a unicode or string object, but just give me the equivalent bytearray without transcoding anything"? If there is, we should be using that.
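To make the question concrete, here is a rough sketch (not the client's actual code) of what "just give me the bytes" could look like. It assumes Python 3 names, where text is `str` and raw data is `bytes`; on Python 2 the same idea applies with `unicode` and `str`. The helpers `to_wire_bytes` and `encode_json_utf8` are hypothetical names, and UTF-8 is an assumed choice: text has no byte form until some encoding is picked, so "without transcoding" can only really apply to data that is already bytes.

```python
import json


def to_wire_bytes(payload, encoding="utf-8"):
    """Return payload as bytes (hypothetical helper, not part of the riak client).

    Data that is already bytes passes through untouched; text is encoded
    exactly once with the given encoding (UTF-8 is an assumption here).
    """
    if isinstance(payload, (bytes, bytearray)):
        return bytes(payload)
    return payload.encode(encoding)


def encode_json_utf8(obj):
    """Serialize obj to JSON and hand back UTF-8 bytes for the transport."""
    # ensure_ascii=False keeps non-ASCII characters literal instead of
    # expanding each one into a six-character \uXXXX escape.
    return to_wire_bytes(json.dumps(obj, ensure_ascii=False))


doc = {"name": "Антон"}
escaped = json.dumps(doc).encode("ascii")  # default: ASCII-only output
compact = encode_json_utf8(doc)            # literal characters as UTF-8

# The UTF-8 form is shorter, and both decode back to the same object.
print(len(escaped), len(compact))
```

Something like `encode_json_utf8` could presumably be handed to `set_encoder('application/json', ...)` in place of the `functools.partial` in Adam's snippet quoted below, since it already returns bytes rather than a unicode object; whether both transports accept the result is untested here.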
On Fri, Feb 1, 2013 at 9:57 AM, Anton <[email protected]> wrote:
> Adam, you should be able to write to any transport if you first
> .encode('utf-8') the result there, right? ensure_ascii=False will feed
> you unicode objects (if and only if there's something non-ASCII in the
> input to .dumps). They of course will cause anything that attempts to
> coerce them to a string to go wrong, as it'll attempt to do that by
> encoding to ASCII.
>
> On 1 February 2013 16:45, Adam Lindsay <[email protected]> wrote:
>> Anton, Sean,
>>
>> Anton brings up a pretty interesting problem.
>>
>> At first, I thought it might be easy to remedy with:
>>
>> import json
>> import functools
>> antonjson = functools.partial(json.dumps, ensure_ascii=False)
>>
>> from riak import RiakClient
>> R = RiakClient()
>> R.set_encoder('application/json', antonjson)
>>
>> …however, upon testing this out, it's seems likely that the underlying
>> transport channels use the default encoding, 'ascii,' and choke on the 8-bit
>> data we now pass it, in socket.py (for the HTTP client) or
>> protobuf.internal.type_checkers (for PBC).
>>
>> Maybe that's a suitable hint for Anton's further investigation, but I'll try
>> to spend some time with it to see what I can find, as well.
>>
>> As to the OP's question: Yes, you've summarized the state of affairs quite
>> nicely. IMHO it was a reasonable default (you can't be sure other Riak
>> clients are as good as Python at 8-bit/Unicode!), but the underlying
>> implementation definitely shows a bug that (again, IMHO) should and can be
>> fixed.
>> --
>> Adam Lindsay
>>
>> On Friday, 1 February 2013 at 14:27, Sean Cribbs wrote:
>>
>> Anton,
>>
>> I don't see any reason why this can't be fixed. However, since I'm not
>> familiar with the specifics of the JSON implementation, I'll need
>> assistance. Please open an issue or pull-request on the Python client:
>> https://github.com/basho/riak-python-client/issues. We are open to
>> major, breaking changes for the next release.
>>
>> On Fri, Feb 1, 2013 at 8:06 AM, Anton <[email protected]> wrote:
>>
>> Let's talk python and Unicode (yey!)
>>
>> The objects that I want to store will have non-ASCII strings in them.
>> Potentially a lot. How much is a lot? "Very many millions" should be a
>> good estimate.
>>
>> Now, the default behaviour for storing a python object (ok, a dict of
>> stuff), using the PBC transport is to pass them to json and encode
>> them. I'm ok with that, I like JSON and the fact that I can read out
>> an object in JSON, using a browser, helps a lot. It's really great for
>> developing project-specific tools, say debugging tools.
>>
>> But here is where the fun part starts. The JSON encoder in python is
>> not a simple thing, and takes a lot of parameters. And by default it
>> works. So well that people rarely look at what's going on. When you
>> look at what's going on, however, things get more entertaining.
>>
>> The JSON encoder works on unicode objects, not strings. When you pass
>> it unicode objects, it's happy. When you pass it strings, it decodes
>> them, using a specified encoding. By default this is set to 'utf-8'
>> which makes everything quite ok. So far so good. However, there's
>> another option - 'ensure_ascii'. This is set to True by default and it
>> means that the JSON encoder will spew out an ASCII-encoded string.
>> That is, in the result, every unicode code-point is encoded as \u0123,
>> or a total of 6 bytes.
>>
>> Now, this is not good. For one, the JSON RFCs expect Unicode, encoded
>> using UTF-*. Also, even if much of the data will require 3bytes in
>> UTF-8, that's still only half the bytes that the python default would
>> take.
>>
>> Now, consider this elementary example. It already gives a significant
>> (in bytes) difference for a short string:
>> http://pastie.org/6011147
>>
>>
>> Please tell me I'm not going crazy and all this is the state of
>> affairs and it is, in fact, wrong and can/should be fixed.
>>
>> _______________________________________________
>> riak-users mailing list
>> [email protected]
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>>
>>
>> --
>> Sean Cribbs <[email protected]>
>> Software Engineer
>> Basho Technologies, Inc.
>> http://basho.com/
>>

--
Sean Cribbs <[email protected]>
Software Engineer
Basho Technologies, Inc.
http://basho.com/
