Adam, you should be able to write to any transport if you first
.encode('utf-8') the result there, right? ensure_ascii=False will feed
you unicode objects (if and only if there's something non-ASCII in the
input to .dumps). Those will, of course, break anything that tries to
coerce them to a str, because the coercion happens by implicitly
encoding to ASCII.

On 1 February 2013 16:45, Adam Lindsay wrote:
> Anton, Sean,
>
> Anton brings up a pretty interesting problem.
>
> At first, I thought it might be easy to remedy with:
>
> import json
> import functools
> antonjson = functools.partial(json.dumps, ensure_ascii=False)
>
> from riak import RiakClient
> R = RiakClient()
> R.set_encoder('application/json', antonjson)
>
> …however, upon testing this out, it seems likely that the underlying
> transport channels use the default encoding, 'ascii,' and choke on the 8-bit
> data we now pass it, in socket.py (for the HTTP client) or
> protobuf.internal.type_checkers (for PBC).
>
> Maybe that's a suitable hint for Anton's further investigation, but I'll try
> to spend some time with it to see what I can find, as well.
>
> As to the OP's question: Yes, you've summarized the state of affairs quite
> nicely. IMHO it was a reasonable default (you can't be sure other Riak
> clients are as good as Python at 8-bit/Unicode!), but the underlying
> implementation definitely shows a bug that (again, IMHO) should and can be
> fixed.
> --
> Adam Lindsay
>
> On Friday, 1 February 2013 at 14:27, Sean Cribbs wrote:
>
> > Anton,
> >
> > I don't see any reason why this can't be fixed. However, since I'm not
> > familiar with the specifics of the JSON implementation, I'll need
> > assistance. Please open an issue or pull-request on the Python client:
> > https://github.com/basho/riak-python-client/issues. We are open to
> > major, breaking changes for the next release.
> >
> > On Fri, Feb 1, 2013 at 8:06 AM, Anton wrote:
> >
> > > Let's talk python and Unicode (yey!)
> > >
> > > The objects that I want to store will have non-ASCII strings in them.
> > > Potentially a lot. How much is a lot? "Very many millions" should be a
> > > good estimate.
> > >
> > > Now, the default behaviour for storing a python object (ok, a dict of
> > > stuff), using the PBC transport is to pass them to json and encode
> > > them. I'm ok with that, I like JSON and the fact that I can read out
> > > an object in JSON, using a browser, helps a lot. It's really great for
> > > developing project-specific tools, say debugging tools.
> > >
> > > But here is where the fun part starts. The JSON encoder in python is
> > > not a simple thing, and takes a lot of parameters. And by default it
> > > works. So well that people rarely look at what's going on. When you
> > > look at what's going on, however, things get more entertaining.
> > >
> > > The JSON encoder works on unicode objects, not strings. When you pass
> > > it unicode objects, it's happy. When you pass it strings, it decodes
> > > them, using a specified encoding. By default this is set to 'utf-8'
> > > which makes everything quite ok. So far so good. However, there's
> > > another option - 'ensure_ascii'. This is set to True by default and it
> > > means that the JSON encoder will spew out an ASCII-encoded string.
> > > That is, in the result, every unicode code-point is encoded as \u0123,
> > > or a total of 6 bytes.
> > >
> > > Now, this is not good. For one, the JSON RFCs expect Unicode, encoded
> > > using UTF-*. Also, even if much of the data will require 3 bytes in
> > > UTF-8, that's still only half the bytes that the python default would
> > > take.
> > >
> > > Now, consider this elementary example. It already gives a significant
> > > (in bytes) difference for a short string:
> > > http://pastie.org/6011147
> > >
> > >
> > > Please tell me I'm not going crazy and all this is the state of
> > > affairs and it is, in fact, wrong and can/should be fixed.
> > >
> > > _______________________________________________
> > > riak-users mailing list
> > > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >
> >
> > --
> > Sean Cribbs
> > Software Engineer
> > Basho Technologies, Inc.
> > http://basho.com/
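
P.S. For anyone reading this in the archive, here is a minimal sketch of
the "encode to UTF-8 before handing it to the transport" idea from the top
of this message, together with the size difference Anton describes. It is
written for Python 3, which is an assumption on my part: the thread is
about Python 2, where json.dumps only returns a unicode object for
non-ASCII input. The name utf8_json_encoder and the sample dict are made
up for illustration, and nothing here has been tested against the riak
client itself.

```python
import json


def utf8_json_encoder(obj):
    """Hypothetical encoder: JSON without \\uXXXX escapes, as UTF-8 bytes."""
    # ensure_ascii=False keeps non-ASCII characters as-is in the JSON text.
    text = json.dumps(obj, ensure_ascii=False)
    # Encode explicitly so the transport only ever sees bytes and never
    # has to guess an encoding (which is where 'ascii' would creep in).
    return text.encode("utf-8")


# Illustrative sample data, not taken from the thread.
sample = {"city": "Zürich", "greeting": "Привет"}

# Default behaviour: every non-ASCII code point becomes a \uXXXX escape.
escaped = json.dumps(sample).encode("ascii")
# Explicit UTF-8: each of these characters takes 2 bytes instead of 6.
compact = utf8_json_encoder(sample)

print("ensure_ascii=True :", len(escaped), "bytes")
print("ensure_ascii=False:", len(compact), "bytes")

# Both forms decode back to the same object.
assert json.loads(escaped.decode("ascii")) == sample
assert json.loads(compact.decode("utf-8")) == sample
```

In principle a function like this could be registered the same way as in
Adam's snippet, i.e. R.set_encoder('application/json', utf8_json_encoder),
but whether both transports then accept the bytes is exactly the open
question here, so treat that as untested.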
