Anton, Sean,
Anton brings up a pretty interesting problem.
At first, I thought it might be easy to remedy with:

    import functools
    import json

    from riak import RiakClient

    # JSON encoder that leaves non-ASCII characters unescaped
    antonjson = functools.partial(json.dumps, ensure_ascii=False)

    R = RiakClient()
    R.set_encoder('application/json', antonjson)

…however, upon testing this out, it seems likely that the underlying
transport channels use the default encoding, 'ascii', and choke on the 8-bit
data we now pass them, in socket.py (for the HTTP client) or
protobuf.internal.type_checkers (for PBC).
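
One way to keep 8-bit data away from a transport that assumes ASCII might be
to hand it bytes that are already UTF-8 encoded. A minimal sketch, assuming
Python 3 (where json.dumps returns text); utf8_json_encoder is a hypothetical
helper name, and I have not verified that either transport accepts the result:

```python
import json


def utf8_json_encoder(obj):
    """Serialize obj to JSON without \\uXXXX escapes, as UTF-8 bytes.

    Hypothetical helper for illustration; not part of the Riak client.
    """
    text = json.dumps(obj, ensure_ascii=False)  # non-ASCII characters stay literal
    return text.encode("utf-8")  # explicit bytes, no implicit 'ascii' codec


payload = utf8_json_encoder({"name": "Антон"})
print(payload)
print(json.loads(payload.decode("utf-8")))
```

If that holds up, it could presumably be registered the same way as antonjson,
via R.set_encoder('application/json', utf8_json_encoder).
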
Maybe that's a suitable hint for Anton's further investigation, but I'll try to
spend some time with it to see what I can find, as well.
As to the OP's question: Yes, you've summarized the state of affairs quite
nicely. IMHO it was a reasonable default (you can't be sure other Riak clients
are as good as Python at 8-bit/Unicode!), but the underlying implementation
definitely shows a bug that (again, IMHO) should and can be fixed.

--
Adam Lindsay
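
P.S. For anyone who wants to see the size difference Anton describes, here is
a minimal sketch. It is not the contents of his paste; it assumes Python 3,
and the sample string is made up:

```python
import json

sample = {"greeting": "Привет"}  # made-up sample: six Cyrillic letters

escaped = json.dumps(sample)                      # default: ensure_ascii=True
literal = json.dumps(sample, ensure_ascii=False)  # characters kept as-is

escaped_bytes = escaped.encode("ascii")  # safe: the output is pure ASCII
literal_bytes = literal.encode("utf-8")  # each Cyrillic letter takes 2 bytes

print(len(escaped_bytes), len(literal_bytes))

# Both forms decode back to the same object.
assert json.loads(escaped_bytes.decode("ascii")) == sample
assert json.loads(literal_bytes.decode("utf-8")) == sample
```

With this sample the escaped form comes to 52 bytes and the UTF-8 form to 28,
since each Cyrillic letter costs 6 bytes as an escape but 2 in UTF-8.
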
On Friday, 1 February 2013 at 14:27, Sean Cribbs wrote:
> Anton,
>
> I don't see any reason why this can't be fixed. However, since I'm not
> familiar with the specifics of the JSON implementation, I'll need
> assistance. Please open an issue or pull-request on the Python client:
> https://github.com/basho/riak-python-client/issues. We are open to
> major, breaking changes for the next release.
>
> On Fri, Feb 1, 2013 at 8:06 AM, Anton <[email protected]> wrote:
> > Let's talk python and Unicode (yey!)
> >
> > The objects that I want to store will have non-ASCII strings in them.
> > Potentially a lot. How much is a lot? "Very many millions" should be a
> > good estimate.
> >
> > Now, the default behaviour for storing a python object (ok, a dict of
> > stuff), using the PBC transport is to pass them to json and encode
> > them. I'm ok with that, I like JSON and the fact that I can read out
> > an object in JSON, using a browser, helps a lot. It's really great for
> > developing project-specific tools, say debugging tools.
> >
> > But here is where the fun part starts. The JSON encoder in python is
> > not a simple thing, and takes a lot of parameters. And by default it
> > works. So well that people rarely look at what's going on. When you
> > look at what's going on, however, things get more entertaining.
> >
> > The JSON encoder works on unicode objects, not strings. When you pass
> > it unicode objects, it's happy. When you pass it strings, it decodes
> > them, using a specified encoding. By default this is set to 'utf-8'
> > which makes everything quite ok. So far so good. However, there's
> > another option - 'ensure_ascii'. This is set to True by default and it
> > means that the JSON encoder will spew out an ASCII-encoded string.
> > That is, in the result, every non-ASCII code point is escaped as
> > \uXXXX, a total of 6 bytes (12 for code points outside the BMP, which
> > need a surrogate pair).
> >
> > Now, this is not good. For one, the JSON RFCs expect Unicode, encoded
> > using UTF-*. Also, even if much of the data will require 3 bytes in
> > UTF-8, that's still only half the bytes that the python default would
> > take.
> >
> > Now, consider this elementary example. It already gives a significant
> > (in bytes) difference for a short string:
> > http://pastie.org/6011147
> >
> >
> > Please tell me I'm not going crazy and all this is the state of
> > affairs and it is, in fact, wrong and can/should be fixed.
> >
> > _______________________________________________
> > riak-users mailing list
> > [email protected]
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >
> --
> Sean Cribbs <[email protected]>
> Software Engineer
> Basho Technologies, Inc.
> http://basho.com/
>