On Sat, Jan 11, 2014 at 11:05:36AM -0800, Ethan Furman wrote:
> On 01/11/2014 10:36 AM, Steven D'Aprano wrote:
> >On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
> >>
> >> unicode to bytes
> >> bytes to unicode using latin1
> >> unicode to bytes
> >
> >Where do you get this from? I don't follow your logic. Start with a text
> >template:
> >
> >template = """\xDE\xAD\xBE\xEF
> >Name:\0\0\0%s
> >Age:\0\0\0\0%d
> >Data:\0\0\0%s
> >blah blah blah
> >"""
> >
> >data = template % ("George", 42, blob.decode('latin-1'))
Since the use-cases people have been speaking about include only ASCII
(or at most, Latin-1) text and arbitrary binary bytes, my example is
limited to showing only ASCII text. But it will work with any text data,
so long as you have a well-defined format that lets you tell which parts
are interpreted as text and which parts as binary data. If your file
format is not well-defined, then you have bigger problems than dealing
with text versus bytes.
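For concreteness, here is a minimal sketch of the whole round trip under Python 3. The blob contents are made up, and the template is a shortened version of the one quoted above; the only encode step happens at the very end.

```python
# Hypothetical blob standing in for arbitrary binary data.
blob = bytes([0x00, 0x7F, 0x80, 0xFF])

# Shortened version of the text template quoted above.
template = "\xDE\xAD\xBE\xEF\nName:\0\0\0%s\nAge:\0\0\0\0%d\nData:\0\0\0%s\n"

# Only the binary blob is decoded; the name and the age are ordinary values.
data = template % ("George", 42, blob.decode("latin-1"))

# Encode exactly once, when the record is about to be written out.
record = data.encode("latin-1")
```

The resulting record is a bytes object in which the blob appears byte for byte as it went in.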
> >Only the binary blobs need to be decoded. We don't need to encode the
> >template to bytes, and the textual data doesn't get encoded until we're
> >ready to send it across the wire or write it to disk.
>
> And what if your name field has data not representable in latin-1?
>
> --> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
> u'\u0441\u0440\u0403'
Where did you get those bytes from? You got them from somewhere. Who
knows? Who cares? Once you have bytes, you can treat them as a blob of
arbitrary bytes and write them to the record using the Latin-1 trick. If
you're reading those bytes from some stream that gives you bytes, you
don't have to care where they came from.
But what if you don't start with bytes? If you start with a bunch of
floats, you'll probably convert them to bytes using the struct module.
If you start with non-ASCII text, you have to convert it to bytes too.
No difference here.
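A sketch of the float case, assuming little-endian C doubles. The sample values and the "Floats:" field label are hypothetical, not part of any real format.

```python
import struct

values = (1.5, -2.25, 1e10)           # hypothetical sample data
packed = struct.pack("<3d", *values)  # three little-endian doubles, 24 bytes

# The packed bytes are just another blob, so the same Latin-1 trick applies.
field = "Floats:\0\0\0%s" % packed.decode("latin-1")
record = field.encode("latin-1")

# Reading the record back recovers the original floats.
unpacked = struct.unpack("<3d", record[-24:])
```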
You ask the user for their name, they answer "срЃ" which is given to you
as a Unicode string, and you want to include it in your data record. The
specifications of your file format aren't clear, so I'm going to assume
that:
1) ASCII text is allowed "as-is" (that is, the name "George" will be
in the final data file as b'George');
2) any non-ASCII text will be encoded using some fixed encoding
which we can choose to suit ourselves;
(if the encoding is fixed by the file format, then just use that)
3) arbitrary binary data is allowed "as-is" (i.e. byte N has to end up
being written as byte N, for any value of N between 0 and 255).
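Assumption 3 is the reason Latin-1 is the right codec for the job: it maps byte N to code point N for every N from 0 to 255, so decoding and re-encoding is lossless. A quick check, assuming Python 3:

```python
every_byte = bytes(range(256))

# Decoding as Latin-1 never fails: byte N becomes code point N.
as_text = every_byte.decode("latin-1")
code_points = [ord(ch) for ch in as_text]

# Encoding back to Latin-1 restores the original bytes exactly.
round_tripped = as_text.encode("latin-1")
```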
So, to write the ASCII name "George", we can just
"Name:\0\0\0%s" % "George"
since we know it is already ASCII. (It's a literal, so that's obvious.
But see below.) To write arbitrary binary data, we take the *bytes* and
decode to Latin-1:
blob = bunch_o_bytes() # Completely arbitrary.
"Data:\0\0\0%s" % blob.decode('latin-1')
Combine those two techniques to deal with non-ASCII names. First you
have to get the non-ASCII name converted to *arbitrary bytes*, so any
encoding that deals with the whole range of Unicode will do. Then you
decode those arbitrary bytes as Latin-1. Here I'll use UTF-32, just
because I can and I feel like being wasteful:
"Name:\0\0\0%s" % "срЃ".encode("utf-32be").decode("latin-1")
UTF-8 is a better choice, because it doesn't use as much space and
gives you something which looks like ASCII in a hex editor:
name = "George" if random.random() < 0.5 else "срЃ"
"Name:\0\0\0%s" % name.encode("utf-8").decode("latin-1")
If you don't know whether your name is pure ASCII, then you have to
encode first. Otherwise how do you know what bytes to use?
Aside: if this point is not *bleedingly obvious*, then you
need to read Joel on Software on Unicode RIGHT NOW.
http://www.joelonsoftware.com/articles/Unicode.html
If the name data happens to be pure ASCII, then encoding to UTF-8 and
decoding to Latin-1 ends up being a no-op:
py> "George".encode("utf-8").decode("latin-1")
'George'
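For a non-ASCII name the intermediate string looks like mojibake, but that is harmless, because the final Latin-1 encode turns it back into the UTF-8 bytes. A sketch, assuming Python 3 and UTF-8 as the chosen encoding:

```python
name = "\u0441\u0440\u0403"  # the name "срЃ" from the example

# Encode to UTF-8 bytes, then decode those bytes as Latin-1 text.
field = "Name:\0\0\0%s" % name.encode("utf-8").decode("latin-1")

# The final encode step yields the UTF-8 bytes of the name, untouched.
record = field.encode("latin-1")

# A reader who knows the format decodes the name field as UTF-8.
recovered = record[8:].decode("utf-8")
```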
Of course, if I know that the name is ASCII ahead of time (I wrote it as
a literal, so I think I would know...) then I can short-cut the whole
process and just do this:
"Name:\0\0\0%s" % name_which_is_guaranteed_to_be_ascii
If I screw up and insert a non-Latin-1 character, then when I eventually
write it to a file, it will give me a Unicode error, exactly as it
should.
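A sketch of that failure mode, assuming Python 3: the name is interpolated without being encoded first, and the mistake surfaces at the final encode step.

```python
# Mistake: non-Latin-1 text interpolated directly, with no encode step.
field = "Name:\0\0\0%s" % "\u0441\u0440\u0403"

error = None
try:
    field.encode("latin-1")
except UnicodeEncodeError as exc:
    # The codec refuses, and reports where the first offending character is.
    error = exc
```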
I've assumed that I can pick the encoding. That's rather like assuming
that, given a bunch of floats, I can pick whether to represent them as C
doubles or singles or something else, whatever suits my purposes. If I'm
dealing with some existing file format, it probably defines the
encoding, either explicitly or implicitly. When I don't have the choice
of encoding, but have to use some damned stupid legacy encoding that
only includes a fraction of Unicode, then:
name.encode("legacy encoding", errors="whatever")
will give me the bytes I need to use the Latin-1 trick on.
This whole thing can be wrapped in a tiny one-line helper function:
def bytify(text, encoding="utf-8", errors="ignore"):
    # pick your own appropriate encoding and error handler
    return text.encode(encoding, errors).decode('latin-1')
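A usage sketch for the helper, assuming Python 3. The helper is repeated so the example runs on its own, and cp1251 merely stands in for whatever legacy encoding a format might dictate.

```python
def bytify(text, encoding="utf-8", errors="ignore"):
    # Same helper as above: encode the text, then decode the bytes as Latin-1.
    return text.encode(encoding, errors).decode("latin-1")

name = "\u0441\u0440\u0403"  # the name "срЃ" from the example

# Default encoding (UTF-8).
utf8_field = "Name:\0\0\0%s" % bytify(name)

# A legacy encoding, with a strict error handler so problems are not hidden.
legacy_field = "Name:\0\0\0%s" % bytify(name, "cp1251", "strict")

# With the default errors="ignore", unencodable characters vanish silently.
lossy = bytify("Zo\u00eb", "ascii")
```

Note that the "ignore" handler drops data without warning, so "strict" or "replace" may be a safer default where silent loss matters.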
> --> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode characters in position
> 0-2: ordinal not in range(256)
That is backwards to what I've shown. Look at my earlier example again:
data = template % ("George", 42, blob.decode('latin-1'))
Bytes get DECODED to latin-1, not encoded.
Bytes -> text is *decoding*
Text -> bytes is *encoding*
> So really your example should be:
>
> data = template %
> ("George".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'),
> 42, blob.decode('latin-1'))
>
> Which is a mess.
Obviously it is stupid and wasteful to do that to a literal that you
know is ASCII. But if you don't know what the contents of the string
are, how do you know what bytes need to be written unless you encode to
bytes first?
--
Steven