On Sat, Jan 11, 2014 at 11:05:36AM -0800, Ethan Furman wrote:
> On 01/11/2014 10:36 AM, Steven D'Aprano wrote:
> >On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
> >>
> >>   unicode to bytes
> >>   bytes to unicode using latin1
> >>   unicode to bytes
> >
> >Where do you get this from? I don't follow your logic. Start with a text
> >template:
> >
> >template = """\xDE\xAD\xBE\xEF
> >Name:\0\0\0%s
> >Age:\0\0\0\0%d
> >Data:\0\0\0%s
> >blah blah blah
> >"""
> >
> >data = template % ("George", 42, blob.decode('latin-1'))

Since the use-cases people have been speaking about include only ASCII 
(or at most, Latin-1) text and arbitrary binary bytes, my example is 
limited to showing only ASCII text. But it will work with any text data, 
so long as you have a well-defined format that lets you tell which parts 
are interpreted as text and which parts as binary data. If your file 
format is not well-defined, then you have bigger problems than dealing 
with text versus bytes.
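
To make that concrete, here is a minimal sketch of the whole round trip in Python 3. The template is the one quoted above; the blob contents and the name `record` are made up for illustration.

```python
# Hypothetical binary payload, including bytes that are not valid ASCII.
blob = b"\x00\xff\x80binary\x01"

template = (
    "\xDE\xAD\xBE\xEF\n"
    "Name:\0\0\0%s\n"
    "Age:\0\0\0\0%d\n"
    "Data:\0\0\0%s\n"
    "blah blah blah\n"
)

# Work in text throughout; only the blob needs decoding.
data = template % ("George", 42, blob.decode("latin-1"))

# Encode exactly once, at the point of writing to disk or the wire.
record = data.encode("latin-1")

assert record.startswith(b"\xde\xad\xbe\xef\nName:\x00\x00\x00George\n")
assert blob in record
```

The magic number, the NUL padding and the blob all come out byte-for-byte as intended, because Latin-1 maps code points 0-255 to the same byte values.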


> >Only the binary blobs need to be decoded. We don't need to encode the
> >template to bytes, and the textual data doesn't get encoded until we're
> >ready to send it across the wire or write it to disk.
> 
> And what if your name field has data not representable in latin-1?
> 
> --> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
> u'\u0441\u0440\u0403'

Where did you get those bytes from? You got them from somewhere. Who 
knows? Who cares? Once you have bytes, you can treat them as a blob of 
arbitrary bytes and write them to the record using the Latin-1 trick. If 
you're reading those bytes from some stream that gives you bytes, you 
don't have to care where they came from.

But what if you don't start with bytes? If you start with a bunch of 
floats, you'll probably convert them to bytes using the struct module. 
If you start with non-ASCII text, you have to convert it to bytes too. 
No difference here.
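
As a sketch of the float case, assuming three values stored as big-endian C doubles (the values and the format string are my own choices, not part of any real file format):

```python
import struct

# Hypothetical measurements to store in the record.
values = [1.5, -2.25, 1e10]

# Pack as three big-endian C doubles, 8 bytes each.
packed = struct.pack(">3d", *values)

# From here on the floats are just a blob, so the Latin-1 trick applies.
field = "Data:\0\0\0%s" % packed.decode("latin-1")

# At write time the bytes come back out unchanged.
raw = field.encode("latin-1")
assert raw == b"Data:\x00\x00\x00" + packed
```

The non-ASCII text case has the same shape: replace the `struct.pack` call with a call to `str.encode`.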

You ask the user for their name, they answer "срЃ" which is given to you 
as a Unicode string, and you want to include it in your data record. The 
specifications of your file format aren't clear, so I'm going to assume 
that:

1) ASCII text is allowed "as-is" (that is, the name "George" will be 
   in the final data file as b'George');

2) any non-ASCII text will be encoded using some fixed encoding 
   which we can choose to suit ourselves;

   (if the encoding is fixed by the file format, then just use that)

3) arbitrary binary data is allowed "as-is" (i.e. byte N has to end up 
   being written as byte N, for any value of N between 0 and 255).
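
Assumption 3 is the reason Latin-1 is the right codec for the trick: it maps byte N to code point N and back, for every N. A quick check, as a sketch in Python 3:

```python
# Every possible byte value, in order.
all_bytes = bytes(range(256))

# Decoding as Latin-1 never fails: byte N becomes code point N.
as_text = all_bytes.decode("latin-1")
assert [ord(c) for c in as_text] == list(range(256))

# Encoding back gives the original bytes unchanged.
assert as_text.encode("latin-1") == all_bytes
```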


So, to write the ASCII name "George", we can just 

"Name:\0\0\0%s" % "George"

since we know it is already ASCII. (It's a literal, so that's obvious. 
But see below.) To write arbitrary binary data, we take the *bytes* and 
decode to Latin-1:

blob = bunch_o_bytes()  # Completely arbitrary.
"Data:\0\0\0%s" % blob.decode('latin-1')


Combine those two techniques to deal with non-ASCII names. First you 
have to get the non-ASCII name converted to *arbitrary bytes*, so any 
encoding that deals with the whole range of Unicode will do. Then you 
convert those arbitrary bytes into Latin-1. Here I'll use UTF-32, just 
because I can and I feel like being wasteful:

"Name:\0\0\0%s" % "срЃ".encode("utf-32be").decode("latin-1")

UTF-8 is a better choice, because it uses less space and leaves any 
ASCII text readable as ASCII in a hex editor:

name = "George" if random.random() < 0.5 else "срЃ"
"Name:\0\0\0%s" % name.encode("utf-8").decode("latin-1")

If you don't know whether your name is pure ASCII, then you have to 
encode first. Otherwise how do you know what bytes to use?
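
To put numbers on the two encodings used above, here is a small comparison. The non-ASCII name is written with escapes so the snippet does not depend on the source file's encoding.

```python
name = "\u0441\u0440\u0403"   # the three-letter Cyrillic name from above

utf32 = name.encode("utf-32be")
utf8 = name.encode("utf-8")

# Four bytes per character versus two for these particular letters.
assert len(utf32) == 12
assert len(utf8) == 6

# For an ASCII name, UTF-8 gives the plain ASCII bytes, while UTF-32
# pads every character with three NUL bytes.
assert "George".encode("utf-8") == b"George"
assert "George".encode("utf-32be")[:8] == b"\x00\x00\x00G\x00\x00\x00e"
```

Either set of bytes survives the decode to Latin-1 and the later encode; the choice only affects size and how the record looks in a hex editor.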

    Aside: if this point is not *bleedingly obvious*, then you 
    need to read Joel on Software on Unicode RIGHT NOW. 

    http://www.joelonsoftware.com/articles/Unicode.html


If the name data happens to be pure ASCII, then encoding to UTF-8 and 
decoding to Latin-1 ends up being a no-op:

py> "George".encode("utf-8").decode("latin-1")
'George'


Of course, if I know that the name is ASCII ahead of time (I wrote it as 
a literal, so I think I would know...) then I can short-cut the whole 
process and just do this:

"Name:\0\0\0%s" % name_which_is_guaranteed_to_be_ascii


If I screw up and insert a non-Latin-1 character, then when I eventually 
encode the record to Latin-1 to write it to a file, I will get a 
UnicodeEncodeError, exactly as I should.
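
A sketch of that failure and its fix, using a plain `.encode("latin-1")` to stand in for the final write (the variable names are illustrative):

```python
# Mistake: a non-Latin-1 name inserted directly, without encoding first.
bad = "Name:\0\0\0%s" % "\u0441\u0440\u0403"

try:
    bad.encode("latin-1")
except UnicodeEncodeError as exc:
    caught = exc      # the bug surfaces here instead of corrupting the file
else:
    caught = None

# Fix: encode to bytes first, then apply the Latin-1 trick.
good = "Name:\0\0\0%s" % "\u0441\u0440\u0403".encode("utf-8").decode("latin-1")
raw = good.encode("latin-1")

assert caught is not None
assert raw == b"Name:\x00\x00\x00\xd1\x81\xd1\x80\xd0\x83"
```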


I've assumed that I can pick the encoding. That's rather like assuming 
that, given a bunch of floats, I can pick whether to represent them as C 
doubles or singles or something else, whatever suits my purposes. If I'm 
dealing with some existing file format, it probably defines the 
encoding, either explicitly or implicitly. When I don't have the choice 
of encoding, but have to use some damned stupid legacy encoding that 
only includes a fraction of Unicode, then:

name.encode("legacy encoding", errors="whatever")

will give me the bytes I need to use the Latin-1 trick on.
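
For example, as a sketch only: here cp1251 stands in for "legacy encoding" and "replace" for "whatever". Both are my assumptions; substitute whatever your file format actually requires.

```python
# Three Cyrillic letters plus one CJK character that cp1251 cannot represent.
name = "\u0441\u0440\u0403\u65e5"

# The Cyrillic letters fit in cp1251; the "replace" handler substitutes
# b"?" for the character that does not.
legacy = name.encode("cp1251", errors="replace")

# Now the Latin-1 trick works exactly as for any other blob.
field = "Name:\0\0\0%s" % legacy.decode("latin-1")
raw = field.encode("latin-1")

assert raw == b"Name:\x00\x00\x00" + legacy
```

With `errors="strict"` the same call would raise UnicodeEncodeError instead, which may be preferable if silently losing a character is unacceptable.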

This whole thing can be wrapped in a tiny one-line helper function:

def bytify(text, encoding="utf-8", errors="ignore"):
    # pick your own appropriate encoding and error handler
    return text.encode(encoding, errors).decode('latin-1')
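
A short usage sketch. The helper is repeated so the snippet runs on its own; the field layout is illustrative.

```python
def bytify(text, encoding="utf-8", errors="ignore"):
    # Same helper as above, repeated so this sketch is self-contained.
    return text.encode(encoding, errors).decode("latin-1")

fields = [
    "Name:\0\0\0%s" % bytify("\u0441\u0440\u0403"),
    "Age:\0\0\0\0%d" % 42,
]

# One encode at the very end produces the bytes for the file.
record = "\n".join(fields).encode("latin-1")
assert record == b"Name:\x00\x00\x00\xd1\x81\xd1\x80\xd0\x83\nAge:\x00\x00\x00\x0042"

# Note what the "ignore" handler does with a restrictive encoding:
# characters that cannot be encoded are silently dropped.
assert bytify("caf\u00e9", encoding="ascii") == "caf"
```

Whether silently dropping characters is acceptable depends on the application, which is why the error handler is a parameter.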



> --> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode characters in position 
> 0-2: ordinal not in range(256)

That is backwards to what I've shown. Look at my earlier example again:

data = template % ("George", 42, blob.decode('latin-1'))

Bytes get DECODED to latin-1, not encoded.

Bytes -> text is *decoding*
Text -> bytes is *encoding*
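
In Python 3 the two directions are hard to mix up, because each type only has the method that makes sense for it. A minimal sketch, using the bytes from your example:

```python
raw = b"\xd1\x81\xd1\x80\xd0\x83"

# Bytes -> text: decoding. With Latin-1 this cannot fail.
text = raw.decode("latin-1")
assert isinstance(text, str) and len(text) == 6

# Text -> bytes: encoding. This recovers the original bytes.
assert text.encode("latin-1") == raw

# bytes objects have no .encode() and str objects have no .decode().
assert not hasattr(raw, "encode")
assert not hasattr(text, "decode")
```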


> So really your example should be:
> 
> data = template % 
> ("George".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'), 
> 42, blob.decode('latin-1'))
> 
> Which is a mess.

Obviously it is stupid and wasteful to do that to a literal that you 
know is ASCII. But if you don't know what the contents of the string 
are, how do you know what bytes need to be written unless you encode to 
bytes first?



-- 
Steven
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev