On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
> On 01/11/2014 07:38 AM, Steven D'Aprano wrote:
> >
> >The point that I am making is that many people want to add formatting
> >operations to bytes so they can put ASCII strings inside bytes. But (as
> >far as I can tell) they don't need to do this, because they can treat
> >Unicode strings containing code points U+0000 through U+00FF (i.e. the
> >same range as handled by Latin-1) as if they were bytes.
>
> So instead of blurring the line between bytes and text, you're blurring the
> line between text and bytes (with a few extra seat belts thrown in).

I'm not blurring anything. The people who designed the file format that
mixes textual data and binary data did the blurring. Given that such
formats exist, it is inevitable that we need to put text into bytes, or
bytes into text. The situation is already blurred, we just have to
decide how to handle it.

There are three broad strategies:

1) Make bytes more string-like, so that we can process our data as
   bytes, but still do string operations on the bits that are ASCII.

2) Make strings more byte-like, so that we can process our data as
   strings, but do byte operations (like bit mask operations) on the
   parts that are binary data.

3) Don't do either. Keep the text parts of your data as text, and the
   binary parts of your data as bytes. Do your text operations on text,
   and your byte operations on bytes.

At some point, of course, they need to be combined. We have a choice:

* Right now, we can use text as the base, and combine bytes into the
  text using Latin-1, and it Just Works.

* Or we can wait until (maybe) Python 3.5, when (perhaps) bytes objects
  will be more text-like, and then use bytes as the base, and (with
  luck) it Should Just Work.

There's another disadvantage with the second: treating bytes as if they
were ASCII by default reinforces the same old harmful paradigm that text
is ASCII that we're trying to get away from. That's a bad, painful idea
that causes a lot of problems and buggy code, and should be resisted.

On the other hand, embedding arbitrary binary data in Unicode text
doesn't reinforce any common or harmful paradigms. It just requires the
programmer to forget about characters and concentrate on code points,
since Latin-1 maps bytes to code points in a very convenient way:

    Byte 0x00 maps to code point U+0000
    Byte 0x01 maps to code point U+0001
    Byte 0x02 maps to code point U+0002
    ...
    Byte 0xFF maps to code point U+00FF

So to embed the binary data 0xDEADBEEF in your string, you can just use
'\xDE\xAD\xBE\xEF' regardless of what characters those code points
happen to be.

If we are manipulating data *as if it were text*, then we ought to treat
it as text, not add methods to bytes that make bytes text-like.

If we are manipulating data *as if it were bytes*, doing
byte-manipulation operations like bit-masking, then we ought to treat it
as numeric bytes, not add numeric methods to text.

Is that really a controversial opinion?


> Besides being a bit awkward, this also means that any encoded text (even
> the plain ASCII stuff) is now being transformed three times instead of one:
>
>   unicode to bytes
>   bytes to unicode using latin1
>   unicode to bytes

Where do you get this from? I don't follow your logic. Start with a text
template:

    template = """\xDE\xAD\xBE\xEF
    Name:\0\0\0%s
    Age:\0\0\0\0%d
    Data:\0\0\0%s
    blah blah blah
    """

    data = template % ("George", 42, blob.decode('latin-1'))

Only the binary blobs need to be decoded. We don't need to encode the
template to bytes, and the textual data doesn't get encoded until we're
ready to send it across the wire or write it to disk. And when we do,
since all the code points are in the range U+0000 to U+00FF, encoding it
to Latin-1 ought to be a fast, efficient operation, possibly even just a
mem copy.

It's true that the individual binary data fields will need to be decoded
from bytes, but unless you want Python to guess an encoding (which is
the old broken Python 2 model), you're going to have to do that
regardless.


> Even if the cost of moving those bytes around is cheap, it's not free.
> When you're creating hundreds of PDFs at a time that's going to make a
> difference.

You've profiled it? Unless you've measured it, it doesn't exist. I'm not
going to debate performance penalties of code you haven't written yet.

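For what it's worth, the round trip described above can be sketched in a
few lines. This is only an illustration under stated assumptions: the
four-byte `blob` is made-up sample data, and the template is a shortened,
single-line-per-field version of the one in this message, not any real
file format.

```python
# Hypothetical binary field; any byte values 0x00-0xFF would do.
blob = bytes([0x00, 0x7F, 0x80, 0xFF])

# Latin-1 maps byte N to code point U+00NN, so decoding loses nothing.
as_text = blob.decode('latin-1')
assert [ord(c) for c in as_text] == list(blob)

# Text template holding both "magic number" code points and ASCII labels.
template = "\xDE\xAD\xBE\xEFName:\0\0\0%s\nAge:\0\0\0\0%d\nData:\0\0\0%s\n"
data = template % ("George", 42, as_text)

# Encode exactly once, at the boundary (disk or wire).
wire = data.encode('latin-1')

# Every byte value survives the decode/encode round trip unchanged.
all_bytes = bytes(range(256))
assert all_bytes.decode('latin-1').encode('latin-1') == all_bytes
```

Note that the final `encode('latin-1')` raises UnicodeEncodeError if any
textual field contains a code point above U+00FF, so this sketch only
holds for fields that fit in that range.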
-- 
Steven