On Mon, Jan 13, 2014 at 07:31:16AM +0900, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
>
>  > then the name is horribly misleading, and it is best handled like this:
>  >
>  >     content = '\n'.join([
>  >         'header',
>  >         'part 2 %.3f' % number,
>  >         binary_image_data.decode('latin-1'),
>  >         utf16_string,  # Misleading name, actually Unicode string
>  >         'trailer'])
>
> This loses bigtime, as any encoding that can handle non-latin1 in
> utf16_string will corrupt binary_image_data.  OTOH, latin1 will raise
> on non-latin1 characters.  utf16_string must be encoded appropriately
> then decoded by latin1 to be reencoded by latin1 on output.

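The two failure modes described in the quoted text can be reproduced with a minimal sketch. The payload bytes and the sample string below are made-up stand-ins for `binary_image_data` and `utf16_string`; nothing here is taken from real code in the thread.

```python
# Hypothetical stand-ins for the names used in the quoted sketch.
binary_image_data = bytes([0x89, 0x50, 0xFF, 0x00, 0x80])
utf16_string = 'price: \u20ac5'  # U+20AC (euro sign) is outside Latin-1

content = '\n'.join([
    'header',
    binary_image_data.decode('latin-1'),
    utf16_string,
    'trailer',
])

# Encoding with Latin-1 fails because of the non-Latin-1 character.
try:
    content.encode('latin-1')
    latin1_raised = False
except UnicodeEncodeError:
    latin1_raised = True

# Encoding with UTF-8 succeeds, but every smuggled byte above 127 is
# written as a two-byte sequence, so the raw payload is no longer
# present verbatim in the output.
utf8_bytes = content.encode('utf-8')
image_survives_verbatim = binary_image_data in utf8_bytes

print(latin1_raised, image_survives_verbatim)
```
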
Of course you're right, but I understood the above as a sketch rather
than real code. (E.g. does "header" really mean the literal string
"header", or does it stand in for something which is a header?) In real
code, one would need some way of telling where the binary image data
ends and the Unicode string begins.

If I have misunderstood the situation, then my apologies for compounding
the error.

[...]

>  > Both examples assume that you intend to do further processing of content
>  > before sending it, and will encode just before sending:
>  >
>  >     content.encode('utf-8')
>  >
>  > (Don't use Latin-1, since it cannot handle the full range of text
>  > characters.)
>
> This corrupts binary_image_data.  Each byte > 127 will be replaced by
> two bytes.

And reading it back using decode('utf-8') will replace those two bytes
with a single character, which encode('latin-1') turns back into the
original byte, so the data round-trips exactly.

Of course if you encode to UTF-8 and then try to read the binary data as
raw bytes, you'll get corrupted data. But do people expect to do this?
That's a genuine question: I assumed (apparently wrongly) that the
idea was to write the content out as *text* containing smuggled bytes,
and read it back the same way.

> In the second case, you can use latin1 to encode, if it
> gives you what you want.
>
> This kind of subtlety is precisely why MAL warned about use of latin1
> to smuggle bytes.

How would you smuggle a chunk of arbitrary bytes into a text string,
short of doing something like uuencoding it into ASCII, or equivalent?


-- 
Steven
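A minimal sketch of the two approaches discussed above, assuming a made-up payload that covers every byte value. It shows the Latin-1 smuggling round trip through UTF-8 text, the corruption seen when the same output is read as raw bytes, and base64 as one possible ASCII-armoring equivalent of uuencoding. The variable names are illustrative only.

```python
import base64

# Hypothetical payload: one of each possible byte value.
binary_image_data = bytes(range(256))

# Latin-1 smuggling: each byte becomes the code point of the same value.
smuggled = binary_image_data.decode('latin-1')

# Written out as UTF-8 text and read back as UTF-8 text, the string
# round-trips; encoding it with Latin-1 then recovers the original bytes.
on_the_wire = smuggled.encode('utf-8')
recovered = on_the_wire.decode('utf-8').encode('latin-1')

# Reading the same UTF-8 output as raw bytes does not return the payload,
# since each byte above 127 was written as two bytes.
raw_read_matches = (on_the_wire == binary_image_data)

# ASCII armoring: the payload becomes plain ASCII text that survives any
# ASCII-compatible text encoding, at the cost of roughly 33% more size.
armored = base64.b64encode(binary_image_data).decode('ascii')
unarmored = base64.b64decode(armored)

print(recovered == binary_image_data, raw_read_matches,
      unarmored == binary_image_data)
```

Either approach still needs some framing (a length prefix or delimiter) to mark where the smuggled bytes end and ordinary text begins; the sketch leaves that out.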