On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
> On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano <st...@pearwood.info> wrote:
>
> > On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
> >
> > > AFAIK (and just for the record), there could be both Latin1 text and
> > > UTF-16 in a PDF (and other encodings too), depending on the font used:
> > [...]
> > > In Python2, txt is just a str, but in Python3 handling everything as
> > > latin1 string obviously doesn't work for TTF in this case.
> >
> > Nobody is suggesting that you use Latin-1 for *everything*. We're
> > suggesting that you use it for blobs of binary data that represent
> > arbitrary bytes. First you have to get your binary data in the first
> > place, using whatever technique is necessary.
>
>
> Just to check I understood what you are saying. Instead of writing:
>
>     content = b'\n'.join([
>         b'header',
>         b'part 2 %.3f' % number,
>         binary_image_data,
>         utf16_string.encode('utf-16be'),
>         b'trailer'])

Which doesn't work, since bytes don't support %f in Python 3.

> it should now look like:
>
>     content = '\n'.join([
>         'header',
>         'part 2 %.3f' % number,
>         binary_image_data.decode('latin-1'),
>         utf16_string.encode('utf-16be').decode('latin-1'),
>         'trailer']).encode('latin-1')
>
> Correct?

Not quite as you show. First, "utf16_string" confuses me. What is it? If
it is a Unicode string, i.e.:

    # Python 3 semantics
    type(utf16_string)
    => returns str

then the name is horribly misleading, and it is best handled like this:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string,  # Misleading name, actually Unicode string
        'trailer'])

Note that since it's text, and content is text, there is no need to
encode then decode.

"UTF-16" is not another name for "Unicode". Unicode is a character set.
UTF-16 is just one of a number of different encodings which map the
0x110000 distinct Unicode characters (actually "code points", numbered
0 through 0x10FFFF) to bytes. UTF-16 is one possible way to implement
Unicode strings in memory, but not the only way. Python has used, or
does use, three distinct implementations:

1) UTF-16 in "narrow builds"
2) UTF-32 in "wide builds"
3) a hybrid approach starting in Python 3.3, where strings are stored
   as either:

   3a) Latin-1
   3b) UCS-2
   3c) UTF-32

   depending on the content of the string.

So calling an arbitrary string "utf16_string" is misleading or wrong.

On the other hand, if it is actually a bytes object which is the product
of UTF-16 encoding, i.e.:

    type(utf16_string)
    => returns bytes

and those bytes were generated by "some text".encode("utf-16"), then it
is already binary data and needs to be smuggled into the text string.
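
To make "smuggled" concrete, here is a minimal sketch. The sample values
are made up for illustration and do not come from the thread. The idea is
that decoding bytes with a codec that maps every byte value to exactly
one character loses nothing, so encoding again with the same codec gives
back the original bytes:

```python
# Hypothetical sample data, chosen only for illustration.
utf16_bytes = "caf\u00e9 \u20ac".encode("utf-16be")  # bytes produced by UTF-16 encoding
binary_image_data = bytes(range(256))                # every possible byte value once

# Decoding never fails: byte value N becomes the character with code point N.
smuggled = utf16_bytes.decode("latin-1")
as_text = binary_image_data.decode("latin-1")

# One character per byte, and re-encoding restores the identical bytes.
assert len(as_text) == len(binary_image_data)
assert as_text.encode("latin-1") == binary_image_data
assert smuggled.encode("latin-1") == utf16_bytes

# The original text is still recoverable from the smuggled form.
recovered = smuggled.encode("latin-1").decode("utf-16be")
```

The smuggled string is not meaningful text; it is only a carrier that
keeps the bytes intact while they travel inside a str.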
Latin-1 is good for that:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string.decode('latin-1'),
        'trailer'])

Both examples assume that you intend to do further processing of content
before sending it, and will encode just before sending:

    content.encode('latin-1')

(It has to be Latin-1 at this point: UTF-8 would expand each smuggled
byte above 0x7F into two bytes and corrupt the binary data. Since Latin-1
cannot handle the full range of text characters, any text outside that
range must already have been encoded to bytes and smuggled in, as in the
second example.)

If that's not the case, then perhaps this is better suited to what you
are doing:

    content = b'\n'.join([
        b'header',
        ('part 2 %.3f' % number).encode('ascii'),
        binary_image_data,  # already bytes
        utf16_string,  # already bytes
        b'trailer'])


-- 
Steven
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com