On Sat, 15 Jul 2017 04:10 am, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+pyt...@pearwood.info>:
>> On Fri, 14 Jul 2017 11:31 pm, Marko Rauhamaa wrote:
> [...]
>>> As it stands, we have
>>>
>>> è --[encode>-- Unicode --[reencode>-- UTF-8
>>
>> I can't even work out what you're trying to say here.
>
> I can tell, yet that doesn't prevent you from dismissing what I'm
> saying.
How am I dismissing it? I didn't reply to it except to say I don't
understand it! To me, it looks like gibberish, not even wrong, but rather
than say so I thought I'd give you the opportunity to explain what you
meant.

As the person attempting to communicate, any failure to do so is *your*
responsibility, not that of the reader. If you are discussing this in good
faith, rather than as a cheap points-scoring exercise, then please try to
explain what you mean.


>>> Why is one encoding format better than the other?
>>
>> It depends on what you're trying to do.
>>
>> If you want to minimize storage and transmission costs, and don't care
>> about random access into the string, then UTF-8 is likely the best
>> encoding, since it uses as little as one byte per code point, and in
>> practice with real-world text (at least for Europeans) it is rarely
>> more expensive than the alternatives.
>
> Python3's strings don't give me any better random access than UTF-8.

Say what? Of course they do.

Python 3 strings (since 3.3) are a compact form of UTF-32. Without loss of
generality, we can say that each string is an array of four-byte code
units. (In practice, depending on the string, Python may be able to
compact that to one- or two-byte code units.)

The critical thing is that slicing and indexing is a constant-time
operation: string[i] can just jump straight to offset i code units into
the array. If the code units are 4 bytes wide, that's just 4*i bytes.

UTF-8 is not: it is a variable-width encoding, so there's no way to tell
how many bytes it takes to get to string[i]. You have to start at the
beginning of the string and walk the bytes, counting code points, until
you reach the i-th code point.

It may be possible to trade memory for time by building an augmented data
structure that makes this easier. A naive example would be to have a
separate array giving the offsets of each code point. But then it's not a
string any more, it's a more complex data structure.
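To make the difference concrete, here's a rough sketch (nothing like
CPython's actual implementation, just an illustration of the access
pattern) of what finding the i-th code point in UTF-8 bytes involves:

```python
def utf8_index(data: bytes, i: int) -> str:
    """Return the i-th code point of UTF-8 encoded bytes.

    Continuation bytes look like 0b10xxxxxx, so we count only lead
    bytes. This is O(n) in the byte offset: there is no way to jump
    straight to code point i without scanning everything before it.
    """
    count = -1
    for pos, byte in enumerate(data):
        if byte & 0b11000000 != 0b10000000:  # not a continuation byte
            count += 1
            if count == i:
                # Collect the lead byte plus its continuation bytes.
                end = pos + 1
                while end < len(data) and data[end] & 0b11000000 == 0b10000000:
                    end += 1
                return data[pos:end].decode('utf-8')
    raise IndexError(i)

s = "caf\u00e8s"            # 5 code points, but 6 bytes in UTF-8
b = s.encode('utf-8')

print(s[3])                 # O(1): jump straight to code unit 3 -> 'è'
print(utf8_index(b, 3))     # O(n): must scan the bytes from the start -> 'è'
```

Both calls give the same answer; the difference is purely in the cost of
getting there.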
Go ignores this problem by simply not offering random access to code
points in strings. Go simply says that strings are bytes, and if string[i]
jumps into the middle of a character (code point), oh well, too bad, so
sad.

On the other hand, Go also offers a second solution to the problem. It's
essentially the same solution that Python offers: a dedicated fixed-width,
32-bit (four-byte) Unicode code point type, which they call "runes".


> Storage and transmission costs are not an issue.

I was giving a generic answer to a generic question. You asked a general
question, "Why is one encoding format better than the other?", and the
general answer to that is: *it depends on what you are trying to do*.

> It's only that storage and transmission are still defined in terms of
> bytes.

Again, I don't see what point you think you are making here.

Ultimately, all our data structures have to be implemented in memory which
is addressable in bytes. *All of them* -- objects, linked lists, floats,
BigInts, associative arrays, red-black trees, the lot. All of those data
structures are presented to the programmer in terms of higher-level
abstractions.

You seem to think that text strings alone don't need that higher-level
abstraction, and that the programmer ought to think about text in terms of
bytes. Why?

You entered this discussion with a reasonable position: the text
primitives offered to programmers fall short of what we'd like, which is
to deal with language in terms of language units: characters specifically.
(Let's assume we can decide what a character actually is.)

I agree! If Python's text strings are supposed to be an abstraction for
"strings of characters", it's a leaky abstraction. It's actually "strings
of code points".

Some people might have said: "Since Python strings fall short of the
abstraction we would like, we should build a better abstraction on top of
it, using Unicode primitives, that deals with characters (once we decide
what they are)."
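As a taste of what such an abstraction might look like, here's a naive
sketch that groups combining marks with their base code point. This is
only a rough approximation of real "characters": full grapheme cluster
segmentation (Unicode UAX #29) is considerably more involved, and the
standard library doesn't provide it.

```python
import unicodedata

def naive_characters(s: str) -> list:
    """Split a string into base-code-point + combining-mark groups.

    A crude stand-in for grapheme clusters: any code point with a
    nonzero combining class is attached to the preceding base.
    """
    chars = []
    for cp in s:
        if chars and unicodedata.combining(cp):
            chars[-1] += cp         # attach combining mark to its base
        else:
            chars.append(cp)
    return chars

# 'è' written in decomposed form: 'e' followed by U+0300 COMBINING
# GRAVE ACCENT -- two code points, one "character".
s = "cafe\u0300"
print(len(s))                   # 5 code points...
print(naive_characters(s))      # ...but only 4 character-like groups
```

Even this toy version shows why "string of code points" and "string of
characters" are different abstractions.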
which is where I thought you were going with this.

But instead, you've suggested that the solution to the problem "Python
strings don't come close enough to matching the programmer's expectations
about characters" is to move *further away* from the programmer's
expectations about characters and to have them reason about UTF-8 encoded
bytes instead.

And then, to insult our intelligence even further, after raising the
in-memory representation (UTF-8 versus some other encoding) to prominence,
you then repeatedly said that the in-memory representation doesn't matter!
If it doesn't matter, why do you care whether strings use UTF-8 or UTF-32
or something else?

> Python3's strings
> force you to encode/decode between strings and bytes for a
> yet-to-be-specified advantage.

That's simply wrong. You are never forced to encode/decode if you are
dealing with strings alone, or bytes alone. You only need to encode/decode
when converting between the two.

You don't even need to explicitly decode when dealing with file I/O.
Provided your files are correctly encoded, Python abstracts away the need
to decode and you can just read text out of a file. So your statement is
wrong.

>> It also has the advantage of being backwards compatible with ASCII, so
>> legacy applications that assume all characters are a single byte will
>> work if you use UTF-8 and limit yourself to the ASCII-compatible
>> subset of Unicode.
>
> UTF-8 is perfectly backward-compatible with ASCII.

No it isn't. ASCII is a 7-bit encoding: no valid ASCII data has the 8th
bit set. UTF-8 uses 8 bits. For example, π in UTF-8 takes two bytes,
\xcf\x80 in hex, which are:

    0b11001111 0b10000000

in binary. As you can see, the eighth bit is set in both of those bytes.

UTF-8 is only backwards compatible with ASCII if you limit yourself to the
ASCII subset of Unicode, i.e. the 128 values between U+0000 and U+007F.
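You can check those byte-level claims for yourself at the interactive
prompt:

```python
# π is U+03C0; in UTF-8 it encodes to the two bytes 0xCF 0x80.
pi = "\u03c0".encode('utf-8')
print(pi)                           # b'\xcf\x80'
print([bin(byte) for byte in pi])   # ['0b11001111', '0b10000000']

# Both bytes have the eighth (high) bit set, so this is not valid ASCII:
assert all(byte & 0x80 for byte in pi)

# Pure ASCII text, by contrast, encodes to identical bytes under both
# encodings -- that's the sense in which UTF-8 is ASCII-compatible:
text = "plain ASCII"
assert text.encode('ascii') == text.encode('utf-8')
```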
>> The disadvantage is that each code point can be one, two, three or
>> four bytes wide, and naively shuffling bytes around will invariably
>> give you invalid UTF-8 and cause data loss. So UTF-8 is not so good as
>> the in-memory representation of text strings.
>
> The in-memory representation is not an issue. It's the abstract
> semantics that are the issue.

What? You're asking about *encodings*. By definition, that means you're
talking about the in-memory representation.

Dear gods man, this is like you asking "Which makes for a better car:
gasoline, diesel, LPG, electric or hydrogen?" and then, when I start to
discuss the differences between the fuels, you say "I don't care about the
internal differences of the engines, I only care about the controls on the
dashboard".

Marko, it is times like this that I think you are trolling, and come
really close to just kill-filing you. You explicitly asked about
encodings, so I answered your question about encodings. For you to now say
that the encoding is irrelevant, well, just stop wasting my time.

I don't think you are discussing this in good faith. I think you are
arguing to win, no matter how incoherent your argument becomes, so long as
you "win" for some definition of winning. I don't have infinite patience
for that sort of behaviour.

> At the abstract level, we have the text in a human language. Neither
> strings nor UTF-8 provide that so we have to settle for something
> cruder. I have yet to hear why a string does a better job than UTF-8.

This is not even wrong. You are comparing a data structure (string) with a
mapping (UTF-8). They aren't alternatives that we get to choose between,
like "strings versus ropes" or "UTF-8 versus ISO-8859-3". They are
*complementary*, not alternatives: we can have strings of UTF-8 encoded
text, or strings of ISO-8859-3 bytes, or ropes of UTF-8 encoded text, or
ropes of ISO-8859-3 bytes.

To give an analogy, you're saying "I have yet to hear why cars do a better
job than electric motors."
> UTF-16 (used by Windows and Java, for example) is even worse than
> strings and UTF-8 because:
>
> è --[encode>-- Unicode --[reencode>-- UTF-16 --[reencode>-- bytes

Taken at face value, this doesn't make sense. It's just gibberish.


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list