On 5/1/2014 2:04 PM, Rustom Mody wrote:

Since it's Unicode-troll time, here's my contribution
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

I will not comment on the Unix-assumption part, but I think you go wrong with "Unicode is a Headache". The major headache is that unicode and its very few encodings are not universally used; the real headache is all the non-unicode legacy encodings still in use. So you'd better title that section 'Non-Unicode is a Headache'.
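
To make the headache concrete (a quick sketch; the byte values and code pages here are just illustrative):

    data = b'\xe9\xf1'              # two bytes, nothing says which encoding
    print(data.decode('latin-1'))   # Western European accented letters
    print(data.decode('cp1251'))    # entirely different Cyrillic letters
    print(data.decode('cp437'))     # different again on the old IBM PC code page
    # A wrong guess among legacy 8-bit encodings decodes silently to the wrong
    # text; with UTF-8 an invalid byte sequence at least fails loudly:
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as e:
        print(e)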

The first sentence is this misleading tautology: "With ASCII, data is ASCII whether its file, core, terminal, or network; ie "ABC" is 65,66,67." Let me translate: "If all text is ASCII encoded, then text data is ASCII, whether ..." But it was never the case that all text was ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC, and I believe it still uses the latter. Other mainframe makers used other encodings of A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never universal. You could just as well have said "With EBCDIC, data is EBCDIC, whether ..."

https://en.wikipedia.org/wiki/Ascii
https://en.wikipedia.org/wiki/EBCDIC
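
To put actual bytes on the 'ABC' point above (cp500 is one of Python's EBCDIC codecs; a minimal sketch):

    print(list('ABC'.encode('ascii')))   # [65, 66, 67]
    print(list('ABC'.encode('cp500')))   # [193, 194, 195] -- same text, different bytes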

A crucial step in the spread of ASCII was its use in microcomputers, including the IBM PC. The latter was considered a toy by the mainframe guys. If they had known that PCs would partly take over the computing world, they might have suggested or insisted that it use EBCDIC.

"With unicode there are:
    encodings"
where 'encodings' is linked to
https://en.wikipedia.org/wiki/Character_encodings_in_HTML

If html 'always' used utf-8 (like xml), as has become common but not universal, all of the problems with *non-unicode* character sets and encodings would disappear. The pre-unicode declarations could then disappear. More truthful: "without unicode there are 100s of encodings; with unicode, only 3 that we should worry about."
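
Those 3 are presumably UTF-8, UTF-16 and UTF-32, which all encode the same code points with different byte layouts (a small sketch, using the explicit little-endian variants to avoid BOMs):

    s = 'A\u03a9'                   # 'A' plus GREEK CAPITAL LETTER OMEGA
    print(s.encode('utf-8'))        # b'A\xce\xa9'
    print(s.encode('utf-16-le'))    # b'A\x00\xa9\x03'
    print(s.encode('utf-32-le'))    # b'A\x00\x00\x00\xa9\x03\x00\x00'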

"in-memory formats"

These are not the concern of the using programmer as long as they do not introduce bugs or limitations (as do all the languages stuck on UCS-2 and many using UTF-16, including old Python narrow builds). Using what should generally be the universal transmission format, UTF-8, as the internal format means losing indexing and slicing, slowing those operations from O(1) to O(len(string)), or adding an index table that is not part of the unicode standard. Using UTF-32 avoids the above but usually wastes space -- up to 75%.
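
A rough illustration of that trade-off, using encoded byte strings rather than any implementation's actual internal storage:

    s = 'abc' * 1000 + '\U0001F600'    # 3001 characters, the last one astral

    utf8 = s.encode('utf-8')           # compact, but finding s[i] needs a linear scan
    utf32 = s.encode('utf-32-le')      # character i is always bytes 4*i .. 4*i+4

    print(len(utf8), len(utf32))       # 3004 vs 12004 -- here ~75% of UTF-32 is zero bytes

    i = 3000
    print(utf32[4*i:4*i+4].decode('utf-32-le') == s[i])   # True: O(1) indexing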

"strange beasties like python's FSR"

Have you really let yourself be poisoned by JMF's bizarre rants? The FSR is an *internal optimization* that benefits most unicode operations that people actually perform. It uses UTF-32 by default but adapts to the strings users create by compressing the internal format. The compression is trivial -- simply dropping leading null bytes common to all characters in the string -- so each character is still readable as is. The string header records how many bytes per character are left. Is the idea of algorithms that adapt to inputs really strange to you?
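
You can watch the adaptation from Python itself; the exact byte counts depend on the build, but the per-character ratios are the point (CPython 3.3+):

    import sys
    print(sys.getsizeof('a' * 1000))           # header + ~1 byte per character
    print(sys.getsizeof('\u0100' * 1000))      # header + ~2 bytes per character
    print(sys.getsizeof('\U00010000' * 1000))  # header + ~4 bytes per character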

Like good adaptive algorithms, the FSR is invisible to the user except for reducing space or time or maybe both. Unicode operations are otherwise the same as with previous wide builds. People who used to use narrow builds also benefit from bug elimination. The only 'headaches' involved might have been those of the developers who optimized previous wide builds.

CPython has many other functions with special-case optimizations and 'fast paths' for common, simple cases. For instance, (some? all?) number operations are optimized for pairs of integers. Do you call these 'strange beasties'?
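
In pure-Python terms (an analogy only, not CPython's C internals), a fast path is just this pattern -- handle the common simple case on a shortcut and fall through to the general code otherwise:

    def join_pieces(pieces):
        # fast path: everything is already a str, so one C-level join suffices
        if all(type(p) is str for p in pieces):
            return ''.join(pieces)
        # general case: coerce each piece first
        return ''.join(str(p) for p in pieces)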

PyPy is faster than CPython, when it is, because it is even more adaptable to particular computations by creating new fast paths. The mechanism to create these 'strange beasties' might have been a headache for the writers, but when it works, which it now seems to, it is not for the users.

--
Terry Jan Reedy

