On 5/1/2014 2:04 PM, Rustom Mody wrote:

Since it's Unicode-troll time, here's my contribution
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

I will not comment on the Unix-assumption part, but I think you go wrong with "Unicode is a Headache". The major headache is that unicode and its very few encodings are not universally used; the real headache is all the non-unicode legacy encodings still in use. So you'd better title that section 'Non-Unicode is a Headache'.
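
To make the headache concrete (a quick sketch; the byte values and code pages here are just illustrative):

    data = b'\xe9\xf1'              # two bytes, nothing says which encoding
    print(data.decode('latin-1'))   # Western European accented letters
    print(data.decode('cp1251'))    # entirely different Cyrillic letters
    print(data.decode('cp437'))     # different again on the old IBM PC code page
    # A wrong guess among legacy 8-bit encodings decodes silently to the wrong
    # text; with UTF-8 an invalid byte sequence at least fails loudly:
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as e:
        print(e)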

The first sentence is this misleading tautology: "With ASCII, data is ASCII whether its file, core, terminal, or network; ie "ABC" is 65,66,67." Let me translate: "If all text is ASCII encoded, then text data is ASCII, whether ..." But it was never the case that all text was ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC, and I believe it still uses the latter. Other mainframe makers used other encodings of A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never universal. You could just as well have said "With EBCDIC, data is EBCDIC, whether ..."

https://en.wikipedia.org/wiki/Ascii
https://en.wikipedia.org/wiki/EBCDIC
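
To put actual bytes on the 'ABC' point above (cp500 is one of Python's EBCDIC codecs; a minimal sketch):

    print(list('ABC'.encode('ascii')))   # [65, 66, 67]
    print(list('ABC'.encode('cp500')))   # [193, 194, 195] -- same text, different bytes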

A crucial step in the spread of ASCII was its use in microcomputers, including the IBM PC. The latter was considered a toy by the mainframe guys. If they had known that PCs would partly take over the computing world, they might have suggested or insisted that it use EBCDIC.

"With unicode there are:
    encodings"
where 'encodings' is linked to
https://en.wikipedia.org/wiki/Character_encodings_in_HTML

If html 'always' used utf-8 (like xml), as has become common but not universal, all of the problems with *non-unicode* character sets and encodings would disappear. The pre-unicode declarations could then disappear. More truthful: "without unicode there are 100s of encodings; with unicode, only 3 that we should worry about."
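
Those 3 are presumably UTF-8, UTF-16 and UTF-32, which all encode the same code points with different byte layouts (a small sketch, using the explicit little-endian variants to avoid BOMs):

    s = 'A\u03a9'                   # 'A' plus GREEK CAPITAL LETTER OMEGA
    print(s.encode('utf-8'))        # b'A\xce\xa9'
    print(s.encode('utf-16-le'))    # b'A\x00\xa9\x03'
    print(s.encode('utf-32-le'))    # b'A\x00\x00\x00\xa9\x03\x00\x00'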

"in-memory formats"

These are not the concern of the using programmer as long as they do not introduce bugs or limitations (as do all the languages stuck on UCS-2 and many using UTF-16, including old Python narrow builds). Using what should generally be the universal transmission format, UTF-8, as the internal format means losing indexing and slicing, slowing those operations from O(1) to O(len(string)), or adding an index table that is not part of the unicode standard. Using UTF-32 avoids the above but usually wastes space -- up to 75%.
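
A rough illustration of that trade-off, using encoded byte strings rather than any implementation's actual internal storage:

    s = 'abc' * 1000 + '\U0001F600'    # 3001 characters, the last one astral

    utf8 = s.encode('utf-8')           # compact, but finding s[i] needs a linear scan
    utf32 = s.encode('utf-32-le')      # character i is always bytes 4*i .. 4*i+4

    print(len(utf8), len(utf32))       # 3004 vs 12004 -- here ~75% of UTF-32 is zero bytes

    i = 3000
    print(utf32[4*i:4*i+4].decode('utf-32-le') == s[i])   # True: O(1) indexing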

"strange beasties like python's FSR"

Have you really let yourself be poisoned by JMF's bizarre rants? The FSR is an *internal optimization* that benefits most unicode operations that people actually perform. It uses UTF-32 by default but adapts to the strings users create by compressing the internal format. The compression is trivial -- simply dropping leading null bytes common to all characters in the string -- so each character is still readable as is. The string header records how many bytes per character are left. Is the idea of algorithms that adapt to inputs really strange to you?
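
You can watch the adaptation from Python itself; the exact byte counts depend on the build, but the per-character ratios are the point (CPython 3.3+):

    import sys
    print(sys.getsizeof('a' * 1000))           # header + ~1 byte per character
    print(sys.getsizeof('\u0100' * 1000))      # header + ~2 bytes per character
    print(sys.getsizeof('\U00010000' * 1000))  # header + ~4 bytes per character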

Like good adaptive algorithms, the FSR is invisible to the user except for reducing space or time or maybe both. Unicode operations are otherwise the same as with previous wide builds. People who used to use narrow builds also benefit from bug elimination. The only 'headaches' involved might have been those of the developers who optimized previous wide builds.

CPython has many other functions with special-case optimizations and 'fast paths' for common, simple cases. For instance, (some? all?) number operations are optimized for pairs of integers. Do you call these 'strange beasties'?
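
In pure-Python terms (an analogy only, not CPython's C internals), a fast path is just this pattern -- handle the common simple case on a shortcut and fall through to the general code otherwise:

    def join_pieces(pieces):
        # fast path: everything is already a str, so one C-level join suffices
        if all(type(p) is str for p in pieces):
            return ''.join(pieces)
        # general case: coerce each piece first
        return ''.join(str(p) for p in pieces)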

PyPy is faster than CPython, when it is, because it is even more adaptable to particular computations by creating new fast paths. The mechanism to create these 'strange beasties' might have been a headache for the writers, but when it works, which it now seems to, it is not for the users.

--
Terry Jan Reedy

