On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
>
> > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> >> On 2/26/2015 8:24 AM, Chris Angelico wrote:
> >> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
> >> >> Wrote something up on why we should stop using ASCII:
> >> >> http://blog.languager.org/2015/02/universal-unicode.html
> >>
> >> I think that the main point of the post, that many Unicode chars are
> >> truly planetary rather than just national/regional, is excellent.
> >
> > <snipped>
> >
> >> You should add emoticons, but not call them or the above 'gibberish'.
> >> I think that this part of your post is more 'unprofessional' than the
> >> character blocks. It is very jarring and seems contrary to your main
> >> point.
> >
> > Ok Done
> >
> > References to gibberish removed from
> > http://blog.languager.org/2015/02/universal-unicode.html
>
> I consider it unethical to make semantic changes to a published work in
> place without acknowledgement. Fixing minor typos or spelling errors, or
> dead links, is okay. But any edit that changes the meaning should be
> commented on, either by an explicit note on the page itself, or by
> striking out the previous content and inserting the new.
Dunno what you are grumping about… Anyway, the attribution is now more
explicit – footnote 5 in
http://blog.languager.org/2015/03/whimsical-unicode.html. Note that Terry
Reedy, whose post was the main objection, was already acked earlier; I've
just added one more ack¹. And JFTR, the 'publication' (O how archaic!) is
the whole blog, not a single page – just as it is for any dead-tree
publication.

> As for the content of the essay, it is currently rather unfocused.

True.

> It appears to be more of a list of "here are some Unicode characters I
> think are interesting, divided into subgroups, oh and here are some I
> personally don't have any use for, which makes them silly" than any sort
> of discussion about the universality of Unicode. That makes it rather
> idiosyncratic and parochial. Why should obscure maths symbols be given
> more importance than obscure historical languages?

Idiosyncratic ≠ parochial.

> I think that the universality of Unicode could be explained in a single
> sentence:
>
> "It is the aim of Unicode to be the one character set anyone needs to
> represent every character, ideogram or symbol (but not necessarily
> distinct glyph) from any existing or historical human language."
>
> I can expand on that, but in a nutshell that is it.
>
> You state:
>
> "APL and Z Notation are two notable languages (APL is a programming
> language and Z a specification language) that did not tie themselves
> down to a restricted charset ..."

Tsk tsk – dishonest snipping. I wrote:

| APL and Z Notation are two notable languages (APL is a programming
| language and Z a specification language) that did not tie themselves
| down to a restricted charset even in the day that ASCII ruled.

so it's clear that the restriction being referred to is ASCII.

> You list ideographs such as Cuneiform under "Icons". They are not icons.
> They are a mixture of symbols used for consonants, syllables, and
> logophonetic, consonantal alphabetic and syllabic signs. That sits them
> firmly in the same categories as modern languages with consonants,
> ideogram languages like Chinese, and syllabary languages like Cherokee.

OK – changed to 'iconic'. Obviously, 2-3 millennia ago, when people wrote
hieroglyphs or cuneiform, those were living languages. In 2015, when
someone sees and recognizes them, they are 'those things that the
Sumerians/Egyptians wrote'. No one except a rare expert knows those
languages.

> Just because native readers of Cuneiform are all dead doesn't make
> Cuneiform unimportant. There are probably more people who need to write
> Cuneiform than people who need to write APL source code.
>
> You make a comment:
>
> "To me – a unicode-layman – it looks unprofessional… Billions of
> computing devices world over, each having billions of storage words
> having their storage wasted on blocks such as these??"
>
> But that is nonsense, and it contradicts your earlier quoting of Dave
> Angel. Why are you so worried about an (illusionary) minor optimization?

2 < 4, as far as I am concerned – two bytes per character beat four. [If
you disagree, one man's illusionary is another's waking.]

> Whether code points are allocated or not doesn't affect how much space
> they take up. There are millions of unused Unicode code points today. If
> they are allocated tomorrow, the space your documents take up will not
> increase one byte.
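No dispute there, and it is easy to check. A minimal Python 3 sketch (byte
counts are for UTF-8; U+12000 is just a sample cuneiform sign):

    # Only characters actually used take space; the millions of unused
    # (or unused-by-you) code points cost a document nothing.
    plain = "hello"
    with_cuneiform = "hello \U00012000"   # append one cuneiform sign

    print(len(plain.encode("utf-8")))           # 5
    print(len(with_cuneiform.encode("utf-8")))  # 10 = 5 + 1 (space) + 4

Adding the sign costs exactly four bytes; the "hello" part is untouched.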
> Allocating code points to Cuneiform has not increased the space needed
> by Unicode at all. Two bytes alone is not enough for even existing human
> languages (thanks, China). For hardware-related reasons, it is faster
> and more efficient to use four bytes than three, so the obvious and
> "dumb" (as in "the simplest thing which will work") way to store Unicode
> is UTF-32, which takes a full four bytes per code point, regardless of
> whether there are 65537 code points or 1114112. That makes it less
> expensive than floating point numbers, which take eight. Would you like
> to argue that floating point doubles are "unprofessional" and wasteful?
>
> As Dave pointed out, and you apparently agreed with him enough to quote
> him TWICE (once in each of two blog posts), the history of computing is
> full of premature optimizations for space. (In fact, some of these may
> have been justified by the technical limitations of the day.)
> Technically Unicode is also limited, but it is limited to over one
> million code points, 1114112 to be exact, although some of them are
> reserved as invalid for technical reasons, and there is no indication
> that we'll ever run out of space in Unicode.
>
> In practice, there are three common Unicode encodings that nearly all
> Unicode documents will use.
>
> * UTF-8 will use between one and (by memory) four bytes per code
>   point. For Western European languages, that will be mostly one
>   or two bytes per character.
>
> * UTF-16 uses a fixed two bytes per code point in the Basic
>   Multilingual Plane, which is enough for nearly all Western European
>   writing and much East Asian writing as well. For the rest, it uses
>   a fixed four bytes per code point.
>
> * UTF-32 uses a fixed four bytes per code point. Hardly anyone uses
>   this as a storage format.
>
> In *all three cases*, the existence of hieroglyphs and cuneiform in
> Unicode doesn't change the space used. If you actually include a few
> hieroglyphs in your document, the space increases only by the actual
> space used by those hieroglyphs: four bytes per hieroglyph. At no time
> does the existence of a single hieroglyph in your document force you to
> expand the non-hieroglyph characters to use more space.
>
> > What I was trying to say is expanded here:
> > http://blog.languager.org/2015/03/whimsical-unicode.html
>
> You have at least two broken links, referring to a non-existent page:
>
> http://blog.languager.org/2015/03/unicode-universal-or-whimsical.html

Thanks – corrected.

> This essay seems to be even more rambling and unfocused than the first.
> What does the cost of semi-conductor plants have to do with whether or
> not programmers support Unicode in their applications?
>
> Your point about the UTF-8 "BOM" is valid only if you interpret it as a
> Byte Order Mark. But if you interpret it as an explicit UTF-8 signature
> or mark, it isn't so silly. If your text begins with the UTF-8 mark,
> treat it as UTF-8. It's no more silly than any other heuristic, like
> HTML encoding tags or text editors' encoding cookies.
>
> Your discussion of "complexifiers and simplifiers" doesn't seem to be
> terribly relevant, or at least if it is relevant, you don't give any
> reason for it. The whole thing about Moore's Law and the cost of
> semi-conductor plants seems irrelevant to Unicode except in the most
> over-generalised sense of "things are bigger today than in the past,
> we've gone from five-bit Baudot codes to 21-bit Unicode". Yeah, okay.
> So what's your point?

- Most people need only 16 bits.
- Many notable examples of software break going from 16 to 21 bits.
- If you are a software writer and you fail going from 16 to 21, that's
  OK – but try to give useful errors, something like the sketch below.
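By "useful errors" I mean something like this hypothetical Python 3 sketch
(the function name and message are mine, purely for illustration):

    def check_bmp_only(text):
        # A BMP-only system should reject astral (SMP) characters with a
        # pointed error rather than silently mangling them.
        for pos, ch in enumerate(text):
            if ord(ch) > 0xFFFF:
                raise ValueError(
                    "U+%06X at index %d is outside the Basic Multilingual "
                    "Plane, which this program does not handle"
                    % (ord(ch), pos))

    check_bmp_only("mostly harmless")      # passes silently
    check_bmp_only("pile of \U0001F4A9")   # ValueError naming U+01F4A9, index 8

That beats the mojibake and crashes that the notable examples above
actually produce.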
> You agree that 16 bits are not enough, and yet you criticise Unicode for
> using more than 16 bits on wasteful, whimsical gibberish like Cuneiform?
> That is an inconsistent position to take.

| ½-assed unicode support – BMP-only – is better than 1/100-assed⁴
| support – ASCII. BMP-only Unicode is universal enough but within
| practical limits, whereas full (7.0) Unicode is 'really' universal at a
| cost of performance and whimsicality.

Do you disagree that BMP-only = 16 bits?

> UTF-16 is not half-arsed Unicode support. UTF-16 is full Unicode
> support.
>
> The problem is when your language treats UTF-16 as a fixed-width
> two-byte format instead of a variable-width, two- or four-byte format.
> (That's more or less like the old, obsolete, UCS-2 standard.) There are
> all sorts of good ways to solve the problem of surrogate pairs and the
> SMPs in UTF-16. If some programming languages or software fail to do
> so, they are buggy, not UTF-16.
>
> After explaining that 16 bits are not enough, you then propose a 16-bit
> standard. /face-palm
>
> UTF-16 cannot break the fixed-width invariant, because it has no
> fixed-width invariant. That's like arguing against UTF-8 because it
> breaks the fixed-width invariant "all characters are single-byte ASCII
> characters".
>
> If you cannot handle SMP characters, you are not supporting Unicode.

Unicode 7.0, that is.
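No dispute on the mechanics, by the way. UTF-16's two-or-four-byte
behaviour is easy to see from Python 3 (utf-16-le keeps the BOM out of the
byte counts):

    # BMP code points encode to two bytes in UTF-16; SMP code points
    # encode to four bytes, i.e. a surrogate pair.
    bmp_char = "\u00e9"        # é – inside the BMP
    smp_char = "\U0001F600"    # 😀 – outside the BMP, in the SMP

    print(len(bmp_char.encode("utf-16-le")))   # 2
    print(len(smp_char.encode("utf-16-le")))   # 4

The breakage comes from software that assumes the answer is always 2.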
> You suggest that Chinese users should be looking at Big5 or GB. I
> really, really don't think so.
>
> - Neither is universal. What makes you think that Chinese writers need
>   to use maths symbols, or include (say) Thai or Russian in their work
>   any less than Western writers do?
>
> - Neither even supports all of Chinese. Big5 supports Traditional
>   Chinese, but not Simplified Chinese. GB supports Simplified
>   Chinese, but not Traditional Chinese.
>
> - Big5 likewise doesn't support placenames, many people's names, and
>   other less common parts of Chinese.
>
> - Big5 is a shift-system, like Shift-JIS, and suffers from the same
>   sort of data corruption issues.
>
> - There is no one single Big5 standard, but a whole lot of vendor
>   extensions.
>
> You say:
>
> "I just want to suggest that the Unicode consortium going overboard in
> adding zillions of codepoints of nearly zero usefulness, is in fact
> undermining unicode's popularity and spread."
>
> Can you demonstrate this? Can you show somebody who says "Well, I was
> going to support full Unicode, but since they added a snowman, I'm
> going to stick to ASCII"?

I gave a list of software packages that goof/break going from BMP-only to
full 7.0 Unicode.

> The "whimsical" characters you are complaining about were important
> enough to somebody to spend significant amounts of time and money to
> write up a proposal, have it go through the Unicode Consortium
> bureaucracy, and eventually have it accepted. That's not easy or cheap,
> and people didn't add a snowman on a whim. They did it because there
> are a whole lot of people who want a shared standard for map symbols.
>
> It is easy to mock what is not important to you. I daresay kids adding
> emoji to their 10 character tweets would mock all the useless maths
> symbols in Unicode too.

The head para of section 5 has:

| However (the following) are (in the standard)! So lets use them!

Does that look like mocking to you? The only mocking is at 5.1, and even
there I don't mock the users of these blocks – now or millennia ago. I
only mock the Unicode consortium for putting them into Unicode.

----------------------

¹ And somewhere around here we get into Gödelian problems – known to
programmers in the form "write a program that prints itself". Likewise
with acks. I am going to deal with the Gödel-loop by this device:

- Address real issues/objections
- Smile at grumpiness

--
https://mail.python.org/mailman/listinfo/python-list