Re: What does it mean to not be a valid string in Unicode?
On 2013/01/08 14:43, Stephan Stiller wrote:

> Wouldn't the clean way be to ensure valid strings (only) when they're built

Of course, the earlier erroneous data gets caught, the better. The problem is that error checking is expensive, both in lines of code and in execution time (I think there is data showing that in any real-life program, more than 50% or 80% or so is error checking, but I forgot the details). So indeed, as Ken has explained with a very good example, it doesn't make sense to check at every corner.

> and then make sure that string algorithms (only) preserve well-formedness of input? Perhaps this is how the system grew, but it seems to me that it's yet another legacy of C pointer arithmetic and about convenience of implementation rather than a safety or performance issue.

Convenience of implementation is an important aspect in programming.

> Things like this are called garbage in, garbage out (GIGO). It may be harmless, or it may hurt you later.

> So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-)

Sorry, but I have to disagree here. If a list of strings contains items with lone surrogates (garbage), then sorting them doesn't make the garbage go away, even if the items may be sorted in correct order according to some criterion.

Regards, Martin.
Re: What does it mean to not be a valid string in Unicode?
>> Wouldn't the clean way be to ensure valid strings (only) when they're built

> Of course, the earlier erroneous data gets caught, the better. The problem is that error checking is expensive, both in lines of code and in execution time (I think there is data showing that in any real-life program, more than 50% or 80% or so is error checking, but I forgot the details). So indeed, as Ken has explained with a very good example, it doesn't make sense to check at every corner.

What I meant: the idea was to check only when a string is constructed. As soon as it's been fed into a collation (or whatever) algorithm, the algorithm should assume the original input was well-formed and shouldn't do any more error checking, yes. Not having facilities for dealing with ill-formed values (U+D800 .. U+DFFF) in an algorithm will surely make *something* faster, even if it's just some table that's being used indirectly having fewer entries.

What I had in mind is a library where the public interface only ever allows Unicode scalar values as input and output. This will lead to a cleaner interface. A data structure that can hold surrogate values can and should be used algorithm-*internally*, if that makes things more efficient, safer, etc.

> Convenience of implementation is an important aspect in programming.

For a user yes, but not for a library writer/maintainer, I would suggest. Typical STL implementations use red-black trees; these are annoyingly difficult to implement but invisible to the user.

Stephan
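The approach described in this message (validate once when a string is built, then let algorithms trust their input) can be sketched roughly as below. This is only an illustration under stated assumptions: `ScalarString` and `sort_by_code_point` are hypothetical names, not part of any existing library, and plain code point order stands in for a real collation algorithm.

```python
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


class ScalarString:
    """Hypothetical wrapper whose contents are Unicode scalar values only."""

    def __init__(self, text: str) -> None:
        # Validate exactly once, at construction time.
        for index, ch in enumerate(text):
            if SURROGATE_FIRST <= ord(ch) <= SURROGATE_LAST:
                raise ValueError(
                    f"surrogate code point U+{ord(ch):04X} at index {index}"
                )
        self._text = text

    def __str__(self) -> str:
        return self._text

    def code_points(self) -> tuple:
        # No re-validation here: the constructor already guaranteed
        # that every element is a scalar value.
        return tuple(ord(ch) for ch in self._text)


def sort_by_code_point(items):
    """Stand-in for a real collation algorithm; it trusts its input."""
    return sorted(items, key=ScalarString.code_points)


words = [ScalarString("b"), ScalarString("ab"), ScalarString("a")]
print([str(w) for w in sort_by_code_point(words)])

try:
    ScalarString("a\ud800")  # lone surrogate: rejected at construction
except ValueError as err:
    print("rejected:", err)
```

The design choice is that the cost of checking is paid once per string, at the boundary of the interface, instead of inside every algorithm that later consumes the string.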
Re: Q is a Roman numeral?
On 08/01/2013 01:26, Ben Scarborough wrote:

> This isn't directly related to Unicode, but I thought this would be a good place to ask. Specifically, I'm curious about figure 14 (Gordon 1982) from WG2 N3218 [http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3218.pdf], which says:
>
> Whereas our so-called Arabic numerals are ten in number (0–9), the Roman numerals number nine: I = 1 (one), V = 5, X = 10, L = 50, C = 100, Đ = 500 (D regularly with middle bar, the modern form being simply D), a symbol for 1,000 (see below), Q = 500,000, and a rather strange symbol for 6: ↅ.
>
> Now that Q = 500,000 bit seems a little odd to me. I've never seen that anywhere else. Does anyone know where it came from? Is there real usage of Q for 500,000?

Roman numerals have always been more complex than the standard (modern) system we've been taught, and their use spans several millennia, over which many variations have occurred. If you look at Wikipedia's table for the Middle Ages and Renaissance, http://en.wikipedia.org/wiki/Roman_numeral#Middle_Ages.2FRenaissance , you'll see that many letters of the alphabet have been used as Roman numerals. In this table, Q is supposed to stand for 500, but this is not necessarily in contradiction with 500,000, since there were several ways to go beyond 1000...

As a side note on non-standard Roman numerals, I've seen 80,000 written IVXXM (like quatre vingt mille) in an old French edition of the Arthurian cycle.

Frédéric
RE: What does it mean to not be a valid string in Unicode?
> Sorry, but I have to disagree here. If a list of strings contains items with lone surrogates (garbage), then sorting them doesn't make the garbage go away, even if the items may be sorted in correct order according to some criterion.

Well, yeah, I wasn't claiming that the principled, correct output made the garbage go away. Let me put it this way: if my choices are

1) garbage in, garbage reliably sorted out into garbage bin, versus
2) garbage in, sorting fails with exception,

then I'll pick #1. ;-)

To give a concrete example, my implementation of UCA reliably passes the SHIFTED test cases in the conformance test, even though those test cases (deliberately) contain some ill-formed strings. If I instead did validation testing on input strings in my base implementation, it would be slower, *and* to pass the conformance test I would have to add a separate preprocessing stage that probed all the input data for ill-formed strings and filtered those cases out before engaging the test, so that it wouldn't fail with an exception when it hit the bad data.

--Ken
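The two choices named above can be contrasted in a small sketch. It is not Ken's UCA implementation: plain code point order stands in for collation, and `sort_tolerant`, `sort_strict`, and `is_well_formed` are hypothetical names introduced only for this illustration. It relies on the fact that a Python `str` can hold a lone surrogate code point.

```python
def is_well_formed(s: str) -> bool:
    """True if s contains no surrogate code points (U+D800..U+DFFF)."""
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def sort_tolerant(strings):
    # Choice 1: no validation. Ill-formed items stay in the list and
    # are ordered deterministically along with everything else.
    return sorted(strings)


def sort_strict(strings):
    # Choice 2: validate every input first and fail on bad data.
    for s in strings:
        if not is_well_formed(s):
            raise ValueError(f"ill-formed string: {ascii(s)}")
    return sorted(strings)


items = ["b", "a\ud800", "a"]  # the middle item holds a lone surrogate

# ascii() is used for printing because a lone surrogate cannot be
# encoded to UTF-8 for output.
print(ascii(sort_tolerant(items)))

try:
    sort_strict(items)
except ValueError as err:
    print("strict sort failed:", err)
```

As Martin's objection points out, the tolerant sort returns the ill-formed item unchanged; it only guarantees a stable position for it, not that the garbage has gone away.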
Re: Interoperability is getting better ... What does that mean?
Thank you for commenting and Happy New Year.

> CP-1252 is a perfectly legal web character set, and nobody is going to argue with you if you want to use it in legal ways. (I.e. writing Latin script in it, not Sinhala.) But .

Okay, what is implied is that I am doing something illegal. Define what I am doing that is illegal, and cite the rule and its purpose: preventing what harm to whom?

May I ask if the following two are Latin script, English or Singhala?

1. This is written in English.
2. mee laþingaþa síhalayi.

For me, both are Latin script; 1 is English and 2 is Singhala (it says, 'this is romanized Singhala').

The following are the *only* Singhala language web pages that pass HTML validation (challenge me): http://www.lovatasinhala.com/ They are in romanized Singhala.

The statement that "the death of most character sets makes everyone's systems smaller and faster" is *FALSE*. Compare the sizes of the following two files, which are copies of a newspaper article. The top part, in red, has a few more words in romanized Singhala in the romanized Singhala file. Notice the size of each file:

1. http://ahangama.com/jc/uniSinDemo.htm size: 38,092 bytes
2. http://ahangama.com/jc/RSDemo.htm size: 18,922 bytes

As the size of the page grows, the Unicode Sinhala version tends toward double the size of its romanized Singhala version. Unicode Sinhala characters become 50% larger when UTF-8 encoded for transmission. That is three times the size of the romanized Singhala file. So, the Unicode Sinhala file consumes 3 times the bandwidth needed to send the romanized Singhala file.

> more likely to correctly show them the document instead of trash

Again *demonstrably WRONG*: Unicode Sinhala is trash in a machine that does not have the fonts. It is trash also if the font used by the OS is improperly made, such as in iPhone. It is generally trash because the SLS1134 standard corrupts at least one writing convention. (Brandy issue).
On the other hand, romanized Singhala is always readable whether you have the font or not. It is not helpful to criticize Singhala-related things without making a serious effort to understand the issues. Blind men thought different things about the elephant.

If you mean that everyone should start using 16-bit Unicode characters, I have no objection to that. It would happen if and when all applications implement it. I cannot fight that even if I want to. But I do not see users of English doing anything different from what they are doing now, like my typing now, I think, using 8-bit characters. (I can verify that by copying it and pasting into a text editor.)

I showed that Singhala can be romanized and all the problems of ill-conceived Unicode Indic can be eliminated by carefully studying the grammar of the language and romanizing. (I used the word 'transliterate' earlier, but the correct word is 'transcribe'.) I did it for Singhala and made an OpenType font to show it perfectly in the traditional Singhala script. So far, there is one RS smartfont and six Unicode fonts, even after spending $20M for a foreign expert to tell how to make fonts, though it is right on the web in the same language the expert spoke in.

My work irritates some, maybe because it is an affront to their belief that they know all and decide all. Some feel let down that they could not think of it earlier, and maybe write about a strange discovery like Abugida and write a book on the nonsense. Most of all, I think it is just a cultural block on this side of the globe.

As for Lankan technocrats, their worry is that the purpose of ICTA would come unraveled. I went there in November and it was revealed to me (by one of its employees) that its purpose is to provide a single point of contact for foreign vendors that can use local experts as their advocates.
On Thu, Jan 3, 2013 at 12:56 AM, Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote:

> Asmus Freytag, Mon, 31 Dec 2012 06:44:44 -0800:
>> On 12/31/2012 3:27 AM, Leif Halvard Silli wrote:
>>> Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800:
>>> The Web archive for this very list needs a fix as well …
>> The way to formally request any action by the Unicode Consortium is via the contact form (found on the home page).
> Good idea. Done! Turned out to only be - it seems to me - an issue of mislabeling the monthly index pages as ISO-8859-1 instead of UTF-8. Whereas the very messages themselves are archived correctly. And thus I made the request that they properly label the index pages. Happy new year!
> -- leif h silli
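The per-character byte costs behind the size argument in the message above can be checked directly. The sketch below is only an illustration: the Sinhala sample word is chosen here and is not taken from the two demo pages, `encoded_sizes` is a hypothetical helper, and the romanized sample is the sentence quoted in the message. It shows that Sinhala-block code points (U+0D80..U+0DFF) take 3 bytes each in UTF-8 and 2 bytes each in UTF-16, while the romanized text takes 1 byte per character in CP-1252.

```python
def encoded_sizes(text: str, encodings) -> dict:
    """Return the encoded length in bytes of text for each encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}


# Illustrative sample (an assumption, not from the demo pages):
# five code points from the Sinhala block.
sinhala_sample = "\u0DC3\u0DD2\u0D82\u0DC4\u0DBD"

# The romanized sentence quoted in the message, written with escapes
# for U+00FE (thorn) and U+00ED (i with acute).
romanized_sample = "mee la\u00feinga\u00fea s\u00edhalayi."

print(len(sinhala_sample), "code points:",
      encoded_sizes(sinhala_sample, ["utf-8", "utf-16-le"]))
print(len(romanized_sample), "code points:",
      encoded_sizes(romanized_sample, ["cp1252", "utf-8"]))
```

These figures cover only the text itself. A real page also carries ASCII markup, stylesheets, and any downloadable fonts, so per-character ratios do not translate directly into page sizes or bandwidth.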
Re: Interoperability is getting better ... What does that mean?
2013-01-08 23:56, Naena Guru wrote:

> May I ask if the following two are Latin script, English or Singhala? 1. This is written in English. 2. mee laþingaþa síhalayi. For me, both are Latin script and 1 is English and 2 is Singhala (says,' this is romanized Singhala').

Text 2 is “romanized Singhala” only by your private definition, and you don’t even mean that. You are not actually promoting the use of Latin letters to write Sinhala but to use a private 8-bit encoding for Sinhala. You expect such a font to be used that the letter “a” is not displayed as “a” but as something completely different, as a Sinhala character.

It seems that your agenda here is something very different from the Subject line you use – not about generalities, but about certain fontistic trickery.

> http://www.lovatasinhala.com/

If you look at the title of the page as displayed in a browser’s tab header or equivalent, you see “nivahal heøa”. This is what happens when the font trickery fails (because browsers use their fixed fonts to display such items).

The trickery is nothing new. It was used even when you had to use font face on the web to use fonts, and at that time, the trickery was analyzed and found wanting, see e.g. http://alis.isoc.org/web_ml/html/fontface.en.html

There’s no reason to go into such analyses any more. If you are happy with this or that trickery and don’t want them to be analyzed, just use them. But please don’t expect the rest of the world to go back to bad old days.

Yucca
Re: Interoperability is getting better ... What does that mean?
I for one am so glad we now have Unicode. I remember when in pre-Unicode days my then-girlfriend was writing a PhD thesis in German about Russian linguistics. She had fonts for both alphabets, but due to technical limitations the different letters had to share the same code points. And at one point somehow the correct formatting got lost in her word processor... A complete and utter disaster!

You are not serious, are you?

Charlie
Re: Interoperability is getting better ... What does that mean?
Naena Guru, Tue, 8 Jan 2013 15:56:52 -0600:

> The statement, the death of most character sets makes everyone's systems smaller and faster is *FALSE*. Compare the sizes of the following two files that are copies of a newspaper article. The top part in red has few more words in romanized Singhala in the romanized Singhala file. Notice the size of each file:
> 1. http://ahangama.com/jc/uniSinDemo.htm size:38,092 bytes
> 2. http://ahangama.com/jc/RSDemo.htm size:18,922 bytes
> [ … ]
> Again *demonstrably WRONG*

To double-check your statement, I saved the above two pages in Safari’s webarchive format[1] and compared the resulting size of each archive file. The benefit of doing such a comparison is that we then get to count both the HTML page *plus* all the extra fonts that are included in the romanized Singhala file. Thus, we get a more *real* basis for comparing the relative size of the two pages. Here are the results:

1. http://ahangama.com/jc/uniSinDemo.htm, webarchive size: 205 459 bytes
2. http://ahangama.com/jc/RSDemo.htm, webarchive size: 223 201 bytes

As you can see, the romanized Singhala file loses: it becomes bigger than the UTF-8 version. I suppose the reason for this is that, for the romanized Singhala file, fonts have to be downloaded in order to display the romanized Singhala. (I tried to do the same in Firefox, using its ability to save the complete page; however, for some reason it did not work.)

I also ran a test on both pages with the YSlow service.[2] Here is the total weight of each page, according to YSlow, when run from Firefox:

1. http://ahangama.com/jc/uniSinDemo.htm, YSlow size: 92.7K
2. http://ahangama.com/jc/RSDemo.htm, YSlow size: 65.7K

And here are the YSlow results from Safari:

1. http://ahangama.com/jc/uniSinDemo.htm, YSlow size: 11.2K
2. http://ahangama.com/jc/RSDemo.htm, YSlow size: 9.0K

Rather interesting that Safari and Firefox differ that much.
But anyhow, the YSlow results are pretty clear, and demonstrate that while the romanized Singhala page is smaller, it is only between 20 and 30 percent smaller than the Unicode page. However, despite the slightly bigger size, YSlow in Firefox (don't know how to see it in Safari) *still* reported that the Unicode page loaded faster!

Furthermore, when I inspected the source code of these two documents, I discovered that for the Unicode file, you included *two* downloadable fonts, whereas for the romanized Singhala page, you only included *one* downloadable font. (Why? Because both files actually contain some romanized Singhala!) Before we can *really* take those two test pages seriously, you must make sure that both pages use the same number of fonts! As it is, I strongly suspect that if you had included the same number of downloadable fonts in both pages, then the Unicode page would have won.

Of course, the romanized Singhala page has many usability problems as well:

1) It doesn't work with screen readers (users will hear the text as Latin text).
2) It doesn’t work with Find-in-page search (users will type in Sinhala, but since the content is actually Latin, they won’t find anything on the page).
3) The title of the romanized Singhala page is (I believe) not actually readable as Singhala.
4) There are many browsers in which the romanized Singhala file will not display: text browsers, Opera and any browser where CSS is disabled.
5) You get all kinds of problems for form submission.

Conclusion: Your claims about the file size advantage of romanized Singhala seem grossly exaggerated, if at all true, based as they are on a test of two files which aren't actually equal when it comes to the extra CSS stuff that they embed.

[1] http://en.wikipedia.org/wiki/Webarchive
[2] http://yslow.org/

-- leif halvard silli
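The size figures quoted in this thread can be put side by side with a few lines of arithmetic. The sketch below uses only the numbers given in the messages; the helper name `percent_smaller` and the dictionary layout are assumptions made for the illustration.

```python
def percent_smaller(unicode_size: float, romanized_size: float) -> float:
    """How much smaller the romanized page is, as a percentage of the
    Unicode page. A negative result means the romanized page is bigger."""
    return 100.0 * (unicode_size - romanized_size) / unicode_size


# (Unicode page, romanized page), as reported in the thread.
measurements = {
    "raw HTML (bytes)": (38_092, 18_922),
    "Safari webarchive (bytes)": (205_459, 223_201),
    "YSlow, Firefox (K)": (92.7, 65.7),
    "YSlow, Safari (K)": (11.2, 9.0),
}

for label, (unicode_size, romanized_size) in measurements.items():
    ratio = unicode_size / romanized_size
    saving = percent_smaller(unicode_size, romanized_size)
    print(f"{label}: ratio {ratio:.2f}, romanized is {saving:.1f}% smaller")
```

On these numbers the raw HTML ratio comes out at about 2, the two YSlow measurements give savings of roughly 29 and 20 percent, and the webarchive comparison is negative, meaning the romanized page is the larger one once fonts are counted.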
Mark Crispin (1956-2012)
Farewell to Mark Crispin, a true friend of Unicode. http://en.wikipedia.org/wiki/Mark_Crispin https://www.ietf.org/mail-archive/web/imap5/current/msg00571.html Michael Everson * http://www.evertype.com/