Re: Interoperability is getting better ... What does that mean?
2013-01-09 2:55, Leif Halvard Silli wrote: The benefit of doing such a comparison is that we then get to count both the HTML page *plus* all the extra fonts that are included in the romanized Singhala file. Thus, we get a more *real* basis for comparing the relative size of the two pages. Not really. I don’t want to comment on “romanized Singhala” any more, but I can’t leave a different fallacy uncommented. When comparing the sizes of web pages, it is clearly not sufficient to compare the HTML pages alone. It is not uncommon to have just a few kilobytes of HTML but loads of JavaScript and images, totalling a megabyte or more. This makes it relatively irrelevant whether some characters occupy one byte or two. (Besides, HTML often gets automatically compressed for transmission.) But if we count font files as well, we should count them in all the alternatives being compared. Although you can, in principle, write e.g. a web page in Sinhala by simply providing the text content, sitting back, and expecting browsers to render it using whatever fonts they prefer, that’s a very unrealistic approach in practice. It would work for English (though few web content providers do that – they mostly want to set fonts), but for Sinhala it would mean that a very large share of users (possibly the majority) would not see the Sinhala letters. The reason is that their computers lack any font that contains them. (Well, that is not the only reason, but it is the most common one.) So in order to make (almost) all visitors see the content correctly, the author of a Sinhala page should probably provide a downloadable font, via @font-face, that contains Sinhala letters (as a Unicode-encoded font). Another option is to link to a font that the visitor can download and install, and this is what e.g. 
the site of the Parliament of Sri Lanka http://www.parliament.lk/ does, but the more modern way of using @font-face is much smoother and does not disturb the visitor with technicalities (and, besides, not all users can install fonts). And, to be fair, Unicode-encoded fonts that contain Sinhala letters tend to be considerably larger than 8-bit ad-hoc encoded fonts. Then again, these days, size does not matter that much, and a downloadable font gets cached, and a Unicode-encoded font typically contains a much richer repertoire of characters, so that characters from different scripts (like Sinhala, English, and Common-script characters) have been designed to fit together. Yucca
Re: Interoperability is getting better ... What does that mean?
Jukka K. Korpela, Wed, 09 Jan 2013 11:03:28 +0200: 2013-01-09 2:55, Leif Halvard Silli wrote: The benefit of doing such a comparison is that we then get to count both the HTML page *plus* all the extra fonts that are included in the romanized Singhala file. Thus, we get a more *real* basis for comparing the relative size of the two pages. Not really. I don’t want to comment on “romanized Singhala” any more, but I can’t leave a different fallacy uncommented. Not sure which fallacy you have identified - see below. When comparing the sizes of web pages, it is clearly not sufficient to compare the HTML pages alone. It is not uncommon to have just a few kilobytes of HTML but loads of JavaScript and images, totalling a megabyte or more. This makes it relatively irrelevant whether some characters occupy one byte or two. (Besides, HTML often gets automatically compressed for transmission.) On this we agree. But if we count font files as well, we should count them in all the alternatives being compared. […] for Sinhala, it would mean that a very large part of users (possibly the majority) would not see the Sinhala letters. The reason is that their computers lack any font that contains them. (Well, not the only reason, but the most common one.) In Opera for Mac OS X, the @font-face embedding apparently did not work. Thus, the romanized Singhala page did not render any Sinhala at all, whereas for the Unicode version one did get to see Sinhala letters, though with many gaps, so the Sinhala representation on Mac OS X could indeed be wrong. 
OTOH, when I disabled CSS in a WebKit browser (iCab), the page rendered just fine, so I am not sure why it failed in Opera - it could be a more complicated reason. […] And, to be fair, Unicode-encoded fonts that contain Sinhala letters tend to be considerably larger than 8-bit ad-hoc encoded fonts. Then again, these days, size does not matter that much, and a downloadable font gets cached, and a Unicode-encoded font typically contains a much richer repertoire of characters, so that characters from different scripts (like Sinhala, English, and Common-script characters) have been designed to fit together. It is indeed true that for the two pages in question, the Unicode font was a little larger than the romanized Singhala font. However, as I said, the Unicode test page included two fonts, namely the romanized Singhala font and a Unicode Singhala font, whereas the other page included only one font - the romanized Singhala font. Despite this difference, the Unicode Singhala page came out pretty well compared with the romanized Singhala page: it seemingly loaded faster - for some reason. It resulted in a smaller WebKit webarchive. And, according to YSlow, it only contained 20-30% more total weight than the romanized Singhala page. It is of course debatable how much larger size really matters — I did not say that it did or did not matter. But regardless: there are authors, and authoring tools such as YSlow, that focus on exactly that issue: size, and how that and other factors affect page load and other aspects of the user experience. And it was interesting to see that even from that angle, the Unicode Singhala page seemed more like the winner than the loser. -- leif halvard silli
Re: Interoperability is getting better ... What does that mean?
2013-01-09 11:57, Leif Halvard Silli wrote: Not sure which fallacy you have identified - see below. I was referring to a comparison between an ad hoc 8-bit encoding and a Unicode encoding in which you count the sizes of font files in the first case only. I’m a bit confused by your comparison, which seems to deal with a page that uses a downloadable font in both cases, but uses some rather obscure fonts (from a site that has no main page, etc.). In any case, my point might not apply to your specific comparison, but it applies to the general scenario: when you use a “fontistic trick”, based on the use of a font that arbitrarily maps bytes 0–255 to whatever glyphs are wanted this time, the font is a necessary ingredient of the soup. When using Unicode encoding for character data, you do not depend on any specific font, but the data still has to be rendered in *some* font. And the more rare characters you use, in terms of coverage in the fonts normally available on computers worldwide, the more you will in practice be forced to deal with the font issue, not just for typography, but for having the characters displayed at all. And this quite often means that you need to embed a font (in a Word or PDF document), or to write an elaborate font-family list in CSS, or to use @font-face. Besides, on web pages, you normally need to provide a downloadable font in three or four formats to be safe. So, quite often, the size of the data is increased – actually more due to the size of fonts than due to character encoding issues. But in a vast majority of cases, this price is worth paying. After all, if saving bits were our only concern, we would be using a 5-bit code. ☺ Yucca
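As an aside, the "three or four formats" mentioned above were, in that era, typically served with a single @font-face rule of roughly the following shape. This is only a sketch in the so-called "bulletproof" style of the period; the font name and file paths are hypothetical, not taken from any site discussed in this thread:

```css
@font-face {
  font-family: "SinhalaWeb";                      /* hypothetical name */
  src: url("sinhalaweb.eot");                     /* IE9 compatibility mode */
  src: url("sinhalaweb.eot?#iefix") format("embedded-opentype"), /* IE6-8 */
       url("sinhalaweb.woff") format("woff"),     /* most modern browsers */
       url("sinhalaweb.ttf") format("truetype"),  /* Safari, Android */
       url("sinhalaweb.svg#SinhalaWeb") format("svg"); /* legacy iOS */
}
body {
  font-family: "SinhalaWeb", sans-serif; /* fall back to a system font */
}
```

Each browser downloads only the first listed format it supports, but the author must still host all of them, which is the extra weight being discussed.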
Re: Interoperability is getting better ... What does that mean?
2013/1/9 Jukka K. Korpela jkorp...@cs.tut.fi: And, to be fair, Unicode-encoded fonts that contain Sinhala letters tend to be considerably larger than 8-bit ad-hoc encoded fonts. Then again, these days, size does not matter that much, and a downloadable font gets cached, Size does matter when you're using mobile Internet access, because data plans are now frequently limited in volume and cost a lot when you're roaming abroad with your smartphone (even within Europe, where those roaming fees have been slightly limited) and don't have access to a suitable free local Wi-Fi hotspot! Anyway, if size matters, no website should ever depend on downloadable fonts to be displayable. I cannot use and display the romanized Sinhala site on my smartphone; I just get garbage. But true Unicode Sinhala pages work very well, without additional cost.
Re: Interoperability is getting better ... What does that mean?
Thank you for commenting and Happy New Year. CP-1252 is a perfectly legal web character set, and nobody is going to argue with you if you want to use it in legal ways. (I.e. writing Latin script in it, not Sinhala.) But … Okay, what is implied is that I am doing something illegal. Define what I am doing that is illegal, and cite the rule, its purpose, and what harm to whom it prevents. May I ask if the following two are Latin script, English or Singhala? 1. This is written in English. 2. mee laþingaþa síhalayi. For me, both are Latin script, and 1 is English while 2 is Singhala (it says, 'this is romanized Singhala'). The following are the *only* Singhala language web pages that pass HTML validation (challenge me): http://www.lovatasinhala.com/ They are in romanized Singhala. The statement that the death of most character sets makes everyone's systems smaller and faster is *FALSE*. Compare the sizes of the following two files, which are copies of a newspaper article. The top part in red has a few more words of romanized Singhala in the romanized Singhala file. Notice the size of each file: 1. http://ahangama.com/jc/uniSinDemo.htm size: 38,092 bytes 2. http://ahangama.com/jc/RSDemo.htm size: 18,922 bytes As the size of the page grows, the Unicode Sinhala version tends toward double the size of its romanized Singhala version. Unicode Sinhala characters become another 50% larger when UTF-8 encoded for transmission. That makes three times the size of the romanized Singhala file, so the Unicode Sinhala file consumes three times the bandwidth needed to send the romanized Singhala file. "more likely to correctly show them the document instead of trash" Again *demonstrably WRONG*: Unicode Sinhala is trash on a machine that does not have the fonts. It is trash also if the font used by the OS is improperly made, as on the iPhone. It is generally trash because the SLS1134 standard corrupts at least one writing convention. (The brandy issue.) 
On the other hand, romanized Singhala is always readable whether you have the font or not. It is not helpful to criticize Singhala-related things without making a serious effort to understand the issues. The blind men thought different things about the elephant. If you mean that everyone should start using 16-bit Unicode characters, I have no objection to that. It would happen if and when all applications implement it. I cannot fight that even if I want to. But I do not see users of English doing anything different from what they are doing now - like my typing now, I think, using 8-bit characters. (I can verify that by copying it and pasting it into a text editor.) I showed that Singhala can be romanized, and that all the problems of ill-conceived Unicode Indic can be eliminated, by carefully studying the grammar of the language and romanizing. (I used the word 'transliterate' earlier, but the correct word is 'transcribe'.) I did it for Singhala and made an OpenType font to show it perfectly in the traditional Singhala script. So far there is one RS smartfont and six Unicode fonts, even after $20M was spent on a foreign expert to explain how to make fonts, though it is right there on the web in the same language the expert spoke. My work irritates some, maybe because it is an affront to their belief that they know all and decide all. Some feel let down that they could not think of it earlier, and maybe write about a strange discovery like the abugida and write a book on the nonsense. Most of all, I think it is just a cultural block on this side of the globe. As for Lankan technocrats, their worry is that the purpose of ICTA would come unraveled. I went there in November, and it was revealed to me (by one of its employees) that its purpose is to provide a single point of contact for foreign vendors who can use local experts as their advocates. 
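For what it is worth, the per-character arithmetic behind the size claims above can be checked directly. A short Python sketch (the assumption that the romanized scheme spends exactly one byte per letter is mine, made for illustration):

```python
# Per-character byte counts behind the "three times the size" argument.
# A letter from the Sinhala block (U+0D80-U+0DFF) occupies 2 bytes in
# UTF-16 and 3 bytes in UTF-8; an 8-bit "romanized" scheme (assumed
# here) would use 1 byte per letter.
sinhala_letter = "ස"  # one letter from the Sinhala block

print(len(sinhala_letter.encode("utf-16-le")))  # 2
print(len(sinhala_letter.encode("utf-8")))      # 3
```

Whether those per-character ratios survive once markup, compression, and font files are added is exactly what the rest of the thread disputes.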
On Thu, Jan 3, 2013 at 12:56 AM, Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote: […]
Re: Interoperability is getting better ... What does that mean?
2013-01-08 23:56, Naena Guru wrote: May I ask if the following two are Latin script, English or Singhala? 1. This is written in English. 2. mee laþingaþa síhalayi. For me, both are Latin script and 1 is English and 2 is Singhala (says, 'this is romanized Singhala'). Text 2 is “romanized Singhala” only by your private definition, and you don’t even mean that. You are not actually promoting the use of Latin letters to write Sinhala, but the use of a private 8-bit encoding for Sinhala. You expect a font to be used in which the letter “a” is not displayed as “a” but as something completely different, a Sinhala character. It seems that your agenda here is something very different from the Subject line you use – not generalities, but certain fontistic trickery. http://www.lovatasinhala.com/ If you look at the title of the page as displayed in a browser’s tab header or equivalent, you see “nivahal heøa”. This is what happens when the font trickery fails (because browsers use their own fixed fonts to display such items). The trickery is nothing new. It was used even in the days when you had to use <font face> markup to set fonts on the web, and at that time the trickery was analyzed and found wanting; see e.g. http://alis.isoc.org/web_ml/html/fontface.en.html There’s no reason to go into such analyses any more. If you are happy with this or that trickery and don’t want it to be analyzed, just use it. But please don’t expect the rest of the world to go back to the bad old days. Yucca
Re: Interoperability is getting better ... What does that mean?
I for one am so glad we now have Unicode. I remember when, in pre-Unicode days, my then-girlfriend was writing a PhD thesis in German about Russian linguistics. She had fonts for both alphabets, but due to technical limitations the different letters had to share the same code points. And at one point the correct formatting somehow got lost in her word processor... A complete and utter disaster! You are not serious, are you? Charlie * Naena Guru naenag...@gmail.com [2013-01-08 22:56]: […]
Re: Interoperability is getting better ... What does that mean?
Naena Guru, Tue, 8 Jan 2013 15:56:52 -0600: The statement that the death of most character sets makes everyone's systems smaller and faster is *FALSE*. Compare the sizes of the following two files, which are copies of a newspaper article. The top part in red has a few more words of romanized Singhala in the romanized Singhala file. Notice the size of each file: 1. http://ahangama.com/jc/uniSinDemo.htm size: 38,092 bytes 2. http://ahangama.com/jc/RSDemo.htm size: 18,922 bytes [ … ] Again *demonstrably WRONG* To double-check your statement, I saved the above two pages in Safari’s webarchive format[1] and compared the resulting size of each archive file. The benefit of doing such a comparison is that we then get to count both the HTML page *plus* all the extra fonts that are included in the romanized Singhala file. Thus, we get a more *real* basis for comparing the relative size of the two pages. Here are the results: 1. http://ahangama.com/jc/uniSinDemo.htm, webarchive size: 205 459 bytes 2. http://ahangama.com/jc/RSDemo.htm, webarchive size: 223 201 bytes As you can see, the romanized Singhala file loses - it becomes bigger than the UTF-8 version. I suppose the reason is that for the romanized Singhala file, the browser has to download fonts in order to display the romanized Singhala. (I tried to do the same in Firefox, using its ability to save the complete page, but for some reason it did not work.) I also ran a test on both pages with the YSlow service.[2] Here is the total weight of each page, according to YSlow, when run from Firefox: 1. http://ahangama.com/jc/uniSinDemo.htm, YSlow size: 92.7K 2. http://ahangama.com/jc/RSDemo.htm, YSlow size: 65.7K And here are the YSlow results from Safari: 1. http://ahangama.com/jc/uniSinDemo.htm, YSlow size: 11.2K 2. http://ahangama.com/jc/RSDemo.htm, YSlow size: 9.0K Rather interesting that Safari and Firefox differ that much. 
But anyhow, the YSlow results are pretty clear, and demonstrate that while the romanized Singhala page is smaller, it is only between 20 and 30 percent smaller than the Unicode page. However, despite the slightly bigger size, YSlow in Firefox (I don't know how to see it in Safari) *still* reported that the Unicode page loaded faster! Furthermore, when I inspected the source code of these two documents, I discovered that for the Unicode file you included *two* downloadable fonts, whereas for the romanized Singhala page you included only *one* downloadable font. (Why? Because both files actually contain some romanized Singhala!) Before we can *really* take those two test pages seriously, you must make sure that both pages use the same number of fonts! As it is, I strongly suspect that if you had included the same number of downloadable fonts in both pages, the Unicode page would have won. Of course, the romanized Singhala page has many usability problems as well: 1) it doesn't work with screen readers (users will hear the text as Latin text), 2) it doesn’t work with find-in-page search (users will type in Sinhala, but since the content is actually Latin, they won’t find anything on the page), 3) the title of the romanized Singhala page is (I believe) not actually readable as Singhala, 4) there are many browsers in which the romanized Singhala file will not display: text browsers, Opera, and any browser where CSS is disabled, and 5) you get all kinds of problems with form submission. Conclusion: your claims about the file size advantage of romanized Singhala seem grossly exaggerated, if at all true, based as they are on a test of two files which aren't actually equal when it comes to the extra CSS stuff that they embed. [1] http://en.wikipedia.org/wiki/Webarchive [2] http://yslow.org/ -- leif halvard silli
Re: Interoperability is getting better ... What does that mean?
Asmus Freytag, Mon, 31 Dec 2012 06:44:44 -0800: On 12/31/2012 3:27 AM, Leif Halvard Silli wrote: Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800: The Web archive for this very list needs a fix as well … The way to formally request any action by the Unicode Consortium is via the contact form (found on the home page). Good idea. Done! It turned out to be - it seems to me - only an issue of mislabeling the monthly index pages as ISO-8859-1 instead of UTF-8, whereas the messages themselves are archived correctly. And thus I made the request that they properly label the index pages. Happy new year! -- leif h silli
Re: Interoperability is getting better ... What does that mean?
In the HTML 4 days, ISO-8859-1 was the default character set for pages that used an SBCS (characters belonging to Basic Latin and Latin-1 Supplement). At least that is what the Validator (http://validator.w3.org/) said. (By the way, Unicode is quietly suppressing the Basic Latin block by removing it from the Latin group at the top of the code block page ( http://www.unicode.org/charts/) and hiding it under different names in the lower part of the page.) Now the validator complains, correctly, that some characters in those pages do not belong to ISO-8859-1 if you use bullet points, ellipses, etc. It says they come from Windows-1252. That is true. If you declare these pages as UTF-8, then it throws off *all* Latin-1 characters and the web pages show the character-not-found glyph. Windows-1252 replaces the C1 control codes (0x80–0x9F) of Latin-1 with some common characters used by Western European languages and some punctuation marks. There is one main consideration in the mind of the web developer: make the file as small as possible. Try this: make a text file in Windows Notepad and save it in the ANSI, Unicode and UTF-8 formats. The ANSI file (Windows-1252) will be the smallest. Why should people make their pages larger just to satisfy some people's idea of perfection? It reminds me of the Plain Text and language detection myths. On Mon, Dec 31, 2012 at 8:44 AM, Asmus Freytag asm...@ix.netcom.com wrote: On 12/31/2012 3:27 AM, Leif Halvard Silli wrote: Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800: The Web archive for this very list, needs a fix as well … The way to formally request any action by the Unicode Consortium is via the contact form (found on the home page). A./
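The Notepad experiment described above can be reproduced in Python, where the byte counts are explicit. A small sketch (the sample string is mine): Notepad's "ANSI" is Windows-1252, its "Unicode" is UTF-16 with a byte order mark, and its UTF-8 mode historically also wrote a 3-byte BOM.

```python
# Byte counts for the same text in Notepad's three save formats.
text = "déjà vu"  # 7 characters, two of them non-ASCII

ansi = text.encode("cp1252")     # "ANSI": 1 byte per character
utf16 = text.encode("utf-16")    # "Unicode": 2-byte BOM + 2 bytes/char
utf8 = text.encode("utf-8-sig")  # "UTF-8": 3-byte BOM + 1-2 bytes/char

print(len(ansi), len(utf16), len(utf8))  # 7 16 12
```

For Latin-script text the ANSI file is indeed the smallest, which is the point being made; the catch, raised in the replies, is that this comparison only works for text that fits in the 8-bit set at all.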
Re: Interoperability is getting better ... What does that mean?
On Tue, Jan 1, 2013 at 3:53 PM, Naena Guru naenag...@gmail.com wrote: Now the validator complains correctly that some characters in those pages do not belong to ISO-8859-1, if you use bullet points, ellipses etc. It says they come from Windows-1252. That is true. If you declare these pages as UTF-8, then it throws off *all* Latin-1 characters and the web pages show the character-not-found glyph. And if I declare myself an English citizen, getting through borders takes a lot longer. You have to declare the pages to be what they are, which means converting all the characters to the proper character set. There is one main consideration in the mind of the web developer: Make the file as small as possible. I'm looking at Gmail, and found that its pretty wood background is half a megabyte. If you still believe that squeezing out every last byte is the goal, I suggest that modern web design has passed you by. Looking at most modern websites, they're spending hundreds or thousands of kilobytes to communicate stuff that could have been done in far less space. The Wikipedia front page is downloading 680K of JavaScript; try minimizing that before messing with anything else. Try this: Make a text file in Windows Notepad and save it in ANSI, Unicode and UTF-8 formats. ANSI file (Windows-1252) will be the smallest. Not necessarily. If you're converting Arabic or Greek or Cyrillic to HTML escapes, you're going to take up more space that way. If you're dropping it, well, duh, throwing away data will generally save you space. Why should people make their pages larger just to satisfy some people's idea of perfection? CP-1252 is a perfectly legal web character set, and nobody is going to argue with you if you want to use it in legal ways. (I.e. writing Latin script in it, not Sinhala.) But the death of most character sets makes everyone's systems smaller and faster and more likely to correctly show them the document instead of trash. -- Kie ekzistas vivo, ekzistas espero.
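The point about HTML escapes can be made concrete. A short Python sketch (the Greek sample word is mine): carrying non-Latin text as numeric character references inside a Windows-1252 page costs several times the bytes of simply encoding the same text as UTF-8.

```python
# Bytes needed for a short Greek word when it is (a) written as HTML
# numeric character references inside a Windows-1252 page, versus
# (b) encoded directly as UTF-8.
word = "γλωσσα"  # 'glossa', 6 Greek letters (unaccented for simplicity)

# (a) As &#NNN; escapes - the only way to carry Greek in a cp1252 page.
escaped = "".join("&#%d;" % ord(c) for c in word).encode("cp1252")

# (b) As plain UTF-8 - 2 bytes per Greek letter.
utf8 = word.encode("utf-8")

print(len(escaped), len(utf8))  # 36 12
```

So for non-Latin content, the "ANSI is smallest" rule inverts: the 8-bit page is three times the size, or it drops the data entirely.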
Re: Interoperability is getting better ... What does that mean?
Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800: On 12/30/2012 3:19 PM, Leif Halvard Silli wrote: My feeling is that interoperability is getting better everywhere. But one field which lags behind is e-mail. Especially Web archives of e-mail (for instance, take the WHATwg.org’s web archive). And also some e-mail programs fail to default to UTF-8. Archiving seems to occasionally destroy whatever settings made the original work. I have seen that not only with e-mail, but also with forums that have a separate, archive format. Time to get those tools to move to UTF-8. The Web archive for this very list, needs a fix as well … -- leif h silli
Re: Interoperability is getting better ... What does that mean?
On 12/31/2012 3:27 AM, Leif Halvard Silli wrote: Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800: The Web archive for this very list, needs a fix as well … The way to formally request any action by the Unicode Consortium is via the contact form (found on the home page). A./
Interoperability is getting better ... What does that mean?
Hi Folks, I have heard it stated that, in the context of character encoding and decoding: Interoperability is getting better. Do you have data to back up the assertion that interoperability is getting better? Below is a summary of my understanding of interoperability. Would you inform me of any misunderstandings please? --- Interoperability of Text (i.e., Character Encoding Interoperability) --- Remember how, not long ago, you would visit a web page and see strange characters like this: “Good morning, Dave†You don't see that anymore. Why? The answer is this: Interoperability is getting better. In the context of character encoding and decoding, what does that mean? Interoperability means that you and I interpret (decode) the bytes in the same way. Example: I create a text file, encode all the characters in it using UTF-8, and send the text file to you. Here is a graphical depiction (i.e., glyphs) of the bytes that I send to you: López You receive my text document and interpret the bytes as iso-8859-1. In UTF-8 the ó symbol is a graphical depiction of the LATIN SMALL LETTER O WITH ACUTE character, and it is encoded using these two bytes: C3 B3 But in iso-8859-1 the two bytes C3 B3 are the encoding of two characters: C3 is the encoding of the à character, and B3 is the encoding of the ³ character. Thus you interpret my text as: López We are interpreting the same text (i.e., the same set of bytes) differently. Interoperability has failed. So when we say: Interoperability is getting better. we mean that the number of incidences of senders and receivers interpreting the same bytes differently is decreasing. Let's revisit our first example. You go to a web site and see this: “Good morning, Dave†Here's how that happened: I use Microsoft Word to create a web page containing this text, saved and served as UTF-8: “Good morning, Dave” Notice that I wrapped the greeting in Microsoft smart quotes. You visit my web page.
Suppose your browser is set to interpret all web pages as Windows-1252. In UTF-8 the left smart quote is the three bytes: E2 80 9C In UTF-8 the right smart quote is the three bytes: E2 80 9D In Windows-1252, E2 is the â character, 80 is the € character, and 9C is the œ character, so your browser renders the left smart quote as the three characters “. For the right smart quote, E2 and 80 again come out as â€, but 9D is unassigned in Windows-1252, so nothing meaningful is shown for it. The result is that you see this on your browser screen: “Good morning, Daveâ€
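Both byte-level examples above can be reproduced mechanically. A short Python sketch (note that the smart-quote garbling shown requires the receiver to decode as Windows-1252, where 0x80 is € and 0x9C is œ):

```python
# Decode UTF-8 bytes with the wrong charset to reproduce the two
# kinds of mojibake discussed above.

# Example 1: "López" sent as UTF-8, read as ISO-8859-1.
print("López".encode("utf-8").decode("iso-8859-1"))  # López

# Example 2: a UTF-8 left smart quote (U+201C) read as Windows-1252.
print(b"\xe2\x80\x9c".decode("cp1252"))  # “

# The right smart quote (U+201D) fails outright: its last byte 0x9D is
# unassigned in Windows-1252, so strict decoding raises an error.
try:
    b"\xe2\x80\x9d".decode("cp1252")
except UnicodeDecodeError:
    print("0x9D has no character in Windows-1252")
```

(Browsers are more forgiving than Python's strict codec and display something for 0x9D anyway, which is why the page still shows â€ for the right quote.)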
Re: Interoperability is getting better ... What does that mean?
On 12/30/2012 1:22 PM, Costello, Roger L. wrote: Hi Folks, I have heard it stated that, in the context of character encoding and decoding: Interoperability is getting better. Do you have data to back up the assertion that interoperability is getting better? The number of times that I receive e-mail or open web sites in other languages or scripts WITHOUT seeing garbled characters or boxes has definitely increased for me. That would be my personal observation. More people are sending me material in other scripts and languages, whether on this list or via social media. Interoperability as measured in those terms has clearly improved as well; again, as experienced personally. I still see the occasional garbled characters, most often because of a Latin-1/Latin-15 mismatch with UTF-8. Interoperability is not perfect. There's also no real reason to continue to create material in those 8-bit sets, especially if the data then gets mislabeled as UTF-8 (or sometimes vice versa). In my experience, the rate of incidence for these problems appears to be going down as well, but I'm personally not running an actual count. I can imagine that there are places (and software configurations) that expose some users to higher rates of incidence than I am experiencing. Rather than dissecting general statements such as whether interoperability is getting better or not, it seems more productive to address specific shortcomings of particular content providers or tools. In the final analysis, what counts is whether users can send and receive text with the lowest possible rate of problems - and if that requires a transition away from certain legacy practices, it would be important to focus energies on making sure that such a transition takes place. A./
Re: Interoperability is getting better ... What does that mean?
2012-12-30 23:22, Costello, Roger L. wrote: I have heard it stated that, in the context of character encoding and decoding: Interoperability is getting better. Where? It seems that this is what *you* are saying. Do you have data to back up the assertion that interoperability is getting better? Do you? Below is a summary of my understanding of interoperability. This seems to revolve around just the encoding of web pages, specifically the problem that sometimes the encoding has not been properly declared. I haven’t seen any data on the relative frequency of such problems, and I don’t know what such data would be useful for. But in my experience, such problems have become more common, mainly because people use different encodings. One reason is that people think UTF-8 is favored but don’t quite know how to use it, e.g. declaring UTF-8 but using an authoring tool that does not actually produce UTF-8 encoded data. Yucca
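The "declaring UTF-8 but not producing it" failure mode mentioned above is easy to demonstrate. The following is a hypothetical sketch in Python: a page declares UTF-8, but the authoring tool has silently saved Latin-1 bytes, and strict decoding exposes the mismatch.

```python
# A page declares UTF-8, but the authoring tool actually saved Latin-1.
declared = "utf-8"
saved_bytes = "Señor".encode("latin-1")  # 'ñ' becomes the single byte F1

# Strict decoding with the declared encoding reveals the mislabeling:
# byte F1 would have to start a multi-byte UTF-8 sequence, and the byte
# that follows it here is plain ASCII, so decoding fails.
try:
    saved_bytes.decode(declared)
    print("bytes match the declared encoding")
except UnicodeDecodeError:
    print("declared UTF-8, but the bytes are not valid UTF-8")
```

A browser in the same situation has no exception to raise; it just renders the wrong characters.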
Re: Interoperability is getting better ... What does that mean?
On 2012-12-30 at 17:41, Jukka K. Korpela wrote: 2012-12-30 23:22, Costello, Roger L. wrote: I have heard it stated that, in the context of character encoding and decoding: Interoperability is getting better. Where? It seems that this is what *you* are saying. Do you have data to back up the assertion that interoperability is getting better? Do you? Below is a summary of my understanding of interoperability. This seems to revolve around just the encoding of web pages, specifically the problem that sometimes the encoding has not been properly declared. I haven’t seen any data on the relative frequency of such problems, and I don’t know what such data would be useful for. But in my experience, such problems have become more common, mainly because people use different encodings. One reason is that people think UTF-8 is favored but don’t quite know how to use it, e.g. declaring UTF-8 but using an authoring tool that does not actually produce UTF-8 encoded data. Yucca Not my experience. I agree with Asmus that, overall, things are getting better. Marc.
Re: Interoperability is getting better ... What does that mean?
Jukka K. Korpela, Mon, 31 Dec 2012 00:41:41 +0200: 2012-12-30 23:22, Costello, Roger L. wrote: I have heard it stated that, in the context of character encoding and decoding: Interoperability is getting better. [ … ] This seems to revolve around just the encoding of web pages, specifically the problem that sometimes the encoding has not been properly declared. I haven’t seen any data on the relative frequency of such problems, and I don’t know what such data would be useful for. But in my experience, such problems have become more common, mainly because people use different encodings. One reason is that people think UTF-8 is favored but don’t quite know how to use it, e.g. declaring UTF-8 but using an authoring tool that does not actually produce UTF-8 encoded data. My feeling is that interoperability is getting better everywhere. But one field that lags behind is e-mail - especially web archives of e-mail (for instance, take WHATwg.org’s web archive). Also, some e-mail programs fail to default to UTF-8. Interop is getting better because 1. we are moving towards one encoding (UTF-8); 2. an aspect of 1 is that we put more restrictions on ourselves - we respect the conventions (e.g. HTML5 blesses Windows-1252 as the real default); 3. we understand the problem(s) better. (E.g. I used to think that it was good if a tool supported multiple encodings - and in a way it is good, but it is much more important that the tool defaults to UTF-8.) It would probably be most productive to file bugs against each and every tool that doesn’t default to UTF-8. -- leif halvard silli
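The fallback behavior Leif alludes to - HTML5 treating Windows-1252 as the effective default when nothing is declared - can be sketched as a decoding policy. This is a rough illustration in Python, not the full HTML5 encoding-sniffing algorithm: try strict UTF-8 first, and fall back to Windows-1252 when the bytes are not valid UTF-8.

```python
def guess_decode(raw: bytes) -> tuple[str, str]:
    """Decode undeclared bytes: strict UTF-8 first, Windows-1252 fallback.

    A simplified stand-in for what browsers do with unlabeled content,
    returning the decoded text and the encoding that was used.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("windows-1252", errors="replace"), "windows-1252"

print(guess_decode("naïve".encode("utf-8")))    # valid UTF-8 wins
print(guess_decode("naïve".encode("latin-1")))  # falls back to windows-1252
```

Note the asymmetry that makes UTF-8 a safe first guess: legacy 8-bit text is rarely valid UTF-8 by accident, whereas any byte sequence decodes "successfully" as Windows-1252.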
Re: Interoperability is getting better ... What does that mean?
On 12/30/2012 3:19 PM, Leif Halvard Silli wrote: My feeling is that interoperability is getting better everywhere. But one field that lags behind is e-mail - especially web archives of e-mail (for instance, take WHATwg.org’s web archive). Also, some e-mail programs fail to default to UTF-8. Archiving seems to occasionally destroy whatever settings made the original work. I have seen that not only with e-mail, but also with forums that have a separate archive format. Time to get those tools to move to UTF-8. A./