Re: Interoperability is getting better ... What does that mean?

2013-01-09 Thread Jukka K. Korpela

2013-01-09 2:55, Leif Halvard Silli wrote:

 The benefit of doing such a comparison is that we then get to

count both the HTML page *plus* all the extra fonts that are included in
the romanized Singhala file. Thus, we get a more *real* basis for
comparing the relative size of the two pages.


Not really. I don’t want to comment on “romanized Singhala” any more, but I 
can’t leave a different fallacy uncommented.


When comparing the sizes of web pages, it is clearly not sufficient to 
compare the HTML pages alone. It is not uncommon to have just a few 
kilobytes of HTML but loads of JavaScript and images, totalling a 
megabyte or more. This makes it relatively irrelevant whether some 
characters occupy one byte or two bytes. (Besides, HTML often gets 
automatically compressed for transmission.)
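
As a rough sketch of the compression point, in Python (illustration only; 
the sample strings are invented and not taken from any of the pages 
discussed in this thread):

    import gzip

    # Comparable amounts of text in an 8-bit encoding and in UTF-8 Sinhala.
    latin = ("mee laþingaþa síhalayi. " * 200).encode("windows-1252")
    sinhala = ("සිංහල " * 200).encode("utf-8")   # Sinhala letters, 3 bytes each in UTF-8

    for label, data in (("8-bit Latin", latin), ("UTF-8 Sinhala", sinhala)):
        print(label, len(data), "bytes raw,", len(gzip.compress(data)), "bytes gzipped")

With this (admittedly repetitive) sample, the gap between the two gzipped 
sizes is far smaller than the gap between the raw byte counts.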


But if we count font files as well, we should count them in all 
alternatives being compared. Although you can, in principle, write e.g. 
a web page in Sinhala by simply providing the text content, sitting back 
and expecting browsers to render it using whatever fonts they prefer, 
that’s a very unrealistic approach in practice. It would work for 
English (though few web content providers do that – they mostly want to 
set fonts), but for Sinhala, it would mean that a very large part of 
users (possibly the majority) would not see the Sinhala letters. The 
reason is that their computers lack any font that contains them. (Well, 
not the only reason, but the most common one.)


So in order to make (almost) all visitors see the content OK, the author 
of a Sinhala page should probably provide a downloadable font, via 
@font-face, that contains Sinhala letters (as a Unicode encoded font). 
Another option is to link to a font that the visitor can download and 
install, and this is what e.g. the site of the Parliament of Sri Lanka 
http://www.parliament.lk/ does, but the more modern way of using 
@font-face is much smoother and does not disturb the visitor with 
technicalities (and, besides, not all users can install fonts).


And, to be fair, Unicode-encoded fonts that contain Sinhala letters tend 
to be considerably larger than 8-bit ad-hoc encoded fonts. Then again, 
these days, size does not matter that much, and a downloadable font gets 
cached, and a Unicode-encoded font typically contains a much richer 
repertoire of characters, so that characters from different scripts 
(like Sinhala, English, and Common-script characters) have been designed 
to fit together.


Yucca









Re: Interoperability is getting better ... What does that mean?

2013-01-09 Thread Leif Halvard Silli
Jukka K. Korpela, Wed, 09 Jan 2013 11:03:28 +0200:
 2013-01-09 2:55, Leif Halvard Silli wrote:
 
 The benefit of doing such a comparison is that we then get to
 count both the HTML page *plus* all the extra fonts that are included in
 the romanized Singhala file. Thus, we get a more *real* basis for
 comparing the relative size of the two pages.
 
 Not really. I don’t want to comment on “romanized Singhala” any more, 
 but I can’t leave a different fallacy uncommented.

Not sure which fallacy you have identified - see below.

 When comparing the sizes of web pages, it is clearly not sufficient to 
 compare the HTML pages alone. It is not uncommon to have just a few 
 kilobytes of HTML but loads of JavaScript and images, totalling 
 a megabyte or more. This makes it relatively irrelevant whether some 
 characters occupy one byte or two bytes. (Besides, HTML often gets 
 automatically compressed for transmission.)

On this we agree.

 But if we count font files as well, we should count them in all 
 alternatives being compared. Although you can, in principle, write 
 e.g. a web page in Sinhala by simply providing the text content, 
 sitting back and expecting browsers to render it using whatever fonts 
 they prefer, that’s a very unrealistic approach in practice. It 
 would work for English (though few web content providers do that – 
 they mostly want to set fonts), but for Sinhala, it would mean that a 
 very large part of users (possibly the majority) would not see the 
 Sinhala letters. The reason is that their computers lack any font 
 that contains them. (Well, not the only reason, but the most common 
 one.)

In Opera for Mac OS X, the @font-face embedding apparently did not work. 
Thus, the romanized Singhala page did not render any Sinhala at all, 
whereas in the Unicode version one got to see Sinhala letters, 
though with many gaps, so the Sinhala rendering on Mac OS X 
could indeed be wrong. OTOH, when I disabled CSS in a WebKit browser 
(iCab), the page rendered just fine, so I am not sure why it failed in 
Opera - could be a more complicated reason.
  […]
 And, to be fair, Unicode-encoded fonts that contain Sinhala letters 
 tend to be considerably larger than 8-bit ad-hoc encoded fonts. Then 
 again, these days, size does not matter that much, and a downloadable 
 font gets cached, and a Unicode-encoded font typically contains a 
 much richer repertoire of characters, so that characters from 
 different scripts (like Sinhala, English, and Common-script 
 characters) have been designed to fit together.

It is indeed true that for the two pages in question, the Unicode 
font was a little larger than the romanized Singhala font. However, 
as I said, the Unicode test page included two fonts, namely the 
romanized Singhala font and a Unicode Singhala font, whereas the 
other page only included one font - the romanized Singhala font. 
Despite this difference, the Unicode Singhala page came out 
pretty well compared with the romanized Singhala page: it seemingly 
loaded faster - for some reason. It also resulted in a smaller WebKit 
webarchive. And, according to YSlow, it only contained 20-30% more total 
weight than the romanized Singhala page.

It is of course true that it is debatable how much larger size really 
matters — I did not say that it did or did not matter. But regardless: 
there are authors, and authoring tools such as YSlow, that focus on 
exactly that issue: size, and how it and other factors affect page load 
and other aspects of the user experience. And it was interesting to 
see that even from that angle, the Unicode Singhala page seemed more 
like the winner than the loser.
-- 
leif halvard silli




Re: Interoperability is getting better ... What does that mean?

2013-01-09 Thread Jukka K. Korpela

2013-01-09 11:57, Leif Halvard Silli wrote:


Not sure which fallacy you have identified - see below.


I was referring to a comparison between an ad hoc 8-bit encoding and a 
Unicode encoding where you count the sizes of font files in the first 
case only. I’m a bit confused by your comparison, which seems to deal with 
a page that uses a downloadable font in both cases but uses some rather 
obscure fonts (from a site that has no main page etc.). In any case, my 
point might not apply to your specific comparison, but it applies to the 
general scenario:


When you use a “fontistic trick”, based on the use of a font that 
arbitrarily maps bytes 0–255 to whatever glyphs are wanted this time, 
the font is a necessary ingredient of the soup. When using Unicode 
encoding for character data, you do not depend on any specific font, but 
the data still has to be rendered in *some* font. And the more rare 
characters you use, in terms of coverage in fonts normally available in 
computers worldwide, the more you will be in practice forced to deal 
with the font issue, not just for typography, but for having the 
characters displayed at all. And this quite often means that you need to 
embed a font (in a Word or PDF document), or to write an elaborate 
font-family list in CSS, or to use @font-face. Besides, on web pages, 
you normally need to provide a downloadable font in three or four 
formats to be safe.


So, quite often, the size of data is increased – actually more due to 
the size of fonts than due to character encoding issues. But in a vast 
majority of cases, this price is worth paying. After all, if saving bits 
were our only concern, we would be using a 5-bit code. ☺


Yucca




Re: Interoperability is getting better ... What does that mean?

2013-01-09 Thread Philippe Verdy
2013/1/9 Jukka K. Korpela jkorp...@cs.tut.fi:
 And, to be fair, Unicode-encoded fonts that contain Sinhala letters tend to
 be considerably larger than 8-bit ad-hoc encoded fonts. Then again, these
 days, size does not matter that much, and a downloadable font gets cached,

Size does matter when you're using mobile Internet access, because
data plans are now frequently limited in volume, and cost a lot when
you're roaming abroad with your smartphone (even within Europe where
those roaming access fees have been slightly limited) and don't have
access to a suitable free local Wi-Fi hotspot!

Anyway, if size matters, no website should ever depend on downloadable
fonts to be displayable.

I cannot use and display the romanized Sinhala site on my smartphone,
I just get garbage. But true Unicoded Sinhala pages work very well,
without additional cost.



Re: Interoperability is getting better ... What does that mean?

2013-01-08 Thread Naena Guru
Thank you for commenting and Happy New Year.

CP-1252 is a perfectly legal web character set, and nobody is going to
argue with you if you want to use it in legal ways. (I.e. writing
Latin script in it, not Sinhala.) But .

Okay, what is implied is that I am doing something illegal. Define what I am
doing that is illegal, and cite the rule and its purpose: preventing what
harm to whom?

May I ask if the following two are Latin script, English or Singhala?

1. This is written in English.
2. mee laþingaþa síhalayi.


For me, both are Latin script; 1 is English and 2 is Singhala (it says,
'this is romanized Singhala').

The following are the *only* Singhala-language web pages that pass HTML
validation (Challenge me):
http://www.lovatasinhala.com/
They are in romanized Singhala.

The statement,

the death of most character sets makes everyone's systems smaller and
faster

is *FALSE*. Compare the sizes of the following two files, which are copies of
a newspaper article. The top part in red has a few more words in romanized
Singhala in the romanized Singhala file. Notice the size of each file:
1. http://ahangama.com/jc/uniSinDemo.htm  size: 38,092 bytes
2. http://ahangama.com/jc/RSDemo.htm  size: 18,922 bytes
As the page grows, the Unicode Sinhala version tends toward double the size
of its romanized Singhala version. Unicode Sinhala characters become 50%
larger when UTF-8 encoded for transmission (three bytes per letter instead
of two). That is three times the size of the romanized Singhala file. So, the
Unicode Sinhala file consumes 3 times the bandwidth needed to send the
romanized Singhala file.
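
The per-character arithmetic behind this can be checked in a couple of lines
of Python (a sketch only; the sample letters are arbitrary):

    sinhala_letter = "ස"   # a letter from the Sinhala block (U+0D80–U+0DFF)
    latin_letter = "s"

    print(len(sinhala_letter.encode("utf-16-le")))    # 2 bytes in UTF-16
    print(len(sinhala_letter.encode("utf-8")))        # 3 bytes in UTF-8 (50% larger)
    print(len(latin_letter.encode("windows-1252")))   # 1 byte in an 8-bit encoding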

more likely to correctly show them the document instead of trash

Again *demonstrably WRONG*: Unicode Sinhala is trash on a machine that does
not have the fonts. It is also trash if the font used by the OS is
improperly made, as on the iPhone. It is generally trash because the
SLS 1134 standard corrupts at least one writing convention (the 'brandy' issue).
On the other hand, romanized Singhala is always readable whether you have
the font or not. It is not helpful to criticize Singhala-related things
without making a serious effort to understand the issues. Blind men thought
different things about the elephant.

If you mean that everyone should start using 16-bit Unicode characters, I
have no objection to that. It would happen if and when all applications
implement it. I cannot fight that even if I want to. But I do not see users
of English doing anything different from what they are doing now, like my
typing now, I think, using 8-bit characters. (I can verify that by copying
it and pasting it into a text editor.)

I showed that Singhala can be romanized and all the problems of
ill-conceived Unicode Indic can be eliminated by carefully studying the
grammar of the language and romanizing. (I used the word 'transliterate'
earlier, but the correct word is 'transcribe'.) I did it for Singhala and
made an OpenType font to show it perfectly in the traditional Singhala
script. So far, there is one RS smartfont and six Unicode fonts, even after
spending $20M for a foreign expert to tell how to make fonts, though it is
right there on the web in the same language the expert spoke.

My work irritates some, maybe because it is an affront to their belief that
they know all and decide all. Some feel let down that they could not think
of it earlier, and maybe write about a strange discovery like the abugida and
write a book on the nonsense. Most of all, I think it is just a cultural
block on this side of the globe.

As for Lankan technocrats, their worry is that the purpose of ICTA would
come unraveled.  I went there in November and it was revealed to me (by one
of its employees) that its purpose is to provide a single point of contact
for foreign vendors that can use local experts as their advocates.


On Thu, Jan 3, 2013 at 12:56 AM, Leif Halvard Silli 
xn--mlform-...@xn--mlform-iua.no wrote:

 Asmus Freytag, Mon, 31 Dec 2012 06:44:44 -0800:
  On 12/31/2012 3:27 AM, Leif Halvard Silli wrote:
  Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800:
  The Web archive for this very list, needs a fix as well …
 
 
  The way to formally request any action by the Unicode Consortium is
  via the contact form (found on the home page).

 Good idea. Done!

 Turned out to only be - it seems to me - an issue of mislabeling the
 monthly index pages as ISO-8859-1 instead of UTF-8. Whereas the very
 messages themselves are archived correctly. And thus I made the request
 that they properly label the index pages.

 Happy new year!
 --
 leif h silli





Re: Interoperability is getting better ... What does that mean?

2013-01-08 Thread Jukka K. Korpela

2013-01-08 23:56, Naena Guru wrote:


May I ask if the following two are Latin script, English or Singhala?

1. This is written in English.
2. mee laþingaþa síhalayi.

For me, both are Latin script; 1 is English and 2 is Singhala (it says,
'this is romanized Singhala').


Text 2 is “romanized Singhala” only by your private definition, and you 
don’t even mean that. You are not actually promoting the use of Latin 
letters to write Sinhala, but the use of a private 8-bit encoding for 
Sinhala. You expect a font to be used in which the letter “a” is not 
displayed as “a” but as something completely different, a Sinhala 
character.


It seems that your agenda here is something very different from the 
Subject line you use – not about generalities, but about certain 
fontistic trickery.



http://www.lovatasinhala.com/


If you look at the title of the page as displayed in a browser’s tab 
header or equivalent, you see “nivahal heøa”. This is what happens when 
the font trickery fails (because browsers use their fixed fonts to 
display such items).


The trickery is nothing new. It was used even in the days when you had to 
use the font face attribute to set fonts on the web, and at that time the 
trickery was analyzed and found wanting; see e.g.

http://alis.isoc.org/web_ml/html/fontface.en.html
There’s no reason to go into such analyses any more.

If you are happy with this or that trickery and don’t want them to be 
analyzed, just use them. But please don’t expect the rest of the world 
to go back to bad old days.


Yucca





Re: Interoperability is getting better ... What does that mean?

2013-01-08 Thread Charlie Ruland

I for one am so glad we now have Unicode.

I remember when in pre-Unicode days my then-girlfriend was writing a PhD 
thesis in German about Russian linguistics.  She had fonts for both 
alphabets, but due to technical limitations the different letters had to 
share the same code points.  And at one point somehow the correct 
formatting got lost in her word processor...  A complete and utter disaster!


You are not serious, are you?

Charlie


* Naena Guru naenag...@gmail.com [2013-01-08 22:56]:


Thank you for commenting and Happy New Year.

CP-1252 is a perfectly legal web character set, and nobody is going to
argue with you if you want to use it in legal ways. (I.e. writing
Latin script in it, not Sinhala.) But .

Okay, what is implied is I am doing something illegal. Define what I 
am doing that is illegal and cite the rule and its purpose of 
preventing what harm to whom.


May I ask if the following two are Latin script, English or Singhala?

1. This is written in English.
2. mee laþingaþa síhalayi.


For me, both are Latin script and 1 is English and 2 is Singhala 
(says,' this is romanized Singhala').


 The following are the *only* Singhala-language web pages that pass 
HTML validation (Challenge me):

http://www.lovatasinhala.com/
They are in romanized Singhala.

The statement,

the death of most character sets makes everyone's systems smaller
and faster

is *FALSE*. Compare the sizes of the following two files that are 
copies of a newspaper article. The top part in red has few more words 
in romanized Singhala in the romanized Singhala file. Notice the size 
of each file:
1. http://ahangama.com/jc/uniSinDemo.htm  size:38,092 
bytes
2. http://ahangama.com/jc/RSDemo.htm  size:18,922 
bytes
As the size of the page grows, the size of Unicode Sinhala tends to 
double the size relative to its romanized Singhala version. Unicode 
Sinhala characters become 50% larger when UTF-8 encoded 
 for transmission. That is three times the size of the romanized 
Singhala file. So, the Unicode Sinhala file consumes 3 times the 
bandwidth needed to send the romanized Singhala file.


more likely to correctly show them the document instead of trash

Again *demonstrably WRONG*: Unicode Sinhala is trash in a machine that 
does not have the fonts. It is trash also if the font used by the OS 
is improperly made, such as in iPhone. It is generally trash because 
the SLS1134 standard corrupts at least one writing convention. (Brandy 
issue). On the other hand, romanized Singhala is always readable 
whether you have the font or not. It is not helpful to criticize 
Singhala related things without making a serious effort to understand 
the issues. Blind men thought different things about the elephant.


If you mean that everyone should start using 16-bit Unicode 
characters, I have no objection to that. It would happen if and 
when all applications implement it. I cannot fight that even if I want 
to. But I do not see users of English doing anything different to what 
they are doing now, like my typing now, I think, using 8-bit 
characters. (I can verify that by copying it and pasting into a text 
 editor.)


I showed that the Singhala can be romanized and all the problems of 
ill-conceived Unicode Indic can be eliminated by carefully studying 
the grammar of the language and romanizing. (I used the word 
'transliterate' earlier, but the correct word is transcribe). I did it 
for Singhala and made an Open Type font to show it perfectly in the 
traditional Singhala script. So far, one RS smartfont and six Unicode 
fonts even after spending $20M for a foreign expert to tell how to 
make fonts though it is right on the web in the same language the 
expert spoke in.


 My work irritates some, maybe because it is an affront to their belief 
 that they know all and decide all. Some feel let down that they could 
 not think of it earlier, and maybe write about a strange discovery 
 like the abugida and write a book on the nonsense. Most of all, I think it 
 is just a cultural block on this side of the globe.


As for Lankan technocrats, their worry is that the purpose of ICTA 
would come unraveled.  I went there in November and it was revealed to 
me (by one of its employees) that its purpose is to provide a single 
point of contact for foreign vendors that can use local experts as 
their advocates.



On Thu, Jan 3, 2013 at 12:56 AM, Leif Halvard Silli 
xn--mlform-...@xn--mlform-iua.no 
mailto:xn--mlform-...@xn--mlform-iua.no wrote:


Asmus Freytag, Mon, 31 Dec 2012 06:44:44 -0800:
 On 12/31/2012 3:27 AM, Leif Halvard Silli wrote:
 Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800:
 The Web archive for this very list, needs a fix as well …


 The way to formally request any action by the Unicode Consortium is
 via the contact form (found on the home page).

Good idea. Done!

 Turned out to only be - it seems to me - an issue of mislabeling the
 monthly index pages as ISO-8859-1 instead of UTF-8. […]

Re: Interoperability is getting better ... What does that mean?

2013-01-08 Thread Leif Halvard Silli
Naena Guru, Tue, 8 Jan 2013 15:56:52 -0600:

 The statement,
 
 the death of most character sets makes everyone's systems smaller and
 faster
 
 is *FALSE*. Compare the sizes of the following two files that are copies of
 a newspaper article. The top part in red has few more words in romanized
 Singhala in the romanized Singhala file. Notice the size of each file:
 1. http://ahangama.com/jc/uniSinDemo.htm   size:38,092 bytes
 2. http://ahangama.com/jc/RSDemo.htm   size:18,922 bytes
  [ … ]
 Again *demonstrably WRONG*

To double-check your statement, I saved the above two pages in Safari’s 
webarchive format[1] and compared the resulting size of each archive 
file. The benefit of doing such a comparison is that we then get to 
count both the HTML page *plus* all the extra fonts that are included in 
the romanized Singhala file. Thus, we get a more *real* basis for 
comparing the relative size of the two pages. Here are the results:

1. http://ahangama.com/jc/uniSinDemo.htm, webarchive size: 205 459 bytes
2. http://ahangama.com/jc/RSDemo.htm, webarchive size: 223 201 bytes

As you can see, the romanized Singhala file loses - it becomes 
bigger than the UTF-8 version. I suppose the reason for this is that 
for the romanized Singhala file, the webarchive has to include the 
fonts needed to display the romanized Singhala. (I tried to do the 
same in Firefox, using its ability to save the complete page, but 
for some reason it did not work.) 
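
For anyone who wants to reproduce a rough “total weight” comparison without 
Safari, here is a crude Python sketch (standard library only; the url(...) 
parsing is deliberately naive, so treat it as an illustration rather than a 
measurement tool):

    import re
    import urllib.request
    from urllib.parse import urljoin

    def total_weight(page_url):
        html = urllib.request.urlopen(page_url).read()
        total = len(html)
        # Very crude: pick up url(...) references in the inline CSS
        # (downloadable fonts, background images, and so on).
        for ref in re.findall(rb'url\(["\']?([^"\')]+)', html):
            resource = urljoin(page_url, ref.decode("ascii", "ignore"))
            try:
                total += len(urllib.request.urlopen(resource).read())
            except OSError:
                pass  # skip resources that no longer resolve
        return total

    for url in ("http://ahangama.com/jc/uniSinDemo.htm",
                "http://ahangama.com/jc/RSDemo.htm"):
        print(url, total_weight(url), "bytes")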

I also ran a test on both pages with the YSlow service.[2] Here is the 
total weight of each page, according to YSlow, when run from Firefox:

1. http://ahangama.com/jc/uniSinDemo.htm, YSlow size: 92.7K
2. http://ahangama.com/jc/RSDemo.htm, YSlow size: 65.7K

And here are the YSlow results from Safari:

1. http://ahangama.com/jc/uniSinDemo.htm, YSlow size: 11.2K
2. http://ahangama.com/jc/RSDemo.htm, YSlow size:  9.0K

Rather interesting that Safari and Firefox differ that much. But 
anyhow, the YSlow results are pretty clear, and demonstrate that while 
the romanized Singhala page is smaller, it is only between 20 and 30 
percent smaller than the Unicode page. 

However, despite the slightly bigger size, YSlow in Firefox (don't know 
how to see it in Safari) *still* reported that the Unicode page loaded 
faster!

Furthermore, when I inspected the source code of these two documents, 
I discovered that for the Unicode file, you included *two* 
downloadable fonts, whereas for the romanized Singhala page, you only 
included *one* downloadable font. (Why? Because both files actually 
contain some romanized Singhala!) Before we can *really* take those 
two test pages seriously, you must make sure that both pages use the 
same number of fonts! As it is, I strongly suspect that if you had 
included the same downloadable fonts in both pages, then the 
Unicode page would have won.

Of course, the romanized Singhala page has many usability problems as 
well: 1) It doesn't work with screen readers (users will hear the text 
as Latin text), 2) it doesn’t work with Find-in-page search (users will 
type in Sinhala, but since the content is actually Latin, they won’t 
find anything on the page), 3) the title of the romanized Singhala 
page is (I believe) not actually readable as Singhala, 4) there are 
many browsers in which the romanized Singhala file will not display: 
text browsers, Opera, and any browser where CSS is disabled. 5) You get 
all kinds of problems with form submission.

Conclusion: Your claims about the file size advantage of romanized 
Singhala seem grossly exaggerated, if at all true, based as they are 
on a test of two files which aren’t actually equal when it comes to the 
extra CSS stuff that they embed.

[1] http://en.wikipedia.org/wiki/Webarchive
[2] http://yslow.org/
-- 
leif halvard silli




Re: Interoperability is getting better ... What does that mean?

2013-01-02 Thread Leif Halvard Silli
Asmus Freytag, Mon, 31 Dec 2012 06:44:44 -0800:
 On 12/31/2012 3:27 AM, Leif Halvard Silli wrote:
 Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800:
 The Web archive for this very list, needs a fix as well … 
 
 
 The way to formally request any action by the Unicode Consortium is 
 via the contact form (found on the home page).

Good idea. Done!

Turned out to only be - it seems to me - an issue of mislabeling the 
monthly index pages as ISO-8859-1 instead of UTF-8. Whereas the very 
messages themselves are archived correctly. And thus I made the request 
that they properly label the index pages.

Happy new year!
-- 
leif h silli




Re: Interoperability is getting better ... What does that mean?

2013-01-01 Thread Naena Guru
It used to be that in the HTML 4 days, ISO-8859-1 was the default character
set for pages that used an SBCS (characters from Basic Latin and Latin-1
Supplement). At least that is what the Validator (http://validator.w3.org/)
said.

(By the way, Unicode is quietly suppressing the Basic Latin block by removing
it from the Latin group at the top of the code charts page (
http://www.unicode.org/charts/) and hiding it under different names in the
lower part of the page.)

Now the validator complains, correctly, that some characters in those pages
do not belong to ISO-8859-1 if you use bullet points, ellipses, etc. It says
they come from Windows-1252. That is true. If you declare these pages as
UTF-8, then it rejects *all* the Latin-1 characters and the web pages show
the character-not-found glyph.

Windows-1252 replaces the 32 C1 control codes (0x80–0x9F) of Latin-1
with some common punctuation marks and letters used by Western and Central
European languages.

There is one main consideration in the mind of the web developer: make the
file as small as possible. Try this: make a text file in Windows Notepad
and save it in ANSI, Unicode (UTF-16) and UTF-8 formats. The ANSI file
(Windows-1252) will be the smallest. Why should people make their pages
larger just to satisfy some people's idea of perfection? It reminds me of
the Plain Text and language detection myths.
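
The same comparison can be sketched in Python without Notepad (the sample
text is made up):

    text = "Mostly ASCII text with a few “smart quotes”, a bullet • and an ellipsis …"

    print(len(text.encode("windows-1252")))   # 1 byte per character (the "ANSI" case)
    print(len(text.encode("utf-8")))          # 1 byte for ASCII, 3 bytes for “ ” • …
    print(len(text.encode("utf-16")))         # 2 bytes per character, plus a 2-byte BOM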


On Mon, Dec 31, 2012 at 8:44 AM, Asmus Freytag asm...@ix.netcom.com wrote:

 On 12/31/2012 3:27 AM, Leif Halvard Silli wrote:

 Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800:
 The Web archive for this very list, needs a fix as well …



 The way to formally request any action by the Unicode Consortium is via
 the contact form (found on the home page).

 A./




Re: Interoperability is getting better ... What does that mean?

2013-01-01 Thread David Starner
On Tue, Jan 1, 2013 at 3:53 PM, Naena Guru naenag...@gmail.com wrote:
 Now the validator complains correctly that some characters in those pages do
 not belong to ISO-8859-1, if you use bullet points, ellipse etc. It says
 they come from Windows-1252. That is true. If you declare these pages as
 UTF-8, then it throws off *all* Latin-1 characters and the web pages show
 character-not-found glyph.

And if I declare myself an English citizen, getting through borders
takes a lot longer. You have to declare the pages to be what they are,
which means converting all the characters to be the proper character
set.

 There is one main consideration in the mind of the web developer: Make the
 file as small as possible.

I'm looking at Gmail, and I found that its pretty wood background is half
a megabyte. If you still believe that squeezing every last byte is the
goal, I suggest that modern web design has passed you by. Looking at
most modern websites, they're spending hundreds or thousands of
kilobytes to communicate stuff that could have been done in way less
space. The Wikipedia front page is downloading 680K of JavaScript; try
minimizing that before messing with anything else.

 Try this: Make a text file in Windows Notepad and
 save it in ANSI, Unicode and UTF-8 formats. ANSI file (Windows-1252) will be
 the smallest.

Not necessarily. If you're converting Arabic or Greek or Cyrillic to
HTML escapes, you're going to take up more space that way. If you're
dropping it, well, duh, throwing away data will generally save you
space.
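
A quick Python sketch of that point (the Greek word is just an arbitrary
example):

    greek = "Ελληνικά"
    as_utf8 = greek.encode("utf-8")
    as_refs = "".join("&#{};".format(ord(c)) for c in greek).encode("ascii")
    print(len(as_utf8), len(as_refs))   # 16 bytes as UTF-8, 48 bytes as HTML escapes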

 Why should people make their pages larger just to satisfy some
 peoples idea of perfection?

CP-1252 is a perfectly legal web character set, and nobody is going to
argue with you if you want to use it in legal ways. (I.e. writing
Latin script in it, not Sinhala.) But the death of most character sets
makes everyone's systems smaller and faster and more likely to
correctly show them the document instead of trash.

-- 
Kie ekzistas vivo, ekzistas espero. (Where there is life, there is hope.)



Re: Interoperability is getting better ... What does that mean?

2012-12-31 Thread Leif Halvard Silli
Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800:
 On 12/30/2012 3:19 PM, Leif Halvard Silli wrote:
 My feeling is that interoperability is getting better everywhere. 
 But one field which lags behind is e-mail. Especially Web archives of
 e-mail (for instance, take the WHATwg.org’s web archive). And also 
 some e-mail programs fail to default to UTF-8.
 
 Archiving seems to occasionally destroy whatever settings made the 
 original work. I have seen that not only with e-mail, but also with 
 forums that have a separate, archive format.
 
 Time to get those tools to move to UTF-8.

The Web archive for this very list needs a fix as well …
-- 
leif h silli




Re: Interoperability is getting better ... What does that mean?

2012-12-31 Thread Asmus Freytag

On 12/31/2012 3:27 AM, Leif Halvard Silli wrote:

Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800:
The Web archive for this very list, needs a fix as well … 



The way to formally request any action by the Unicode Consortium is via 
the contact form (found on the home page).


A./



Interoperability is getting better ... What does that mean?

2012-12-30 Thread Costello, Roger L.
Hi Folks,

I have heard it stated that, in the context of character encoding and decoding:

Interoperability is getting better.

Do you have data to back up the assertion that interoperability is getting 
better?

Below is a summary of my understanding of interoperability. Would you inform me 
of any misunderstandings please?

---
Interoperability of Text (i.e., Character Encoding Interoperability)
---
Remember not long ago you would visit a web page and see strange characters 
like this:

“Good morning, Daveâ€

You don't see that anymore. 

Why?

The answer is this:

Interoperability is getting better.

In the context of character encoding and decoding, what does that mean?

Interoperability means that you and I interpret (decode) the bytes in the same 
way.

Example: I create a text file, encode all the characters in it using UTF-8, and 
send the text file to you. 

Here is a graphical depiction (i.e., glyphs) of the bytes that I send to you:

López

You receive my text document and interpret the bytes as iso-8859-1. 

In UTF-8 the ó symbol is a graphical depiction of the LATIN SMALL LETTER O 
WITH ACUTE character and it is encoded using these two bytes: C3 B3

But in iso-8859-1, the two bytes C3 B3 are the encoding of two characters:

 C3 is the encoding of the Ã character
 B3 is the encoding of the ³ character

Thus you interpret my text as:

López

We are interpreting the same text (i.e., the same set of bytes) differently.

Interoperability has failed.
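
The mismatch is easy to reproduce in a few lines of Python (a sketch; the
codec names are the ones Python uses):

    name = "López"
    raw = name.encode("utf-8")                       # the ó becomes the two bytes C3 B3
    print(" ".join(format(b, "02x") for b in raw))   # 4c c3 b3 70 65 7a

    print(raw.decode("utf-8"))        # López  - what the sender meant
    print(raw.decode("iso-8859-1"))   # López - what an ISO-8859-1 reader sees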

So when we say: 

Interoperability is getting better.

we mean that the number of incidents of senders and receivers interpreting the 
same bytes differently is decreasing.

Let's revisit our first example. You go to a web site and see this: 

 “Good morning, Daveâ€

Here's how that happened:

I use Microsoft Word to create a web page containing this text document:

“Good morning, Dave”

Notice that I wrapped the greeting in Microsoft smart quotes, and that the 
page is saved with its characters encoded as UTF-8.

You visit my web page.

Suppose the page does not declare its encoding, and your browser falls back 
to interpreting it as Windows-1252.

In UTF-8 the left smart quote is the three bytes: E2 80 9C

In UTF-8 the right smart quote is the three bytes: E2 80 9D

In Windows-1252, hex E2 is â, hex 80 is €, and hex 9C is œ, while hex 9D is 
not assigned to any character.

So your browser displays the left smart quote (bytes E2 80 9C) as â followed 
by € followed by œ.

And your browser displays the right smart quote (bytes E2 80 9D) as â 
followed by € (the unassigned 9D byte shows nothing, or a box).

The result is that you see this on your browser screen:

“Good morning, Daveâ€
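
The same walk-through in a minimal Python sketch (illustration only): encode
the smart-quoted greeting as UTF-8, then decode those bytes as Windows-1252,
substituting a replacement character for the unassigned 9D byte (where a
browser would typically show nothing, or a box):

    text = "\u201cGood morning, Dave\u201d"   # “Good morning, Dave”
    raw = text.encode("utf-8")                # left quote: E2 80 9C, right quote: E2 80 9D

    garbled = raw.decode("cp1252", errors="replace")
    print(garbled)                            # “Good morning, Dave†plus a replacement char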







Re: Interoperability is getting better ... What does that mean?

2012-12-30 Thread Asmus Freytag

On 12/30/2012 1:22 PM, Costello, Roger L. wrote:

Hi Folks,

I have heard it stated that, in the context of character encoding and decoding:

 Interoperability is getting better.

Do you have data to back up the assertion that interoperability is getting 
better?


The number of times that I receive e-mail or open web sites in other 
languages or scripts WITHOUT seeing garbled characters or boxes has 
definitely increased for me. That would be my personal observation.


More people are sending me material in other scripts and languages, 
whether on this list or via social media. Interoperability as measured 
in those terms has clearly improved as well; again, as experienced 
personally.


I still see the occasional garbled characters, most often because of a 
Latin-1/Latin-15 mismatch with UTF-8. Interoperability is not perfect. 
There's also no real reason to continue to create material in those 
8-bit sets, especially if the data is mislabeled as UTF-8 (or sometimes 
vice versa).


In my experience, the rate of incidence for these appears to be going 
down as well, but I'm personally not running an actual count. I can 
imagine that there are places (and software configurations) that expose 
some users to higher rates of incidence than I am experiencing.


Rather than dissecting general statements such as whether 
Interoperability is getting better or not, it seems more productive to 
address specific shortcomings of particular content providers or tools.


In the final analysis, what counts is whether users can send and receive 
text with the lowest possible rate of problems - and if that requires 
transition away from certain legacy practices, it would be important to 
focus the energies on making sure that such transition takes place.


A./




Re: Interoperability is getting better ... What does that mean?

2012-12-30 Thread Jukka K. Korpela

2012-12-30 23:22, Costello, Roger L. wrote:


I have heard it stated that, in the context of character encoding and decoding:

 Interoperability is getting better.


Where? It seems that this is what *you* are saying.


Do you have data to back up the assertion that interoperability is getting 
better?


Do you?


Below is a summary of my understanding of interoperability.


This seems to revolve around just the encoding of web pages, 
specifically the problem that sometimes the encoding has not been 
properly declared.


I haven’t seen any data on the relative frequency of such problems, and 
I don’t know what such data would be useful for.


But in my experience, such problems have become more common, mainly 
because people are using different encodings. One reason is that people 
think UTF-8 is favored but don’t quite know how to use it, e.g. 
declaring UTF-8 but using an authoring tool that does not actually 
produce UTF-8 encoded data.


Yucca




Re: Interoperability is getting better ... What does that mean?

2012-12-30 Thread Marc Blanchet

On 2012-12-30, at 17:41, Jukka K. Korpela wrote:

 2012-12-30 23:22, Costello, Roger L. wrote:
 
 I have heard it stated that, in the context of character encoding and 
 decoding:
 
 Interoperability is getting better.
 
 Where? It seems that this is what *you* are saying.
 
 Do you have data to back up the assertion that interoperability is getting 
 better?
 
 Do you?
 
 Below is a summary of my understanding of interoperability.
 
 This seems to revolve around just the encoding of web pages, specifically the 
 problem that sometimes the encoding has not been properly declared.
 
 I haven’t seen any data on the relative frequency of such problems, and I 
 don’t know what such data would be useful for.
 
 But in my experience, such problems have become more common, mainly 
 because people are using different encodings. One reason is that people think 
 UTF-8 is favored but don’t quite know how to use it, e.g. declaring UTF-8 but 
 using an authoring tool that does not actually produce UTF-8 encoded data.

Not my experience. I agree with Asmus that, overall, things are getting better.

Marc.

 
 Yucca





Re: Interoperability is getting better ... What does that mean?

2012-12-30 Thread Leif Halvard Silli
Jukka K. Korpela, Mon, 31 Dec 2012 00:41:41 +0200:
 2012-12-30 23:22, Costello, Roger L. wrote:
 
 I have heard it stated that, in the context of character encoding 
 and decoding:
 
  Interoperability is getting better.
 [ … ]
 This seems to revolve around just the encoding of web pages, 
 specifically the problem that sometimes the encoding has not been 
 properly declared.
 
 I haven’t seen any data on the relative frequency of such problems, 
 and I don’t know what such data would be useful for.
 
 But in my experience, such problems have become more common, 
 mainly because people are using different encodings. One reason is that 
 people think UTF-8 is favored but don’t quite know how to use it, 
 e.g. declaring UTF-8 but using an authoring tool that does not 
 actually produce UTF-8 encoded data.

My feeling is that interoperability is getting better everywhere. But 
one field which lags behind is e-mail. Especially Web archives of 
e-mail (for instance, take the WHATwg.org’s web archive). And also some 
e-mail programs fail to default to UTF-8.

Interop is getting better because:

 1. We move towards one encoding (UTF-8).
 2. An aspect of 1. is that we put more restrictions on ourselves -
    we respect the conventions. E.g. HTML5 blesses Win-1252 as the
    real default.
 3. We understand the problem(s) better. (E.g. I used to think that
    it was good if a tool supported multiple encodings - and in a
    way it is good, but … it is much more important that the tool
    defaults to UTF-8.)

The most productive thing is probably to file bugs against each and every 
tool that doesn’t default to UTF-8.
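
As a small Python sketch of what “defaulting to UTF-8” means in practice
(illustration only): relying on the platform default gives different bytes
on different systems, while an explicit encoding does not.

    import locale

    print(locale.getpreferredencoding())   # e.g. cp1252 on many Windows systems

    # Explicit encoding: the same bytes everywhere.
    with open("demo-utf8.txt", "w", encoding="utf-8") as f:
        f.write("mee laþingaþa síhalayi.\n")

    # Platform default: whatever this particular system happens to prefer.
    with open("demo-default.txt", "w") as f:
        f.write("mee laþingaþa síhalayi.\n")
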
-- 
leif halvard silli




Re: Interoperability is getting better ... What does that mean?

2012-12-30 Thread Asmus Freytag

On 12/30/2012 3:19 PM, Leif Halvard Silli wrote:

My feeling is that interoperability is getting better everywhere. But one 
field which lags behind is e-mail. Especially Web archives of
e-mail (for instance, take the WHATwg.org’s web archive). And also some e-mail 
programs fail to default to UTF-8.


Archiving seems to occasionally destroy whatever settings made the 
original work. I have seen that not only with e-mail, but also with 
forums that have a separate, archive format.


Time to get those tools to move to UTF-8.

A./