Re: What does it mean to not be a valid string in Unicode?

2013-01-08 Thread Martin J. Dürst

On 2013/01/08 14:43, Stephan Stiller wrote:


Wouldn't the clean way be to ensure valid strings (only) when they're
built


Of course, the earlier erroneous data gets caught, the better. The 
problem is that error checking is expensive, both in lines of code and 
in execution time (I think there is data showing that in any real-life 
programs, more than 50% or 80% or so is error checking, but I forgot the 
details).


So indeed as Ken has explained with a very good example, it doesn't make 
sense to check at every corner.



and then make sure that string algorithms (only) preserve
well-formedness of input?

Perhaps this is how the system grew, but it seems to me that it's
yet another legacy of C pointer arithmetic and
about convenience of implementation rather than a
safety or performance issue.


Convenience of implementation is an important aspect in programming.

 Things like this are called garbage in, garbage-out (GIGO). It may be
 harmless, or it may hurt you later.
 So in this kind of a case, what we are actually dealing with is:
 garbage in, principled, correct results out. ;-)

Sorry, but I have to disagree here. If a list of strings contains items 
with lone surrogates (garbage), then sorting them doesn't make the 
garbage go away, even if the items may be sorted in correct order 
according to some criterion.


Regards,   Martin.



Re: What does it mean to not be a valid string in Unicode?

2013-01-08 Thread Stephan Stiller
 Wouldn't the clean way be to ensure valid strings (only) when they're
 built


 Of course, the earlier erroneous data gets caught, the better. The problem
 is that error checking is expensive, both in lines of code and in execution
 time (I think there is data showing that in any real-life programs, more
 than 50% or 80% or so is error checking, but I forgot the details).

 So indeed as Ken has explained with a very good example, it doesn't make
 sense to check at every corner.


What I meant: The idea was to check only when a string is constructed. As
soon as it's been fed into a collation/whatever algorithm, the algorithm
should assume the original input was well-formed and shouldn't do any more
error-checking, yes.

Not having facilities for dealing with ill-formed values (U+D800 ..
U+DFFF) in an algorithm will surely make *something* faster, even if it's
just some table that's being used indirectly having fewer entries.

What I had in mind is a library where the public interface only ever allows
Unicode scalar values to be in- and output. This will lead to a cleaner
interface. A data structure that can hold surrogate values can and should
be used algorithm-*internally*, if that makes things more efficient, safer,
etc.
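
To make the idea concrete, here is a minimal sketch in Python of an interface that validates once, at construction. The names is_scalar_value and make_text are hypothetical and chosen only for illustration; they are not from any existing library. The one fact relied on is the definition of a Unicode scalar value: any code point in U+0000..U+10FFFF except the surrogate range U+D800..U+DFFF.

```python
def is_scalar_value(cp: int) -> bool:
    """Return True if cp is a Unicode scalar value.

    Scalar values are all code points U+0000..U+10FFFF
    except the surrogate range U+D800..U+DFFF.
    """
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def make_text(code_points):
    """Hypothetical constructor: check every code point once, up front.

    Algorithms that receive the result can then assume well-formed
    input and skip any further checking.
    """
    cps = list(code_points)
    for index, cp in enumerate(cps):
        if not is_scalar_value(cp):
            raise ValueError(
                f"code point {cp:#x} at index {index} is not a scalar value"
            )
    return "".join(chr(cp) for cp in cps)
```

With this sketch, make_text([0x61, 0xDC00]) raises ValueError at construction time, so nothing downstream ever sees the lone surrogate.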

Convenience of implementation is an important aspect in programming.


For a user yes, but not for a library writer/maintainer, I would suggest.
The STL uses red-black trees; these are annoyingly difficult to implement
but invisible to the user.

Stephan


Re: Q is a Roman numeral?

2013-01-08 Thread Frédéric Grosshans

Le 08/01/2013 01:26, Ben Scarborough a écrit :

This isn't directly related to Unicode, but I thought this would be a
good place to ask.

Specifically, I'm curious about figure 14 (Gordon 1982) from WG2 N3218
[http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3218.pdf], which says:

Whereas our so-called Arabic numerals are ten in number (0–9), the Roman
numerals number nine: I = 1 (one), V = 5, X = 10, L = 50, C = 100,
Đ = 500 (D regularly with middle bar, the modern form being simply D),
a symbol for 1,000 (see below), Q = 500,000, and a rather strange symbol
for 6: ↅ.

Now that Q = 500,000 bit seems a little odd to me. I've never seen
that anywhere else. Does anyone know where it came from? Is there real
usage of Q for 500,000?

Roman numerals have always been more complex than the standard (modern) 
way we've been taught, and their use spans several millennia, over 
which many variations have occurred. If you look at Wikipedia's table for 
the Middle Ages and Renaissance, 
http://en.wikipedia.org/wiki/Roman_numeral#Middle_Ages.2FRenaissance , 
you'll see that many letters of the alphabet have been used as Roman 
numerals. In this table, Q is supposed to stand for 500, but this is not 
necessarily in contradiction with 500,000, since there were several ways 
to go beyond 1000...


As a side note on non-standard Roman numerals, I've seen 80,000 written 
IVXXM (like quatre vingt mille) in an old French edition of the 
Arthurian cycle.


Frédéric




RE: What does it mean to not be a valid string in Unicode?

2013-01-08 Thread Whistler, Ken
 Sorry, but I have to disagree here. If a list of strings contains items
 with lone surrogates (garbage), then sorting them doesn't make the
 garbage go away, even if the items may be sorted in correct order
 according to some criterion.

Well, yeah, I wasn't claiming that the principled, correct output made the 
garbage go away.

Let me put it this way: if my choices are 1) garbage in, garbage reliably 
sorted out into garbage bin, versus 2) garbage in, sorting fails with 
exception, then I'll pick #1. ;-)

To give a concrete example, my implementation of UCA reliably passes the 
SHIFTED test cases in the conformance test, even though those test cases 
(deliberately) contain some ill-formed strings. If I instead did validation 
testing on input strings in my base implementation, it would be slower, *and* 
to pass the conformance test I would have to add a separate preprocessing stage 
that probed all the input data for ill-formed strings and filtered those cases 
out before engaging the test, so that it wouldn't fail with an exception when 
it hit the bad data. 
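
As a rough illustration of the two choices (not the UCA implementation described above, just plain code point ordering in Python), the sketch below sorts a list that contains a lone surrogate. The function names sort_lenient and sort_strict are hypothetical. Python strings may hold lone surrogates and compare by code point, while a strict UTF-8 encode rejects them, which stands in here for up-front validation.

```python
def sort_lenient(strings):
    """Choice 1: sort whatever comes in.

    Python compares str values code point by code point, so a lone
    surrogate simply sorts at its numeric position.
    """
    return sorted(strings)


def sort_strict(strings):
    """Choice 2: validate first, fail with an exception on bad data.

    Encoding to UTF-8 with the default strict error handler raises
    UnicodeEncodeError for any lone surrogate.
    """
    for s in strings:
        s.encode("utf-8")
    return sorted(strings)


data = ["b", "\ud800", "a"]  # one ill-formed item (lone surrogate)
print([ascii(s) for s in sort_lenient(data)])
```

The lenient version returns the well-formed items in order with the ill-formed one sorted after them; the strict version does an extra pass over every string and raises on the same input.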

--Ken





Re: Interoperability is getting better ... What does that mean?

2013-01-08 Thread Naena Guru
Thank you for commenting and Happy New Year.

CP-1252 is a perfectly legal web character set, and nobody is going to
argue with you if you want to use it in legal ways. (I.e. writing
Latin script in it, not Sinhala.) But .

Okay, what is implied is I am doing something illegal. Define what I am
doing that is illegal and cite the rule and its purpose of preventing what
harm to whom.

May I ask if the following two are Latin script, English or Singhala?

1. This is written in English.
2. mee laþingaþa síhalayi.


For me, both are Latin script and 1 is English and 2 is Singhala (it says,
'this is romanized Singhala').

The following are the *only* Singhala language web pages that pass HTML
validation (Challenge me):
http://www.lovatasinhala.com/
They are in romanized Singhala.

The statement,

the death of most character sets makes everyone's systems smaller and
faster

is *FALSE*. Compare the sizes of the following two files that are copies of
a newspaper article. The top part in red has a few more words in romanized
Singhala in the romanized Singhala file. Notice the size of each file:
1. http://ahangama.com/jc/uniSinDemo.htm  size: 38,092 bytes
2. http://ahangama.com/jc/RSDemo.htm  size: 18,922 bytes
As the page grows, the Unicode Sinhala version tends to be double the size
of its romanized Singhala version. Unicode Sinhala characters become 50%
larger when UTF-8 encoded for transmission. That is three times the size
of the romanized Singhala file. So, the Unicode Sinhala file consumes 3
times the bandwidth needed to send the romanized Singhala file.
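
The per-character byte counts behind these figures can be checked with a short Python sketch. The Sinhala sample word is an assumption picked for illustration (it is not taken from the two test pages), and encoded_sizes is a hypothetical helper. Code points in the Sinhala block (U+0D80..U+0DFF) take 2 bytes in UTF-16 and 3 bytes in UTF-8; the Latin letters used above take 1 byte each in CP-1252.

```python
# Illustrative sample: the word "Sinhala" in Sinhala script, 5 code points.
sinhala = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd"
# The romanized sentence quoted earlier in this message.
romanized = "mee laþingaþa síhalayi."


def encoded_sizes(text, encodings):
    """Return the encoded length in bytes of text for each named encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}


print(encoded_sizes(sinhala, ["utf-16-le", "utf-8"]))
print(encoded_sizes(romanized, ["cp1252", "utf-8"]))
```

This measures text bytes only. It does not account for markup, style sheets, or downloadable fonts, which also count toward what a browser actually transfers.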

more likely to correctly show them the document instead of trash

Again *demonstrably WRONG*: Unicode Sinhala is trash in a machine that does
not have the fonts. It is trash also if the font used by the OS is
improperly made, such as in iPhone. It is generally trash because the
SLS1134 standard corrupts at least one writing convention. (Brandy issue).
On the other hand, romanized Singhala is always readable whether you have
the font or not. It is not helpful to criticize Singhala related things
without making a serious effort to understand the issues. Blind men thought
different things about the elephant.

If you mean that everyone should start using 16-bit Unicode characters, I
have no objection to that. It would happen if and when all applications
implement it. I cannot fight that even if I want to. But I do not see users
of English doing anything different to what they are doing now, like my
typing now, I think, using 8-bit characters. (I can verify that by copying
it and pasting into a text editor.)

I showed that the Singhala can be romanized and all the problems of
ill-conceived Unicode Indic can be eliminated by carefully studying the
grammar of the language and romanizing. (I used the word 'transliterate'
earlier, but the correct word is transcribe). I did it for Singhala and
made an Open Type font to show it perfectly in the traditional Singhala
script. So far, one RS smartfont and six Unicode fonts even after spending
$20M for a foreign expert to tell how to make fonts though it is right on
the web in the same language the expert spoke in.

My work irritates some, maybe because it is an affront to their belief that
they know all and decide all. Some feel let down that they could not think
of it earlier, and maybe write about a strange discovery like Abugida and
write a book on the nonsense. Most of all, I think it is just a cultural
block on this side of the globe.

As for Lankan technocrats, their worry is that the purpose of ICTA would
come unraveled.  I went there in November and it was revealed to me (by one
of its employees) that its purpose is to provide a single point of contact
for foreign vendors that can use local experts as their advocates.


On Thu, Jan 3, 2013 at 12:56 AM, Leif Halvard Silli 
xn--mlform-...@xn--mlform-iua.no wrote:

 Asmus Freytag, Mon, 31 Dec 2012 06:44:44 -0800:
  On 12/31/2012 3:27 AM, Leif Halvard Silli wrote:
  Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800:
  The Web archive for this very list, needs a fix as well …
 
 
  The way to formally request any action by the Unicode Consortium is
  via the contact form (found on the home page).

 Good idea. Done!

 Turned out to only be - it seems to me - an issue of mislabeling the
 monthly index pages as ISO-8859-1 instead of UTF-8. Whereas the very
 messages themselves are archived correctly. And thus I made the request
 that they properly label the index pages.

 Happy new year!
 --
 leif h silli
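
For readers who have not seen the effect of such mislabeling, here is a minimal Python sketch. The sample name is taken from this thread; everything else is an illustration, not the archive's actual code. UTF-8 bytes that are decoded as ISO-8859-1 turn each non-ASCII character into two or more wrong ones, and because ISO-8859-1 maps every byte to a character, the damage can be reversed as long as the bytes survive.

```python
original = "Martin J. Dürst"

# What the server sends: correct UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# What a browser shows when the page is labeled ISO-8859-1.
mislabeled = utf8_bytes.decode("iso-8859-1")

# Reversing the mistake: re-encode as ISO-8859-1, decode as UTF-8.
repaired = mislabeled.encode("iso-8859-1").decode("utf-8")

print(mislabeled)
print(repaired)
```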





Re: Interoperability is getting better ... What does that mean?

2013-01-08 Thread Jukka K. Korpela

2013-01-08 23:56, Naena Guru wrote:


May I ask if the following two are Latin script, English or Singhala?

1. This is written in English.
2. mee laþingaþa síhalayi.

For me, both are Latin script and 1 is English and 2 is Singhala (says,'
this is romanized Singhala').


Text 2 is “romanized Singhala” only by your private definition, and you 
don’t even mean that. You are not actually promoting the use of Latin 
letters to write Sinhala but to use a private 8-bit encoding for 
Sinhala. You expect such a font to be used that the letter “a” is not 
displayed as “a” but as something completely different, as a Sinhala 
character.


It seems that your agenda here is something very different from the 
Subject line you use – not about generalities, but about certain 
fontistic trickery.



http://www.lovatasinhala.com/


If you look at the title of the page as displayed in a browser’s tab 
header or equivalent, you see “nivahal heøa”. This is what happens when 
the font trickery fails (because browsers use their fixed fonts to 
display such items).


The trickery is nothing new. It was used even when you had to use <font 
face> on the web to use fonts, and at that time, the trickery was 
analyzed and found wanting, see e.g.

http://alis.isoc.org/web_ml/html/fontface.en.html
There’s no reason to go into such analyses any more.

If you are happy with this or that trickery and don’t want them to be 
analyzed, just use them. But please don’t expect the rest of the world 
to go back to bad old days.


Yucca





Re: Interoperability is getting better ... What does that mean?

2013-01-08 Thread Charlie Ruland

I for one am so glad we now have Unicode.

I remember when in pre-Unicode days my then-girlfriend was writing a PhD 
thesis in German about Russian linguistics.  She had fonts for both 
alphabets, but due to technical limitations the different letters had to 
share the same code points.  And at one point somehow the correct 
formatting got lost in her word processor...  A complete and utter disaster!


You are not serious, are you?

Charlie



Re: Interoperability is getting better ... What does that mean?

2013-01-08 Thread Leif Halvard Silli
Naena Guru, Tue, 8 Jan 2013 15:56:52 -0600:

 The statement,
 
 the death of most character sets makes everyone's systems smaller and
 faster
 
 is *FALSE*. Compare the sizes of the following two files that are copies of
 a newspaper article. The top part in red has few more words in romanized
 Singhala in the romanized Singhala file. Notice the size of each file:
 1. http://ahangama.com/jc/uniSinDemo.htm   size:38,092 bytes
 2. http://ahangama.com/jc/RSDemo.htm   size:18,922 bytes
  [ … ]
 Again *demonstrably WRONG*

To double-check your statement, I saved the above two pages in Safari’s 
webarchive format[1] and compared the resulting size of each archive 
file. The benefit of doing such a comparison is that we then get to 
count both the HTML page *plus* all the extra fonts that are included in 
the romanized Singhala file. Thus, we get a more *real* basis for 
comparing the relative size of the two pages. Here are the results:

1. http://ahangama.com/jc/uniSinDemo.htm, webarchive size: 205 459 bytes
2. http://ahangama.com/jc/RSDemo.htm, webarchive size: 223 201 bytes

As you can see, the romanized Singhala file loses: it becomes 
bigger than the UTF-8 version. I suppose the reason is that, for 
the romanized Singhala file, the browser has to download 
fonts in order to display the romanized Singhala. (I tried to do the 
same in Firefox, using its ability to save the complete page, but 
for some reason it did not work.)

I also ran a test on both pages with the YSlow service.[2] Here is the 
total weight of each page, according to YSlow, when run from Firefox:

1. http://ahangama.com/jc/uniSinDemo.htm, YSlow size: 92.7K
2. http://ahangama.com/jc/RSDemo.htm, YSlow size: 65.7K

And here are the YSlow results from Safari:

1. http://ahangama.com/jc/uniSinDemo.htm, YSlow size: 11.2K
2. http://ahangama.com/jc/RSDemo.htm, YSlow size:  9.0K

Rather interesting that Safari and Firefox differ that much. But 
anyhow, the YSlow results are pretty clear, and demonstrate that while 
the romanized Singhala page is smaller, it is only between 20 and 30 
percent smaller than the Unicode page.
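
The percentages follow directly from the figures listed above. A minimal Python sketch of the arithmetic, where percent_smaller and percent_larger are hypothetical helper names used only for illustration:

```python
def percent_smaller(smaller, larger):
    """How much smaller `smaller` is than `larger`, as a percentage of `larger`."""
    return (larger - smaller) / larger * 100


def percent_larger(larger, smaller):
    """How much larger `larger` is than `smaller`, as a percentage of `smaller`."""
    return (larger - smaller) / smaller * 100


# YSlow page weights in K, romanized page vs. Unicode page.
firefox = percent_smaller(65.7, 92.7)
safari = percent_smaller(9.0, 11.2)

# Safari webarchive sizes in bytes, romanized page vs. Unicode page.
webarchive = percent_larger(223201, 205459)

print(f"Firefox: {firefox:.1f}% smaller")       # 29.1% smaller
print(f"Safari: {safari:.1f}% smaller")         # 19.6% smaller
print(f"Webarchive: {webarchive:.1f}% larger")  # 8.6% larger
```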

However, despite the slightly bigger size, YSlow in Firefox (don't know 
how to see it in Safari) *still* reported that the Unicode page loaded 
faster!

Furthermore, when I inspected the source code of these two documents, 
I discovered that for the Unicode file, you included *two* 
downloadable fonts, whereas for the romanized Singhala page, you only 
included *one* downloadable font. (Why? Because both files actually 
contain some romanized Singhala!) Before we can *really* take those 
two test pages seriously, you must make sure that both pages use the 
same number of fonts! As it is, I strongly suspect that if you had 
included the same number of downloadable fonts in both pages, then the 
Unicode page would have won.

Of course, the romanized Singhala page has many usability problems as 
well: 1) it doesn't work with screen readers (users will hear the text 
as Latin text), 2) it doesn’t work with Find-in-page search (users will 
type in Sinhala, but since the content is actually Latin, they won’t 
find anything on the page), 3) the title of the romanized Singhala 
page is (I believe) not actually readable as Singhala, 4) there are 
many browsers in which the romanized Singhala file will not display: 
text browsers, Opera and any browser where CSS is disabled, 5) you get 
all kinds of problems for form submission.

Conclusion: Your claims about the file size advantage of romanized 
Singhala seem grossly exaggerated, if at all true, based as they are 
on a test of two files which aren't actually equal when it comes to the 
extra CSS stuff that they embed.

[1] http://en.wikipedia.org/wiki/Webarchive
[2] http://yslow.org/
-- 
leif halvard silli




Mark Crispin (1956-2012)

2013-01-08 Thread Michael Everson
Farewell to Mark Crispin, a true friend of Unicode. 

http://en.wikipedia.org/wiki/Mark_Crispin

https://www.ietf.org/mail-archive/web/imap5/current/msg00571.html

Michael Everson * http://www.evertype.com/