Re: unicode Digest V2 #52

2002-02-27 Thread Stefan Probst

Hello again,

my apologies if me language challenged caused some of you to regard my post 
about threads as a threat ;)

Stefan

At 23:54 26.02.2002 -0500, I wrote:
>Date: Tue, 26 Feb 2002 16:49:52 +0700
>From: Stefan Probst <[EMAIL PROTECTED]>
>Subject: Recent Threats
>
>
>Good Evening,
>
>can somebody pls. explain to me dummy, what the long threats about
>R(o|u)mania, Canada, California, Yankees, and Initials in various
>countries... have to do with Unicode?
>
>Maybe I just don't understand the deeper implications yet? ;)
>
>Cheers,
>Stefan





Recent Threats

2002-02-26 Thread Stefan Probst

Good Evening,

can somebody pls. explain to me dummy, what the long threats about
R(o|u)mania, Canada, California, Yankees, and Initials in various
countries... have to do with Unicode?

Maybe I just don't understand the deeper implications yet? ;)

Cheers,
Stefan





RE: Unicode Search Engines

2002-02-20 Thread Stefan Probst

Good Morning,

A new version of the relevant document (still a draft) has been out since
yesterday. You may want to check it:
http://www.w3.org/TR/charmod/

It is stated there under 4.3:
>[S] [I] A text-processing component that receives suspect text MUST NOT 
>perform any normalization-sensitive operations unless it has first 
>successfully validated the text for normalization, and MUST NOT normalize 
>the suspect text.
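
A minimal sketch of "validate first, do not normalize", assuming Python 3.8
or later (for unicodedata.is_normalized) and assuming that plain NFC is the
form being checked; the draft's own definition of "normalized" has further
conditions that are not covered here. The function names are made up for
illustration only.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """Report whether text is already in NFC, without modifying it."""
    # unicodedata.is_normalized is available from Python 3.8 onwards.
    return unicodedata.is_normalized("NFC", text)

def receive(text: str) -> str:
    """Hypothetical receiver: pass validated text through, reject the rest."""
    if not is_nfc(text):
        # The suspect text is rejected as-is; it is never normalized here.
        raise ValueError("text is not in NFC")
    return text

precomposed = "\u00c2"   # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
combining = "A\u0302"    # A + COMBINING CIRCUMFLEX ACCENT

print(is_nfc(precomposed))
print(is_nfc(combining))
```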


At 17:55 20.02.2002 +0100, Marco Cimarosti wrote:

>What (canonical or compatibility) *composition* normalization does is
>converting a sequence to a precomposed character, *if* one exists. What W3C
>says when they mandate composition is that a sequence like "a" + combining
>tilde has to be converted to the single code point "a with tilde".

Yes. And since there is a precomposed character for "A WITH CIRCUMFLEX", a 
web browser obviously should reject a combination of "A" "COMBINING 
CIRCUMFLEX", and not try to normalize it to NFC like IE5.5 does.
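
For reference, this is what composing that sequence produces. It is only a
sketch using Python's standard unicodedata module and says nothing about how
IE does it internally.

```python
import unicodedata

sequence = "A\u0302"  # LATIN CAPITAL LETTER A + COMBINING CIRCUMFLEX ACCENT
composed = unicodedata.normalize("NFC", sequence)

# Show the code points before and after composition.
print([f"U+{ord(c):04X}" for c in sequence])
print([f"U+{ord(c):04X}" for c in composed])
print(unicodedata.name(composed))
```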

>Why? Isn't that what W3C asked? The only risk is that the normalization is a
>waste of time, because it has already been made by the server.

According to my understanding the server should NOT do any normalization, 
but the authoring tools should do it.

>BTW, are you sure that it is NFKC? My understanding is that it was NFC +
>some extra passages. For instance, would superscript numbers like "²" be
>turned to "2"? I hope not, as that would break lots of web pages.

Correct. Web content is NFC; IDNs will be NFKC. Sorry, I mixed those up.
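
The difference Marco asks about can be seen with the superscript two from
his example. A small sketch with Python's unicodedata module:

```python
import unicodedata

superscript_two = "\u00b2"  # SUPERSCRIPT TWO

nfc = unicodedata.normalize("NFC", superscript_two)
nfkc = unicodedata.normalize("NFKC", superscript_two)

# NFC leaves the character alone; NFKC applies the compatibility mapping.
print(f"NFC:  U+{ord(nfc):04X}")
print(f"NFKC: U+{ord(nfkc):04X}")
```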

>The renderer is a DLL called Uniscribe, which is not shipped by default with
>all MS operating systems. I think that 95, 98 and ME only receive it by
>installing IE support for some languages.

Thank you. Good hint. I will check that.

> > Anybody experiences with other OSs / other characters?
>
>In my experience, most Windows NT apps behave like your description: things
>are better only inside IE. But Windows 2000 seems to work fine out of the
>box, in most applications.

It is planned that by mid 2002, Unicode (actually a national standard
which is based on Unicode) will be the compulsory standard in all State
Offices in Vietnam. If all users have to upgrade to an unknown OS (Win2k)
and maybe even purchase new equipment because their existing machines are
too slow for the new OS, then we will make no friends with the introduction
of Unicode.

Cheers,
Stefan





Re: Unicode Search Engines

2002-02-20 Thread Stefan Probst

Hello Doug,

Actually, it seems that IE does what you describe: it tries to normalize
to NFC/NFKC and displays the result. MS Word does not. At different sizes
the glyphs look quite ugly, since they really are combined: the dot below,
for example, sits exactly below the vowel only sometimes; often it is too
far left or right.

According to what you write, the renderer in my setup seems really broken
for the word processors (MS Word and OpenOffice), since it cannot display
the combining modifiers.
Regarding IE: the "a and i with horn" are not in use right now, so their
absence might be acceptable. But its inability to display the "space with
modifiers" is less acceptable.
On the other hand, there actually seems to be no need to display non-NFKC
text on the Web since, as far as I understand, W3C is planning to make NFKC
a requirement for the Web. By trying to normalize the input (the combining
sequences to NFKC), IE might even be working against the planned W3C rules.

Assuming that the renderer is part of the OS and used by most, if not all,
applications, I conclude that Windows ME is not able to handle the
combining modifier characters. Does anybody have experience with other OSs
or other characters?

Stefan



At 21:52 18.02.2002 -0800, Doug Ewell wrote:
>In theory, a fully conformant Unicode renderer is supposed to be able to
>combine an arbitrary base character with arbitrary combining marks.  The
>renderer is supposed to look at the glyphs and decide how to combine them
>dynamically so they look reasonable together.  So you should be able to
>combine "o with horn," "a with horn," or "q with horn" and get the
>expected result.
>
>In the real world, it doesn't work like that.  Renderers detect sequences
>of base+combining characters, look for an equivalent precomposed form, and
>display that instead.  For example, they detect U+006F (o) followed by
>U+031B (combining horn), and instead of trying to figure out how to
>combine them, simply generate U+01A1 (o with horn) instead.  This results
>in a nice-looking precomposed glyph (if it's in the font) with a lot less
>work.  But it means that U+0061 (a) plus U+031B (combining horn) can't be
>displayed properly, since there is no precomposed code point for "a with
>horn."
>
>In the '90s, when UTC and WG2 were more open to encoding precomposed
>forms, this approach was not too problematic, since any legitimate
>diacriticized character in an alphabetic script probably had its own
>precomposed form.  Today, because of normalization considerations, we are
>probably not going to see any more precomposed characters that can already
>be formed with combining sequences.  So if some language turns out to need
>"a with horn" in the future, its readers will have to cross their fingers
>that rendering engines become capable of displaying U+0061 U+031B
>properly.
>
>-Doug Ewell
>  Fullerton, California





Re: Unicode Search Engines

2002-02-18 Thread Stefan Probst

At 30 Jan 2002 11:38:37 -0500, John Cowan <[EMAIL PROTECTED]> wrote:
>Stefan Probst wrote:
> > And since we are already in Vietnamese (to round the things up):
> > I am not sure, how e.g. in the introduction to dictionaries or
> > Vietnamese language books, the tonal mark can be printed "alone". One
> > solution might be to combine them with a "space", but at present, this
> > does not work always.
>
>When does it not?  It is the standard Unicode thing to do.

Well, I tried it with:
a) the Vietnamese "tonal marks":
- grave       U+0300  combining class: 230
- hook above  U+0309  combining class: 230
- tilde       U+0303  combining class: 230
- acute       U+0301  combining class: 230
- dot below   U+0323  combining class: 220

b) the Vietnamese "modifier" characters:
- breve       U+0306  combining class: 230
- circumflex  U+0302  combining class: 230
- horn        U+031B  combining class: 216

I tried to combine them with the space character and with some vowels.

The tonal marks usually worked quite well, but the modifier characters did not:
In WinME, they did not work in MS Word 97 or OpenOffice 641.
In IE 5.5 they did not work with the space, and worked only with the "right
combination" of vowels and modifiers:
OK: (all vowels a,e,i,o,u) + (any of breve or circumflex)
OK: o + horn, u + horn (which are in fact valid Vietnamese characters)
NOT OK: a + horn, e + horn, i + horn (which actually are not valid 
Vietnamese characters)

Are the described issues a problem of the OS (e.g. rendering engine), of the
application (why does IE behave differently from Word?), or of a correct
Unicode implementation (e.g. that the horn does not combine with a,e,i)?
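
One way to check whether the OK / NOT OK pattern above lines up with the
availability of precomposed characters is to ask NFC whether a base plus
modifier collapses into a single code point. This is only a sketch with
Python's unicodedata module; the helper name is made up, and the result says
nothing certain about what IE or the OS renderer actually does.

```python
import unicodedata

VOWELS = "aeiou"
MODIFIERS = {
    "breve": "\u0306",
    "circumflex": "\u0302",
    "horn": "\u031b",
}

def has_precomposed(base: str, mark: str) -> bool:
    """True if NFC composes base + mark into one precomposed code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1

for name, mark in MODIFIERS.items():
    found = [v for v in VOWELS if has_precomposed(v, mark)]
    missing = [v for v in VOWELS if not has_precomposed(v, mark)]
    print(f"{name:<12}precomposed: {found}  none: {missing}")
```

With the Unicode data shipped in current Python versions, the vowels without
a precomposed horn form are exactly a, e and i, the same set that failed in
IE 5.5.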


Best Regards,
Stefan






Re: Unicode Search Engines

2002-01-30 Thread Stefan Probst

Hello Doug,

judging from how well you understood the issue (including your case 5),
one could think you were Vietnamese ;)

It is exactly the "dot below" which causes the most problems, since its
combining class (220) is lower than that of some of the modifiers (230).
And unfortunately other tonal marks have the same combining class as the
modifiers (230), and therefore their order seems to be not even specified!
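
A small sketch of what canonical ordering does in these two cases, using
Python's unicodedata module. As far as I understand the normalization rules,
marks with different combining classes are sorted by class, while marks with
the same class are left in the order in which they were entered.

```python
import unicodedata

def code_points(text: str) -> list:
    """Format each character of text as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in text]

# Circumflex (class 230) entered before dot below (class 220):
# NFD moves the dot below in front of the circumflex.
circumflex_then_dot = "A\u0302\u0323"
print(code_points(unicodedata.normalize("NFD", circumflex_then_dot)))

# Circumflex and acute both have class 230: NFD keeps the order as entered,
# so the two sequences below stay different and are not equivalent.
circumflex_first = "A\u0302\u0301"
acute_first = "A\u0301\u0302"
print(code_points(unicodedata.normalize("NFD", circumflex_first)))
print(code_points(unicodedata.normalize("NFD", acute_first)))
```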

To have the information together:
The modifiers, which change the base character to form a new character:
breve       U+0306  combining class: 230
circumflex  U+0302  combining class: 230
horn        U+031B  combining class: 216
The tonal marks, which have only a very loose connection with the character 
(i.e. in handwriting they are often even placed above two adjacent vowels):
grave       U+0300  combining class: 230
hook above  U+0309  combining class: 230
tilde       U+0303  combining class: 230
acute       U+0301  combining class: 230
dot below   U+0323  combining class: 220
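
The combining classes listed above can be read from the Unicode character
database, for example with Python's unicodedata module. The dictionary and
its labels are just for this illustration.

```python
import unicodedata

MARKS = {
    "breve": "\u0306",
    "circumflex": "\u0302",
    "horn": "\u031b",
    "grave": "\u0300",
    "hook above": "\u0309",
    "tilde": "\u0303",
    "acute": "\u0301",
    "dot below": "\u0323",
}

for label, mark in MARKS.items():
    # unicodedata.combining returns the canonical combining class as an int.
    ccc = unicodedata.combining(mark)
    print(f"{label:<12}U+{ord(mark):04X}  combining class: {ccc}")
```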

I have already made test pages, e.g. the one at
http://www.isoc-vn.org/www/standard/normalizationtest13.html

The issue goes even a bit further:

(1) Sorting
It is said that in sorting, all combining marks should be disregarded.
While in Vietnamese this is OK for the (combining) tone marks, it is
absolutely not OK for the (combining) modifiers. In Vietnamese, e.g. an "a"
with "circumflex" is a completely different character from an "a" alone.
This is why some circles in Vietnam prefer what I call "VN-combined": base
character and modifier pre-composed, tone mark combining.
(2) Converting
Inside Vietnam, mainly two different encodings were used in the past:
- "TCVN-ABC": Fully pre-composed, but a separate font for some upper case
characters
- "VNI": Mainly using combining characters
When converting old documents (office and web) to Unicode, the question
will be whether the tools do any normalization (especially in the case of
VNI), or just re-map [combining] character by [combining] character.
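
For point (1), a rough sketch of the "VN-combined" idea as a comparison key:
tone marks are dropped, but the modifiers stay attached to their base
letters. It uses Python's unicodedata module, the function name is made up,
and it is meant for Vietnamese text only (the tilde, for instance, is not a
tone mark in other languages). It is not a complete Vietnamese collation.

```python
import unicodedata

# The five combining tone marks; the modifiers (breve, circumflex, horn)
# are deliberately not in this set.
TONE_MARKS = {"\u0300", "\u0309", "\u0303", "\u0301", "\u0323"}

def strip_tone_marks(text: str) -> str:
    """Drop tone marks but keep base letter + modifier precomposed."""
    decomposed = unicodedata.normalize("NFD", text)
    without_tones = "".join(c for c in decomposed if c not in TONE_MARKS)
    return unicodedata.normalize("NFC", without_tones)

for letter in ["\u00e1", "\u00e2", "\u1ea5", "\u1ead"]:
    key = strip_tone_marks(letter)
    print(f"U+{ord(letter):04X} -> U+{ord(key):04X}")
```

With such a key, "a with acute" falls together with plain "a", while
"a with circumflex and acute" falls together with "a with circumflex" and
stays distinct from plain "a".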

And to make things worse, it seems that MS prefers the combining way,
saying that their sorting, spell check, word wrap etc. work that way.

Vietnam plans to make Unicode compulsory for state offices by the middle of
2002. I have been asked to advise, and volunteered to take care mainly of
Internet issues.

Right now, in Vietnam they are still discussing whether they should
require a specific normalization, and if so, which one of the four possible
candidates.

According to W3C's draft at http://www.w3.org/TR/charmod/#sec-Normalization
it seems that all Web applications (and that might include search
engines?) should reject (to be precise: MUST NOT handle) everything which
is not NFC. This could mean that search engines MUST NOT index pages that
are not in NFC and would have to reject queries that are not in NFC. If
they do: fine. If not: then we probably have quite some problems...


And since we are already on Vietnamese (to round things off):
I am not sure how, e.g. in the introduction to dictionaries or Vietnamese
language books, the tonal mark can be printed "alone". One solution might
be to combine them with a "space", but at present this does not always work.
And only some of the tonal marks seem to have a "stand-alone version", e.g.
U+02CB for the "grave".
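
A short sketch of the two options, using Python's unicodedata module. It
also shows that at least one existing spacing accent, U+00B4, is itself
defined in the character database as "space + combining mark" through a
compatibility decomposition, while U+02CB has no decomposition at all.

```python
import unicodedata

# Option 1: the combining grave placed on a space.
grave_on_space = " \u0300"   # SPACE + COMBINING GRAVE ACCENT

# Option 2: a spacing, stand-alone character.
standalone_grave = "\u02cb"
print(unicodedata.name(standalone_grave))

# U+00B4 ACUTE ACCENT maps to SPACE + COMBINING ACUTE ACCENT under the
# compatibility forms; U+02CB returns an empty decomposition string.
print(repr(unicodedata.decomposition("\u00b4")))
print(repr(unicodedata.decomposition(standalone_grave)))
```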

Best Regards,
Stefan


At 01:29 30.01.2002 -0500, [EMAIL PROTECTED] wrote:
>In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
>[EMAIL PROTECTED] writes:
>
> > I would like to add:
> > How do they handle normalization?
> > In Vietnam, many characters can be represented in several different ways:
> > (1) fully precomposed (NFC)
> > (2) base character and modifier precomposed, tonal mark combining
> > (3) base character, then modifier, then tonal mark
> > (4) like (3), but modifier and tonal mark sorted (NFD)
> > Do the search engines do any normalization, before indexing a page?
> > Are queries normalized before running the search?
>
>I'm not sure what sort of normalization might be performed by search engines,
>but I want to examine the Vietnamese decomposition aspect for a moment.
>
>If you have a Vietnamese vowel with both modifier and tone mark, say LATIN
>CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can represent this in
>Unicode in at least three ways:
>
>(1) fully precomposed (NFC) -- that is, U+1EA4
>(2) base character and modifier precomposed, tonal mark combining -- that is,
>U+00C2 U+0301
>(3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302
>U+0301
>
>So far, so good.  But then we have:
>
> > (4) like (3), but modifier and tonal mark sorted (NFD)
>
>If "sorting" the diacritical marks in NFD results in rearranging the two
>diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then in terms of
>Vietnamese orthography, the NFD form may not really be a legitimate way of
>representing the Vietnamese letter.
>
>For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT

Re: Unicode Search Engines

2002-01-28 Thread Stefan Probst

On Wed Jan 16 23:49:29 2002 +0400 Aman Chawla wrote:
>Are there any search engines at all at present which allow one to search 
>sites encoded in UTF-8? If not, are there plans to build such search 
>engines? For example, is Google going to implement such an engine?

I would like to add:
How do they handle normalization?
In Vietnam, many characters can be represented in several different ways:
(1) fully precomposed (NFC)
(2) base character and modifier precomposed, tonal mark combining
(3) base character, then modifier, then tonal mark
(4) like (3), but modifier and tonal mark sorted (NFD)
Do the search engines do any normalization, before indexing a page?
Are queries normalized before running the search?

In other words:
If, for example, the page is written in NFC but the query is entered in
NFD, will the search find anything?
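
To make the question concrete, here is a sketch in Python (standard
unicodedata module) with one letter, capital A with circumflex and acute,
written in three of the ways listed above. For this particular letter, ways
(3) and (4) give the same sequence. The "page" and "queries" are made-up
stand-ins; I do not know what any real search engine does.

```python
import unicodedata

def nfc(text: str) -> str:
    """Normalize text to NFC."""
    return unicodedata.normalize("NFC", text)

page = "\u1ea4"  # page content, fully precomposed (NFC)

queries = {
    "(1) fully precomposed": "\u1ea4",
    "(2) modifier precomposed, tone combining": "\u00c2\u0301",
    "(3)/(4) base, modifier, tone": "A\u0302\u0301",
}

for label, query in queries.items():
    raw_hit = query in page                   # compare stored code points
    normalized_hit = nfc(query) in nfc(page)  # normalize both sides first
    print(f"{label}: raw={raw_hit} normalized={normalized_hit}")
```

Without normalization only the first query matches; if both page and query
are normalized before comparison, all three match.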

Rgds,
Stefan