Re: [fpc-devel] Unicodestring branch, please test and help fixing

listmember Fri, 12 Sep 2008 10:23:19 -0700

Sorry, but I meant comparing with collation. I did not mean comparing
within language context.


How can you do /proper/ collation while ignoring the language context?

1) 'sıkıcı' which means 'boring' in English (notice the dotless small
'i's)

2) 'sikici' which means 'fucker' in English

Depends how you normalize. Normalize should substitute all *equal*
letters (or combination thereof) into one single form. That allows
comparing and matching them.


Again, we're not quite on the same page here...

What you're referring to is more like 'Text Normalization' [http://en.wikipedia.org/wiki/Text_normalization ], where you definitely do need a very comprehensive dictionary so that '1' is equal to 'one' and '1st' to 'first', etc. (if your language is English).

Whereas, what I am referring to is 'Unicode Normalization' [http://en.wikipedia.org/wiki/Unicode_normalization ].

This one is much narrower in scope. It deals basically with what I can refer to as 'character glyphs'.

Now, from what I understand from the definitions of 'Unicode Normalization', there are 2 ways of doing it:

1) You decompose both texts (so that you have all 'weird' characters expanded into their combining characters)

2) You compose both texts (so that you have as few combining characters as possible, or none)

This is done, obviously, to get them both in the same format --to make life easier to compare.

If you do no other operation on these two texts before you compare them, this is called a Canonical Equivalence Test --each 'character glyph' in each text must be the same.

For a Canonical Equivalence Test, you do not need to have any 'language' attribute --after all, you're doing a simple byte-wise test.
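
The test described above can be sketched as follows. This is a minimal illustration in Python (not Free Pascal), using the standard unicodedata module; the helper name canonically_equal is made up for this example and is not part of any library.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Bring both strings into composed form (NFC), then compare directly.

    No language attribute is involved at any point.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00e9"      # e with acute accent as one precomposed code point
decomposed = "e\u0301"   # plain 'e' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                    # comparison without normalizing
print(canonically_equal(composed, decomposed))   # comparison after normalizing
```

Decomposing both sides (NFD) instead of composing them would work just as well; what matters is that both texts end up in the same form before they are compared.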

On the other hand, if you wish to do a broader comparison, a Compatibility Equivalence Test or something else, you will need to do a little more work on those texts:

Normalization is one of them. I suggest you take a look at the 'Normalization' heading under http://en.wikipedia.org/wiki/Unicode_normalization

Trouble with the 'Normalization' described there is, it is far too crude for quite a lot of purposes.
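
For illustration, here is a small Python sketch of the difference between the canonical and the compatibility forms, assuming the standard unicodedata module. The helper name compat_equal is hypothetical. It also hints at why the compatibility forms can be too crude: a ligature or a circled digit is silently flattened into plain letters or digits.

```python
import unicodedata

def compat_equal(a: str, b: str) -> bool:
    """Compare two strings under compatibility equivalence (NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

ligature = "\ufb01"        # LATIN SMALL LIGATURE FI, a single code point
circled_one = "\u2460"     # CIRCLED DIGIT ONE

# Canonical composition leaves the ligature alone ...
canonical = unicodedata.normalize("NFC", ligature)
# ... while compatibility composition replaces it with 'f' + 'i'.
compatible = unicodedata.normalize("NFKC", ligature)

print(canonical == ligature, compatible == "fi")
print(compat_equal(circled_one, "1"))
```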

A better form of comparison is converting both texts to either uppercase or lowercase.

And, once we do this, we hit two walls (or obstacles) to overcome. The steps I can think of are:

1) Equivalent code points. We need first to 'compose' the text and then substitute the relevant (and preferred) equivalent code points for any 'character glyphs' in the texts.

2) We also need to take care of stuff like language-dependent case transforms. See http://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I

As far as I know, this is the only 'proper' thing to do for search and comparison operations under Unicode.
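
The two steps above can be sketched like this. It is only a rough Python illustration under stated assumptions: the function names lower_tr and compare_key and the language tag "tr" are invented for this example, and only the Turkish dotted/dotless I rule is handled; a real implementation would need tables for other languages as well.

```python
import unicodedata

def lower_tr(text: str) -> str:
    """Lowercase with the Turkish rule: I -> dotless i, dotted capital I -> i."""
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()

def compare_key(text: str, lang: str) -> str:
    """Step 1: compose the text. Step 2: apply a language-dependent case transform."""
    composed = unicodedata.normalize("NFC", text)
    if lang == "tr":
        return lower_tr(composed)
    return composed.lower()   # language-neutral fallback

word = "SIKICI"
print(compare_key(word, "tr"))   # Turkish rule: dotless small i
print(compare_key(word, "en"))   # default rule: dotted small i
```

With the language known, the upper-case form maps back to the first word from the earlier example; without it, the default case mapping produces the second word instead.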


I know it will be slower, but, that is the price to pay.

Note: The reason I used the term 'character glyphs' is that several code points can be combined to make a 'character glyph'.

See the definition of Code Point [ http://unicode.org/glossary/ ], which says:

"Code Point: Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆."

As an example, from the above Wiki article, we can use 2 code points to produce a 'character glyph', such as


'n' + '~' --> ñ
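
A short Python check of this example, using the standard unicodedata module. One assumption worth stating: the second code point has to be the combining tilde (U+0303); the ordinary ASCII '~' (U+007E) is a separate character and does not combine.

```python
import unicodedata

base = "n"
combining_tilde = "\u0303"   # COMBINING TILDE
ascii_tilde = "~"            # plain TILDE, U+007E

sequence = base + combining_tilde                 # two code points, one glyph
glyph = unicodedata.normalize("NFC", sequence)    # composed into one code point
not_composed = unicodedata.normalize("NFC", base + ascii_tilde)

print(len(sequence), len(glyph))
print(unicodedata.name(glyph))
print(len(not_composed))
```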

But yes, even this is very limited (busstop), because even if you know
the language of the word (German in my example) you do not know its
meaning.

You do not worry about the meaning at all. In all languages (I guess) there are several words that may be written the same but mean different things.

Without a full dictionary, you do not know if ss and German sharp-s are
the same or not.

True. But, if you do know it is in German, then you definitely know they are. And, this makes a lot of difference.
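
As a rough illustration in Python: the built-in str.casefold() applies Unicode full case folding, which maps sharp-s to 'ss'. Note the assumption here: this folding is applied regardless of language, so it only models the "we know it is German" case if the caller has already established that. The helper name equal_folded is made up for this sketch.

```python
def equal_folded(a: str, b: str) -> bool:
    """Compare two strings after Unicode full case folding."""
    return a.casefold() == b.casefold()

with_sharp_s = "Stra\u00dfe"   # 'Strasse' spelled with sharp-s
with_ss = "STRASSE"

print(with_sharp_s.lower() == with_ss.lower())   # simple lowercasing keeps sharp-s
print(equal_folded(with_sharp_s, with_ss))       # full case folding expands it to 'ss'
```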

So basically what you want to do, can only be done with a full
dictionary. Or you have to accept false positives.


Nope. No false positives at the text level.

You can always, of course, get false positives at the semantic level --such as when you're looking for 'apple' (the fruit) and 'Apple' (the brand name), but that's a completely different problem.

I also fail to see why a utf8 string is a half baked solution. It will
serve most people fine. It can be extended for those who want more.

I have nothing against UTF-8 or any other encoding scheme. It is just that --an encoding scheme. Most handy as a means of transporting data from one medium/app to another.

But UTF-8 in no way covers the whole of Unicode, nor is it a complete solution for dealing with Unicode. It is, after all, an encoding scheme.
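
To make the point concrete, here is a small Python sketch showing that UTF-8 only decides how code points become bytes. Two canonically equivalent texts still encode to different byte sequences, and the same code point has different bytes under another encoding scheme. The variable names are arbitrary.

```python
import unicodedata

composed = "\u00f1"      # n with tilde as one code point
decomposed = "n\u0303"   # 'n' followed by COMBINING TILDE

utf8_composed = composed.encode("utf-8")
utf8_decomposed = decomposed.encode("utf-8")
utf16_composed = composed.encode("utf-16-le")

# Same 'character glyph', different bytes: the encoding does not normalize.
print(utf8_composed, utf8_decomposed)
# Same code point, different encoding scheme, different bytes again.
print(utf16_composed)
# Equivalence only shows up once the text itself is normalized.
print(unicodedata.normalize("NFC", decomposed).encode("utf-8") == utf8_composed)
```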

BUT of course there is no way to deal with the ambiguous "Busstop"


Not even if you knew that "Busstop" was a german string?

Indeed. For this case, you need to know what language "Busstop" was
written in.

You need a dictionary. Knowing it is German is not enough, because all
that "it is German" tells you is that "ss" may be a sharp-s, but doesn't
have to be.

A dictionary, then, wouldn't help you either, because all it could tell you is that it could be either a loan word or a native word.

True, I am happy to do that. NOT

I am glad we have met :)

have we? I remember a mail conversation, but not an actual meeting :) SCNR

Well, we haven't met face to face; but (in this discussion) we seem to have met at a common point. :)

Of course, these and even more creative hacks could be devised.
The question is, is the language an attribute of a unicode character?

(I assume "mandatory attribute")

Well as much as it is or is not an attribute of a latin1 or iso-whatever
char.


Well.. Does it have to be Latin1?

I keep giving you Turkish examples.

And Turkish is --hold on to your seats-- Latin5, or ISO-8859-9 [http://en.wikipedia.org/wiki/ISO-8859-9 ]

I do not think it is. I have no proof. But a lot of people seem to think
so, if I google Unicode (or any other char/latin./iso...) I get nice
character tables; and no language info.


See the link above [ http://en.wikipedia.org/wiki/ISO-8859-9 ].

For some reason, they felt the need to say "ISO 8859-9, also known as Latin-5 or 'Turkish'" :)


Similarly, for ISO-8859-7, [ http://en.wikipedia.org/wiki/ISO_8859-7 ],
they had to say this: "ISO 8859-7, also known as Greek".

Do you still think character sets are independent of languages?

Question:

Does the fact that there is something called 'Unicode' mean we have invented a whole new language that rules them all, or does it just mean that it is a pool of all (most) known alphabets?

If it is the latter, you still need to know what language a given piece of string in that alphabet soup is in.



_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
