Sorry, but I meant comparing with collation. I did not mean comparing
within language context.

How can you do /proper/ collation while ignoring the language context?

1) 'sıkıcı' which means 'boring' in English (notice the dotless small
'i's)

2) 'sikici' which means 'fucker' in English

Depends how you normalize. Normalization should substitute all *equal*
letters (or combinations thereof) with one single form. That allows
comparing and matching them.

Again, we're not quite on the same page here...

What you're referring to is more like 'Text Normalization' [ http://en.wikipedia.org/wiki/Text_normalization ], where you definitely do need a very comprehensive dictionary so that '1' is equal to 'one' and '1st' is 'first', etc. (if your language is English).

Whereas, what I am referring to is 'Unicode Normalization' [ http://en.wikipedia.org/wiki/Unicode_normalization ].

This one is much narrower in scope. It deals basically with what I can refer to as 'character glyphs'.

Now, from what I understand from the definitions of 'Unicode Normalization' there are 2 ways of doing it:

1) You decompose both texts (so that you have all 'weird' characters expanded into their combining characters)

2) You compose both texts (so that you have few or no combining characters)

This is done, obviously, to get them both into the same form --to make them easier to compare.

If you do no other operation on these two texts before you compare them, this is called a Canonical Equivalence Test --each 'character glyph' in each text must be the same.

For a Canonical Equivalence Test, you do not need any 'language' attribute --after all, once both texts are in the same normalized form, you're doing a simple byte-wise test.
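
A minimal sketch of such a test, in Python and using only the standard `unicodedata` module. The helper name `canonically_equal` is made up for illustration, and choosing NFC rather than NFD is an arbitrary assumption; what matters is that both texts end up in the same form. No language information is involved:

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Bring both texts to the same normal form (NFC here), then compare as-is."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00f1"      # one code point: LATIN SMALL LETTER N WITH TILDE
decomposed = "n\u0303"   # two code points: 'n' followed by COMBINING TILDE

print(composed == decomposed)                   # raw comparison, no normalization
print(canonically_equal(composed, decomposed))  # comparison after normalization
```

The raw comparison sees two different code point sequences; only the normalized comparison treats them as the same text.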

On the other hand, if you wish to do a broader comparison, a Compatibility Equivalence Test or something else, you will need to do a little more work on those texts:

Normalization is one of them. I suggest you take a look at the 'Normalization' heading under http://en.wikipedia.org/wiki/Unicode_normalization
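
To make the difference between the canonical and the compatibility forms concrete, here is a small Python sketch using the standard `unicodedata` module. The helper name `compat_equal` and the sample strings are assumptions chosen for illustration only:

```python
import unicodedata

def compat_equal(a: str, b: str) -> bool:
    """Compare two texts after compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

ligature = "\ufb01le"   # starts with U+FB01 LATIN SMALL LIGATURE FI
plain = "file"          # the same word spelled with separate 'f' and 'i'

# Canonical normalization leaves the ligature alone, so the texts still differ.
print(unicodedata.normalize("NFC", ligature) == plain)

# Compatibility normalization replaces the ligature with 'f' + 'i'.
print(compat_equal(ligature, plain))
```

Neither form touches letter case or anything language-specific.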

Trouble with the 'Normalization' described there is, it is far too crude for quite a lot of purposes.

A better form of comparison is converting both texts to either uppercase or lowercase.

And, once we do this, we hit two walls (or obstacles) to overcome. The steps I can think of are:

1) Equivalent code points. We first need to 'compose' the text and then substitute the relevant (and preferred) equivalent code points for any 'character glyphs' in the texts.

2) We also need to take care of stuff like language-dependent case transforms. See http://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I
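
The two steps above can be sketched roughly as follows, in Python. The function names `lower_tr` and `comparison_key`, the `lang` parameter, and the use of the tag "tr" are all hypothetical and only there to illustrate the idea; a real implementation would rely on proper locale data:

```python
import unicodedata

def lower_tr(text: str) -> str:
    """Lowercase with the Turkish dotted/dotless I rule applied first."""
    # In Turkish, 'I' lowercases to dotless 'ı' (U+0131) and
    # 'İ' (U+0130) lowercases to plain dotted 'i'.
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()

def comparison_key(text: str, lang: str) -> str:
    """Step 1: compose the text. Step 2: language-dependent case transform."""
    composed = unicodedata.normalize("NFC", text)
    return lower_tr(composed) if lang == "tr" else composed.lower()

print(comparison_key("ILIK", "tr"))   # dotless i's are kept
print(comparison_key("ILIK", "en"))   # default rule gives dotted i's
print(comparison_key("İLİK", "tr"))   # dotted capital I becomes plain 'i'
```

The same input yields different keys depending on the language passed in, which is exactly why the language has to be known before the comparison.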

As far as I know, this is the only 'proper' thing to do for search and comparison operations under Unicode.

I know it will be slower, but, that is the price to pay.

Note: The reason I used the term 'character glyphs' is that several code points can be combined to make a 'character glyph'.

See the definition of Code Point [ http://unicode.org/glossary/ ] which says:

"Code Point: Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆."

As an example, from the above Wiki article, we can use 2 code points to produce a 'character glyph', such as

'n' + '~' --> ñ
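
A small Python sketch of that example, using the standard `unicodedata` module. It assumes the '~' above stands for the combining tilde (U+0303) and not the plain ASCII tilde (U+007E), which does not combine:

```python
import unicodedata

decomposed = "n\u0303"                               # 'n' + COMBINING TILDE
composed = unicodedata.normalize("NFC", decomposed)  # composed form

# List the code points that make up each form.
for label, text in (("decomposed", decomposed), ("composed", composed)):
    names = [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]
    print(label, len(text), names)

# A plain ASCII tilde is not a combining character, so nothing is composed.
print(unicodedata.normalize("NFC", "n~") == "n~")
```

Both forms display as one 'character glyph', but one is two code points long and the other is one.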

But yes, even this is very limited (busstop), because even if you know
the language of the word (German in my example) you do not know its
meaning.

You do not worry about the meaning at all. In all languages (I guess) there are several words that may be written the same but mean different things.

Without a full dictionary, you do not know if ss and German sharp-s are
the same or not.

True. But, if you do know it is in German, then you definitely know they are. And, this makes a lot of difference.
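
For illustration, Python's built-in `str.casefold()` applies Unicode's default case folding, which maps the sharp s to 'ss'. Note that this default is language-independent, so the sketch below only shows the text-level matching being argued about, not a German-specific rule; the sample words are just examples:

```python
# Default Unicode case folding treats sharp s and 'ss' as the same text.
print("Straße".casefold() == "STRASSE".casefold())

# The same folding also matches the disputed example from this thread,
# which is where the two of us disagree about false positives.
print("Busstop".casefold() == "Bußtop".casefold())

# Plain lowercasing does not fold the sharp s, so it keeps them apart.
print("Straße".lower() == "STRASSE".lower())
```

Whether the second match counts as a false positive is a semantic question; at the text level the folding is doing what it was asked to do.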

So basically what you want to do, can only be done with a full
dictionary. Or you have to accept false positives.

Nope. No false positives at the text level.

You can always, of course, get false positives at the semantic level --such as when you're looking for 'apple' (the fruit) and 'Apple' (the brand name), but that's a completely different problem.

I also fail to see why a utf8 string is a half baked solution. It will
serve most people fine. It can be extended for those who want more.

I have nothing against UTF-8 or any other encoding scheme. It is just that --an encoding scheme. Most handy as a means of transporting data from one medium/app to another.

But UTF-8 by itself in no way covers the whole of Unicode (normalization, case mapping, collation), nor is it a complete solution for dealing with Unicode. It is, after all, an encoding scheme.
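
A short Python sketch of that point, using only the standard library. UTF-8 faithfully encodes whatever code points it is handed, so two canonically equivalent texts still come out as different byte sequences unless something above the encoding normalizes them first:

```python
import unicodedata

composed = "\u00f1"      # ñ as a single code point
decomposed = "n\u0303"   # 'n' followed by COMBINING TILDE

# The encoding just serializes code points; it knows nothing about equivalence.
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))

# Only after normalization do the encoded bytes agree.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized.encode("utf-8") == composed.encode("utf-8"))
```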

BUT of course there is no way to deal with the ambiguous "Busstop"

Not even if you knew that "Busstop" was a german string?

Indeed. For this case, you need to know what language "Busstop" was
written in.
You need a dictionary. Knowing it is German is not enough, because all
that "it is German" tells you is that "ss" may be a sharp-s, but doesn't
have to be.

A dictionary, then, wouldn't help you either because all it could tell you is that it could be either a loan word or a native word.

True, I am happy to do that. NOT
I am glad we have met :)
have we? I remember a mail conversation, but not an actual meeting :) SCNR

Well we haven't met face to face; but (in this discussion) we seem to have met at a common point. :)

Of course, these and even more creative hacks could be devised.
The question is, is the language an attribute of a unicode character?
(I assume "mandatory attribute")

Well as much as it is or is not an attribute of a latin1 or iso-whatever
char.

Well.. Does it have to be Latin1?

I keep giving you Turkish examples.

And Turkish is --hold on to your seats-- Latin5, or ISO-8859-9 [ http://en.wikipedia.org/wiki/ISO-8859-9 ]

I do not think it is. I have no proof. But a lot of people seem to think
so; if I google Unicode (or any other char/latin./iso...) I get nice
character tables, and no language info.

See the link above [ http://en.wikipedia.org/wiki/ISO-8859-9 ].

For some reason, they felt the need to say "ISO 8859-9, also known as Latin-5 or 'Turkish'" :)

Similarly, for ISO-8859-7, [ http://en.wikipedia.org/wiki/ISO_8859-7 ],
they had to say this: "ISO 8859-7, also known as Greek".

Do you still think character sets are independent of languages?

Question:

Does the fact that there is something called 'Unicode' mean we have invented a whole new language that rules them all, or does it just mean that it is a pool of all (most) known alphabets?

If it is the latter, you still need to know what language a given piece of string in that alphabet soup is in.


_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
