On Fri May 19 19:13:39 CDT 2006, [EMAIL PROTECTED] wrote:
> > no. the unicode sequences (e.g. U+0069 U+0361) are correct.
> > i checked this and several other examples with the actual books.
>
> How did you check it ? Visual inspection ?
since these were actual books, i know of no other way. ;-)
> Since I'm no expert
> in UNICODE I'm quite curious to know how one is supposed to
> tell between a real character and a combination of a diacritic
> and some other character when they are visually indistinguishable ?
say i have a random accented letter. suppose that U+x is the cp for
the letter. suppose U+y is the cp for the accent. suppose that we're lucky
and there exists U+w ≡ U+xU+y. then U+w should be the same glyph
as U+xU+y.
canonical composition would yield
    compose(U+xU+y) -> U+w
    compose(U+w)    -> U+w
while canonical decomposition would yield
    decompose(U+xU+y) -> U+xU+y
    decompose(U+w)    -> U+xU+y
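
to make that concrete, here is a minimal sketch using python's standard unicodedata module. the choice of python and the helper names compose, decompose and cps are mine, purely for illustration. i'm assuming x = U+0065 (e), y = U+0301 (combining acute) and w = U+00E9 (é) as the lucky case, and treating NFC/NFD as the canonical composition/decomposition described above.

```python
import unicodedata


def compose(s):
    # canonical composition: unicode normalization form NFC
    return unicodedata.normalize("NFC", s)


def decompose(s):
    # canonical decomposition: unicode normalization form NFD
    return unicodedata.normalize("NFD", s)


def cps(s):
    # list the codepoints of s as U+XXXX, separated by spaces
    return " ".join(f"U+{ord(c):04X}" for c in s)


xy = "\u0065\u0301"  # U+x U+y: base letter e followed by combining acute
w = "\u00e9"         # U+w: the precomposed equivalent

print(cps(compose(xy)))    # U+00E9
print(cps(compose(w)))     # U+00E9
print(cps(decompose(xy)))  # U+0065 U+0301
print(cps(decompose(w)))   # U+0065 U+0301

# the unlucky case from this thread: there is no precomposed cp for
# U+0069 U+0361, so composition leaves the sequence as it is.
title_char = "\u0069\u0361"
print(cps(compose(title_char)))  # U+0069 U+0361
```

in the last case the two-codepoint sequence is the only way to encode the character, so comparing normalized forms, rather than looking at glyphs, is how one tells the encodings apart.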
> I would expect unicode to always favor single glyphs from a particular
> page over anything else.
it's always a single glyph. don't confuse letters, codepoints, and glyphs.
>
> btw, could you send me a .png with the actual title ?
i'll send you a png of the character. i don't have the books.
what language rule are you trying to get at?
- erik
>
> > i think you misunderstand how unicode works.
>
> That could very well be the case ;-) But I know how Russian language
> works regardless of what committee members think.
>
> > a base cp like U+0069 followed by a combining cp like U+0361
> > make a single character. this identification is called "composition".
> > unicode contains some precomposed cps, but not U+0069 U+0361.
>
> That's ok. My only point is -- I would expect anybody who enters
> titles into a database adhere to the rules of the language the
> title is written in. Maybe it's too much to expect, though.
>
> Thanks,
> Roman.
>