Hi,

Today at 13:44, srintuar wrote:
> As for Serbian, I don't think that really has much to do with Unicode
> itself. You could apply a special folding algorithm when doing
> searches in a Serbian context, but I don't think you would want to
> make the script ambiguous.

I'd rather make the script ambiguous than make the language ambiguous :) FWIW, I can already produce perfectly fine fonts which will display Serbian encoded using U+0400-U+045F in either the Latin or the Cyrillic script: the choice of script is commonly a personal preference in Serbia, not a property of the text. Of course, this would be a misuse of Unicode :) The sheer number of Cyrillic-to-Latin and Latin-to-Cyrillic converters available for Serbian proves that this is so.

Backwards compatibility is one of Unicode's big problems when it comes to consistency. Unicode is utterly inconsistent, and thus very hard to implement fully (for both processing and display purposes). From a practical standpoint, this makes sense (nobody would have started using Unicode in the first place if it didn't have such properties), but that doesn't mean it's a "good" property (whatever a global "good" means), only that it works to some extent. New policies such as "no more precomposed glyphs" also indicate that we're talking about a glyph repository, not a character repository (i.e. "no more precomposed glyphs, since you can get those glyphs by combining existing glyphs", even though they may have entirely different meanings and be separate characters in their own right).

> Defining a "Serbian Unified" code range would probably be futile:
> Even after such a range was defined, little could stop someone from
> representing serbian in cyrillic or latin script.

If Serbian were only encodeable with a "Serbian range", that would be much better. The keyword "representing" is to be read as "displaying" when it comes to the Serbian language. It's much better if the script is external to the encoding (e.g. the script is defined through metadata, not through data) than if the language is external to the encoding (that applies to other languages too; see my example for English and Spanish below). After all, look at how many pages Google mistags as Russian or Macedonian which are actually in Serbian (Google ignores metadata itself, so that might be the culprit).

> (then you would have three strings to search for instead of one or
> two) If you have two official scripts, you have two official scripts.

Wrong: "official" has nothing to do with it (btw, "official" refers to a country, not a language, and only one script is official in Serbia: the Cyrillic script). In Serbian, "two different scripts" is a display property. Think of the glyph vs. character debate in the Unicode sense (Unicode encodes characters, i.e. their meanings, not their looks). So, when text is in Serbian, "б" is exactly the same as "b". It's not "b in Serbian Cyrillic" and "b in Serbian Latin", it's simply "b". We can talk about a display property of such a text being in Serbian Latin or Serbian Cyrillic just as we can talk about a display property of it being in Garamond or Arial, Bold or Italic. Unicode was about meaning, right?

> As for CJK un-unification, I find that highly unlikely. Even if for
> some reason certain codepoints were unambiguated, that would still
> fit within a backwards compatible modification of unicode, and not a
> complete scrapping of it.

Indeed, and that's what I aimed at: Unicode will not be expanded in a way that solves all the problems of character encoding, and we'd need something else for that. Or, if I'm not yet clear: Unicode is not an end-point in the evolution of encodings. Unicode, as it is, is closer to a common glyph repository (AFII, anyone?) than a character repository (OK, backwards compatibility is also responsible for this, because of things like English ligatures, etc.).
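The Serbian converters mentioned above are essentially trivial, which is exactly what makes the script a display property rather than a property of the text. A minimal sketch of the Cyrillic-to-Latin direction (the table and function names are my own; a real converter would also special-case title-cased digraphs like "Lj"):

```python
# Serbian Cyrillic-to-Latin transliteration is a plain per-letter mapping;
# three letters (љ, њ, џ) map to Latin digraphs (lj, nj, dž).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def cyr_to_lat(text: str) -> str:
    # Map each letter via its lowercase form; restore capitalization after.
    out = []
    for ch in text:
        mapped = CYR_TO_LAT.get(ch.lower(), ch)
        out.append(mapped.capitalize() if ch.isupper() else mapped)
    return "".join(out)

print(cyr_to_lat("Љубљана"))  # → Ljubljana
print(cyr_to_lat("Србија"))   # → Srbija
```

The reverse direction is nearly as simple, though Latin "nj"/"lj"/"dž" are ambiguous in rare words (they can be either one Cyrillic letter or two), which is one reason converters exist as tools rather than the two spellings being the same encoded text.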
FWIW, I'd assert that "j" in Spanish is not the same thing as "j" in English (and that is easily proved), apart from them being represented with the same *glyph*. Of course, with Unicode, the current practice is to add language marks to the text stream instead (e.g. with XML, using 'xml:lang' tagging, or global metadata such as MIME header fields), but that defeats one big advantage of Unicode: you can read it with a stateless machine, from any point in the stream (a big win for network applications). This holds even for a transformation format such as UTF-8 (Unicode itself is no big deal without UTF-8, IMHO).

The ideal case, IMO, would be a character encoding standard where one can deduce all properties of a character from the character itself (i.e. when my parser runs across "б" it will know that it is a "lowercase b in Serbian" or a "lowercase b in Russian"). Technically, this character encoding could be mapped to something like Unicode or AFII for display purposes. Of course, this brings some problems in the input area. Yet many more advantages would come with such an encoding, such as better/faster collation (one can choose to ignore the language bits of a character to get the current behaviour, which is highly unpredictable when one changes a locale), etc.

> Speaking of political issues: if you advocate
> un-unification, I'd be interested to hear which language's script you
> would place closest to zero.

I don't really mind. There are many objective criteria that could be established (e.g. by number of representable characters [smallest first/largest first], approximate number of worldwide users at the time of introduction, first-come-first-served, etc.). I'd have more problems with Unicode if I actually cared about things like what comes "closer to zero" :)

Of course, I'm aware that changes of this kind are very far-fetched, and would require substantial changes in our systems. That's not going to happen anytime soon.
Such an encoding requires a change in thinking, too ("what do you mean, Spanish j is different from English j? They're the same!"). There are many problems, like expandability (i.e. you wouldn't be able to use a new language even though it might use the same glyphs some other language uses, though this is more of a "bootstrapping" problem), so I'm not saying this would be the end-point in the evolution of encodings either. This doesn't apply only to multi-script languages, but with them we may see the problem more clearly :-) I'm simply saying that this kind of encoding has some properties that are better than what Unicode has to offer, and that Unicode is far from being the evolution end-point either. Perhaps my idea of what it should look like is not very good, but I outline problems which could also be solved with another encoding and character set. So, let's see what the future holds in store for us :)

Also, DOS wasn't deprecated in a short time span either. It took at least 10 years for it to be considered deprecated (OK, some considered it deprecated right when it was released :).

> Changes to unicode don't require any change to UTF-8 itself.
> Sacrificing native script programming comments and literals isn't worth
> it for something that may never change, imo.

I don't see what this has to do with Unicode/UTF-8 being The Ultimate among encodings. It's only the best we've got so far.

Sorry for this long rant about Unicode: I don't feel bad about it, and UTF-8 is about the only encoding I use, for all its magnificent properties, but I still see problems with it :)

Cheers,
Danilo

PS. Note that I misused "encoding" to sometimes mean "character set" as well. Hope this won't be too confusing :)

-- 
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/