--On Wednesday, 21 November, 2001 09:08 -0800 Kenneth Whistler <[EMAIL PROTECTED]> wrote:
>> We must though be very careful not to inadvertently exclude
>> scripts/characters that are used by some languages even
>> though we thought they were merely symbols.
>
> The list you are looking for is provided by the Unicode
> Consortium:
>
> http://www.unicode.org/Public/UNIDATA/Scripts.txt
>
> That gives script assignments for Unicode characters (Latin,
> Greek, Cyrillic, Devanagari, Bengali, Han, ...), and provides
>...
> Note that many scripts inherently include combining
> characters. I absolutely agree with Kent that a blanket
> prohibition of combining characters is unacceptable. In a
> discussion dominated by English, Chinese, and Korean
> speaker/writers, it might seem o.k., but I assure you that if
> there were as many Arabic, Urdu, Hindi, and Bengali
> speaker/writers participating, it would *not* seem o.k.

Ken, I may not have been reading closely enough, but I don't believe this discussion has included a proposal to ban combining characters. I do have an issue with them, but I think it is separable (see below).

> Otherwise, deciding to omit punctuation, space characters,
> format control characters, and symbols is fine as a
> conservative approach to the problem, however.

Good to hear this.

The combining character problem (if it is a problem) is that, so far, we have no proposals on the table that would require that a DNS label be a valid name in any particular language, or even that it be drawn, homogeneously, from any particular script. Until and unless one of those rules is made (my guess is that it would be nearly impossible to do so, but this is not my area of expertise), we are thrown back on the traditional DNS rule that, subject to the hyphen-placement rule, any valid character of the chosen CCS can appear in any relationship to any other valid character of the CCS. In particular, there is no way to require or assume script-homogeneity.
If, to use your example, we take a selection of Arabic, Urdu, Hindi, and Bengali characters, with each script designated by its first letter, then AUHBBHUA ought to be a valid label. While we know how to construct AAAA, UUUU, HHHH, and BBBB, regardless of whether a given character is combining or non-combining, I worry about interpretation and ambiguity if combining characters (or partial breaks, etc.) are taken from one of these scripts and surrounded by characters from an unrelated script. Maybe it is not a problem, but I'd like someone to assure me that is the case.

    john
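
To make the mixed-script concern concrete, here is a minimal Python sketch that flags a combining mark whose script differs from that of the base character before it. It is an illustration only, not a proposal: the function names (script_guess, cross_script_marks) are hypothetical, and the script of a character is guessed from the first word of its Unicode character name, which is a crude stand-in for the assignments in Scripts.txt. Generic marks whose names begin with "COMBINING" are skipped, digits and the hyphen are treated as their own pseudo-scripts, and a mark with no preceding base is reported with a base of None.

```python
import unicodedata


def script_guess(ch):
    """Rough script guess: the first word of the Unicode character name.

    ASSUMPTION for illustration only; real code should look the character
    up in the Unicode Consortium's Scripts.txt instead.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def cross_script_marks(label):
    """Return (index, base_script, mark_script) for each combining mark
    whose guessed script differs from that of the preceding base character."""
    findings = []
    base_script = None  # no base character seen yet
    for i, ch in enumerate(label):
        if is_mark(ch):
            mark_script = script_guess(ch)
            if mark_script == "COMBINING":
                continue  # generic mark, not tied to one script
            if mark_script != base_script:
                findings.append((i, base_script, mark_script))
        else:
            base_script = script_guess(ch)
    return findings


# Bengali letter KA followed by a Bengali vowel sign: one script throughout.
same = cross_script_marks("\u0995\u09BF")

# Bengali letter KA followed by a Devanagari vowel sign: the mixed case.
mixed = cross_script_marks("\u0995\u093F")
```

Here `same` comes back empty, while `mixed` reports the Devanagari mark at index 1 attached to a Bengali base. Whether such a sequence should be rejected, or merely rendered oddly, is exactly the open question; the sketch only shows that the case can be detected mechanically once script assignments are available.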
