Keith J. Schultz wrote:
> Hi Philip,
>
> 1) I do not know Vietnamese!
>
> 2) If I did, using the proper BMP would give me the answer.
>    As "sang" would be a sequence of singular octets, and Vietnamese
>    would use multi-byte sequences!
>
> Case closed! Like I mentioned, there are often ways used to reduce the
> length of the multibyte sequences. In that case one has to know the
> process used to get the proper Unicode character code!
It is not necessary to "know" a language in order to determine algorithmically in which language a particular stretch of text is written, if such algorithmic determination is possible. I do not "know" Hebrew, but even I know that "בית דין" is Hebrew and that "你好" is not. What I do not know (and what I challenge you to tell us) is whether "sang" is English or Vietnamese.

You wrote: "for efficiency reasons, utf-8 strings are not properly encoded and programs assume a particular language, to save space." I invited you to tell us (the XeTeX list members, that is) what would be a "properly encoded utf-8 string" for the sequence "sang" which would enable a computer algorithm to determine whether that string was "sang" (Vietnamese) or "sang" (English).

I am still hoping that you will be able to tell us what that properly encoded utf-8 string is, rather than just metaphorically waving your arms in the air while throwing around phrases such as "proper BMP", "singular octets" and "multi-byte sequences".

Philip Taylor
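
P.S. For anyone who wishes to check the point for themselves, here is a minimal sketch in Python (my choice of language, purely for illustration). It assumes nothing beyond the standard library; the helper names `utf8_hex` and `is_valid_utf8`, and the accented comparison word "sáng", are introduced here as examples and do not come from the earlier discussion. It shows that the four letters of "sang" yield the same four octets whatever language the writer had in mind, that multi-byte sequences appear only when a character lies outside ASCII, and that a longer-than-necessary ("overlong") encoding of "s" is rejected by a conforming decoder.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex octets."""
    return " ".join(f"{octet:02x}" for octet in text.encode("utf-8"))


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether raw decodes as well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "sang" consists solely of ASCII letters, so each letter is one octet.
# Nothing in these octets records whether the word is English or Vietnamese.
print(utf8_hex("sang"))       # 73 61 6e 67

# Only a character outside ASCII (here U+00E1, a with acute) needs two octets.
print(utf8_hex("s\u00e1ng"))  # 73 c3 a1 6e 67

# An overlong two-octet form of "s" (c1 b3) is not well-formed UTF-8.
print(is_valid_utf8(b"sang"))          # True
print(is_valid_utf8(b"\xc1\xb3ang"))   # False
```

In other words, the encoding is a function of the code points alone; any determination of language would have to come from somewhere other than the octets.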