Re: New version of TR29:
Mark Davis wrote:
> There is a new version of Unicode Technical Report #29: Text Boundaries on
> <http://www.unicode.org/reports/tr29/>, covering grapheme-cluster, word and
> sentence boundaries. There are significant modifications to this version;
> for a summary, see <http://www.unicode.org/reports/tr29/#Modifications>.
> This is a draft version, not a final version. There are a number of open
> issues remaining. Feedback is welcome
> Feedback that is received before the UTC meeting (starting August 20) can be
> made available for the discussion of TR29 at that meeting.

FYI: There is an open issue regarding grapheme-cluster boundaries in Thai.

* SARA AM as an Other_Grapheme_Extend?

Should "0E33;THAI CHARACTER SARA AM" be a Grapheme_Extend character or not? By Unicode definition, SARA AM is an Lo, not a combining character. But many Thai applications (MS Office, Windows, OpenOffice.org) treat SARA AM like a combining character (unlike SARA AA), i.e. the cursor always jumps over it. Whether this is right or not is controversial, but the fact is that Windows users are used to it.

My personal question is: if it is favorable for Thai to treat SARA AM as part of the previous grapheme cluster, is it possible for the UTC to consider adding SARA AM as an Other_Grapheme_Extend?

---

I also notice that Grapheme_Link has been removed from the grapheme-cluster definition. This is appropriate for Thai because PHINTHU should not cause two grapheme clusters to be linked together.

-- 
Feel free to disclose the contents of this message.
Regards,
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html
Re: logical order (and input method)
Kenneth Whistler wrote:
> The "Indic model" is largely based on an abstraction of
> the phonology of the language the script is used to write
> The "Thai model" is a typewriter-derived variant of the
> Indic model that rules out reordrant or surroundrant characters,
> because of the limitations of typewriter technology

Just out of curiosity: I'm Thai and used to the Thai model, so I'm wondering how other Brahmi-derived scripts are
1) typed on a typewriter
2) typed on a computer keyboard
3) hand-written

That is, do they all use the same (logical) order?

-- 
Feel free to disclose the contents of this message.
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html
logical order (and Thai)
Kenneth Whistler wrote:
> Ummm. Logical order, visual order, aural order, phonemic order,
> linear order... We are in danger of losing track of the ground we
> stand on.

Totally agree.

> Logical order versus visual order, in the Unicode Standard,
> refers to the relationship between backing store order and
> display order. The main issue is for bidirectional text.

Fortunately, this is put clearly enough in the Unicode book.

> There is a separate issue which has to do with alternative
> models of Brahmi-derived scripts.
> The "Indic model" ...
> The "Thai model" ...
> Note, however, that *both* of these models inherently imply non-linear
> mappings at some level. In the Indic model, the mapping from
> phonology to backing store is straightforward, but the mapping
> from backing store to display (i.e., the "rendering") will
> have local direction reversals and/or 1-2 character-to-glyph
> mappings, in the case of reordrant or surroundrant vowels.
> The Thai model displaces the mapping complexity to the
> mapping from phonology to backing store, while simplifying the
> rendering.

But this is not. It would be easier to avoid these confusions if the above description of the "non-linear mapping" of Brahmi-derived scripts were written clearly in Chapter 2 of the book, in the section about logical order.

> Given this picture, it should now be easier to see why Thai
> rendering is easier than Devanagari, but Thai sorting
> (which runs afoul of the mismatch between phonology and
> backing store order) in more problematical. It is simply
> a tradeoff of which level of processing gets the complexity.

Does this mean that there is nothing illogical or less preferred about the Thai model? If so, please also consider the following question (a little bit rephrased).

 Original Message 
Subject: Logical_Order_Exception actually means Phonetic_Order_Exception ?
Date: Sat, 01 Jun 2002 12:00:09 +0700 From: Samphan Raruenrom <[EMAIL PROTECTED]> Organization: NECTEC To: Unicode Public List <[EMAIL PROTECTED]> CC: Thai IT Standards Newsgroup <[EMAIL PROTECTED]>, Virach Sornlertlamvanich <[EMAIL PROTECTED]>, Trin Tansetthi <[EMAIL PROTECTED]>, Suwit Srivilairith <[EMAIL PROTECTED]> It's said (below) that ALL scripts in Unicode are stored in 'logical order'. And for the most part, logical order corresponds to 'phonetic order'. And the only exceptions are Thai and Lao. Do you think that Logical_Order_Exception should actually be called Phonetic_Order_Exception? 8<- References --->8 The definition of this newly introduced property in Unicode 3.2 :- http://www.unicode.org/unicode/reports/tr28/#database Logical_Order_Exception: There are a small number of characters (in the Thai and Lao scripts) that do not use logical order. These characters require special handling in most processing. The difinition of Logical Order :- The Unicode Standard 3.0 : Section 2.2 Unicode Design Principles Logical Order: For "ALL" scripts, Unicode text is stored in 'logical order' in the memory representation, roughly corresponing to the order in which text is typed in via the keyboard. ... For the most part, logical order corresponds to 'phonetic order'. The only current exceptions are the Thai and Lao scripts, which employ visual ordering; in these two scripts, users traditionnally type in visual order rather than phonetic order. The followings are the only Logical_Order_Excention in Unicode 3.2 :- http://www.unicode.org/Public/3.2-Update/PropList-3.2.0.txt 0E40..0E44; Logical_Order_Exception # Lo [5] THAI CHARACTER SARAE .. THAI CHARACTER SARA AI MAIMALAI 0EC0..0EC4; Logical_Order_Exception # Lo [5] LAO VOWEL SIGN E .. LAO VOWEL SIGN AI -- Feel free to forward or quote to any individual or public. Samphan Raruenrom Information Research and Development Division, National Electronics and Computer Technology Center, Thailand. 
http://www.nectec.or.th/home/index.html
Re: Is UniCode's Thai character representation is acceptable by TISI or not?
Dear Mark, Thanks for informative reply. :) Mark Davis wrote: > Some comments below. > - Original Message - > From: "Samphan Raruenrom" <[EMAIL PROTECTED]> > To: "Asmus Freytag" <[EMAIL PROTECTED]> > Cc: "Sreedhar M" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; "Rick McGowan" ><[EMAIL PROTECTED]> > Sent: Tuesday, July 16, 2002 07:22 > Subject: Re: Is UniCode's Thai character representation is acceptable by TISI or not? >>Asmus Freytag wrote: >>>At 12:06 PM 7/16/02 +0700, Samphan Raruenrom wrote: >>Problems from Unicode properties >>- error in combining class of vowel signs make normalization worthless >> in some cases. This is important if you want to compare strings. > Meaning: the normalized forms of two strings are not equal in cases > where Thais would consider them equal, right? Definitely. >>- decomposition of SARA AM add more problem to normalization > I don't recall seeing that note; I'll look forward to your report. Please see my discussion with khun Peter Constable quoted below. >>- some properties make grapheme cluster for Thai >> imcompatible with the way Thai expect, e.g PINTHU as >> virama, SARA AM not a combining character > In the last UTC, action was taken that is not yet in the draft TR on > boundaries. In particular, this affects Thai. Glad to hear that :) >>Inaccuracy in the Unicode book >>- backspace 'always' use the same (grapheme cluster) character boundary >> as Del and left/right arrow. Actually Thai use backspace to delete single >> character not the whole cluster. So character boundary for backspace should >> be locale specific. > This text will be overriden by the TR. Great! >>- in Thai, zero width space is said to be able to expand in full-justified >> paragraph. Actually it is always zero width. > There may be some misunderstanding here. 
What is meant is: if you had > the sequence ABCD, and between the B and the C was a zero-width space, > AND you were inter-character spacing for justification, you would not > expect to see: > A BC D > Instead, you would expect to see > ABCD > That is, the zero-width space does not prevent the characters from > using inter-character spacing. Sorry for misunderstanding that. A short explanation/example like this in the book (chapter 9), will help a lot. >>These are things you have to khow after learning the Unicode standard >>if you plan to work with Thai language, to 'code around' the problem >>to make it acceptable for Thai people. >>I plan to write a formal report on the issue, not to change the standard, >>but to note what is wrong and what have to be code around. So people >>who like to work with Thai language (like you) will know the right thing >>to do and not repeat the same mistake as in some softwares. Original Message Subject: Re: Fixed position combining classes Date: Thu, 06 Jun 2002 21:53:35 +0700 From: Samphan Raruenrom <[EMAIL PROTECTED]> Organization: NECTEC To: [EMAIL PROTECTED] CC: Arthit Suriyawongkul <[EMAIL PROTECTED]>,Suwit Srivilairith <[EMAIL PROTECTED]>, Thai IT Standards Newsgroup <[EMAIL PROTECTED]>, Trin Tansetthi <[EMAIL PROTECTED]>, Unicode Public List <[EMAIL PROTECTED]>, Virach Sornlertlamvanich <[EMAIL PROTECTED]> References: <[EMAIL PROTECTED]> [EMAIL PROTECTED] wrote: > Now, the problem with the sequences above is that they are visually > indistinct, meaning that they could not possibly be used by users for a > semantically-relevant distinction. From the user's perspective, they are > identical. Moreover, it would fit a user's expectations to have string > comparisons to equate them (e.g. a search for < 0e35, 0e39 > should find a > match if the data contains < 0e39, 0e35 >). They are both > canonically-ordered sequences, however, since U+0E35 has a combining class > of 0. 
The result is that string comparisons that rely on normalisation into > any one of the existing Unicode normalisation forms (NFD, NFC, NFKD, NFKC) > will fail to consider these as equal. Let's talk about somethings that really happend in Thai. 1) 0E01;THAI CHARACTER KO KAI;Lo;0 0E38;THAI CHARACTER SARA U;Mn;103 0E4D;THAI CHARACTER NIKHAHIT;Mn;0 The sequences (which happend in Pali transcription) (a) KO KAI + SARA U + NIKHAHIT (b) KO KAI + NIKHAHIT + SARA U They're look the same but not equal because combining class of NIKHAHIT happend to be 0 so both are normalized. 2) 0E32;THAI CHARACTER SARA AA;Lo;0 0E48;THAI CHARACTER MAI EK;Mn;107 0E33;THAI CHARACTER
Re: Is UniCode's Thai character representation is acceptable by TISI or not?
Asmus Freytag wrote:
> At 12:06 PM 7/16/02 +0700, Samphan Raruenrom wrote:
>> There're some mistakes in Unicode char.
>> properties for Thai char. and you have to "code around" that.
> And the mistakes are?

I've discussed a few of them here on this list. I'll write a more formal report on the issue later. Here are some headings.

Problems from Unicode properties
- An error in the combining class of vowel signs makes normalization worthless in some cases. This is important if you want to compare strings.
- The decomposition of SARA AM adds more problems to normalization.
- Some properties make grapheme clusters for Thai incompatible with what Thais expect, e.g. PHINTHU as virama, SARA AM not a combining character.

Inaccuracy in the Unicode book
- Backspace is said to 'always' use the same (grapheme cluster) character boundary as Del and the left/right arrows. Actually, Thai uses backspace to delete a single character, not the whole cluster. So the character boundary for backspace should be locale specific.
- In Thai, the zero width space is said to be able to expand in a full-justified paragraph. Actually it is always zero width.

These are things you have to know after learning the Unicode standard if you plan to work with the Thai language, so that you can 'code around' the problems and make your software acceptable to Thai people.

I plan to write a formal report on the issue, not to change the standard, but to note what is wrong and what has to be coded around. That way people who would like to work with the Thai language (like you) will know the right thing to do and not repeat the mistakes made in some software.

-- 
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html
Re: Is UniCode's Thai character representation is acceptable by TISI or not?
Sreedhar M wrote:
> Thank U for Your kind response.Please let me know whether
> Unicode's Thai character represation is acceptable by TISI or not? It is
> very essential to our project.

Yes. TISI took part in the representation of Thai characters in ISO 10646 (and, indirectly, Unicode). Unicode has a backward-compatibility goal, so it takes the whole Thai block of TIS-620 into Unicode directly :-

  unicode = tis620 - 0xa0 + 0x0e00

This is perfect and eases the transition of code. We can modify our code just a little bit to make it work with both TIS-620 and Unicode (see libinthai, a Thai word-break library, as an example).

However, there are still some problems that go beyond the assignment of code points: the character properties. There are some mistakes in the Unicode character properties for Thai characters, and you have to "code around" them.
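The offset formula above can be tried out with a few lines of Python. This is only a minimal sketch: the function name `tis620_to_unicode` is made up for illustration, and the two assigned byte ranges (0xA1..0xDA and 0xDF..0xFB) are taken from my reading of the TIS-620 code chart rather than from this thread.

```python
# Offset between a TIS-620 byte and its Unicode code point (from the formula above).
THAI_OFFSET = 0x0E00 - 0xA0

# Byte ranges that TIS-620 actually assigns (assumption: read from the code chart).
ASSIGNED_RANGES = ((0xA1, 0xDA), (0xDF, 0xFB))


def tis620_to_unicode(byte: int) -> int:
    """Map one TIS-620 byte value to a Unicode code point (hypothetical helper)."""
    if 0x00 <= byte < 0x80:
        return byte  # the lower half is plain 7-bit ASCII
    if any(lo <= byte <= hi for lo, hi in ASSIGNED_RANGES):
        return byte + THAI_OFFSET
    raise ValueError(f"byte 0x{byte:02X} is not assigned in TIS-620")


if __name__ == "__main__":
    for b in (0x41, 0xA1, 0xD3, 0xFB):
        print(f"0x{b:02X} -> U+{tis620_to_unicode(b):04X}")
```

Python also ships a `tis_620` codec, so `bytes([0xA1]).decode("tis_620")` can be used to cross-check the arithmetic.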
Re: What is TISI character Code?
Sreedhar.M wrote:
> I would lilke to make my application to Thai language compatible.In
> that way I heard the term TISI character code.That's why I want to know
> about the TISI character code.Please let me know if anybody have an idea
> regarding this.

TISI is the name of the standards organization in Thailand, the Thai Industrial Standards Institute. The character set name is TIS-620. It is an 8-bit character set that extends 7-bit ASCII with Thai characters. See :-
http://www.nectec.or.th/it-standards/

-- 
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html
Re: Fixed position combining classes
[EMAIL PROTECTED] wrote:
> On 06/02/2002 05:40:05 AM Samphan Raruenrom wrote:
>>>My opinion is that they should have been simplified, but that setting the
>>>bulk of them to 0 was a mistake and creates some significant problems
>>>(which go a step beyond the questions you raise here).
>>Can you elaborate on this?
> Given the characters
> : 0E35;THAI CHARACTER SARA II;Mn;0
> : 0E39;THAI CHARACTER SARA UU;Mn;103
> consider the sequences
> < 0e35, 0e39 > vs. < 0e39, 0e35 >
> I'm guessing your first reaction will be to say that these cannot co-occur.

No, not at all :) I have already learned from you to be more open-minded about this kind of Unicode thing.

> That is true for the Thai language, but may not be true for other languages
> written with Thai script.

I've read a book on the history of Thai characters and found that many vowels changed position through history. So this issue is more understandable to me now.

> Now, the problem with the sequences above is that they are visually
> indistinct, meaning that they could not possibly be used by users for a
> semantically-relevant distinction. From the user's perspective, they are
> identical. Moreover, it would fit a user's expectations to have string
> comparisons to equate them (e.g. a search for < 0e35, 0e39 > should find a
> match if the data contains < 0e39, 0e35 >). They are both
> canonically-ordered sequences, however, since U+0E35 has a combining class
> of 0. The result is that string comparisons that rely on normalisation into
> any one of the existing Unicode normalisation forms (NFD, NFC, NFKD, NFKC)
> will fail to consider these as equal.

Let's talk about some things that really happen in Thai.
1) 0E01;THAI CHARACTER KO KAI;Lo;0
   0E38;THAI CHARACTER SARA U;Mn;103
   0E4D;THAI CHARACTER NIKHAHIT;Mn;0

The sequences (which occur in Pali transcription)
(a) KO KAI + SARA U + NIKHAHIT
(b) KO KAI + NIKHAHIT + SARA U
look the same but do not compare equal: because the combining class of NIKHAHIT happens to be 0, both are already in normalized form.

2) 0E32;THAI CHARACTER SARA AA;Lo;0
   0E48;THAI CHARACTER MAI EK;Mn;107
   0E33;THAI CHARACTER SARA AM;Lo;0;L;<compat> "NIKHAHIT" "SARA AA"

There are two ways to represent the word KO KAI + MAI EK + SARA AM:
(a) KO KAI + MAI EK + SARA AM
(b) KO KAI + NIKHAHIT + MAI EK + SARA AA
(b) must be in this sequence to get the intended look for the word (not that this is a valid sequence for Thai/WTT). That is, the mai-ek is on top of the nikhahit.

The problem is with the NFKD/NFKC form of (a), which is
(c) KO KAI + MAI EK + NIKHAHIT + SARA AA
This will be rendered with the nikhahit on top of the mai-ek, which is not the same as (a) and is not the intended look.

So this means that the string changes its shape after normalization. Is this a violation of any principle? The problem also comes from the fact that the combining class of NIKHAHIT is 0, which makes reordering of (c) impossible.

-- 
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html
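Both cases above can be reproduced with Python's standard `unicodedata` module. This is a small sketch, not part of the original message; the variable and helper names are mine, and the results depend only on the combining classes and the SARA AM decomposition quoted above.

```python
import unicodedata

KO_KAI, SARA_AA, SARA_AM = "\u0e01", "\u0e32", "\u0e33"
SARA_U, MAI_EK, NIKHAHIT = "\u0e38", "\u0e48", "\u0e4d"

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def equal_under(form: str, a: str, b: str) -> bool:
    """True if a and b compare equal after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Case 1: NIKHAHIT has combining class 0, so neither order is ever reordered.
case1_a = KO_KAI + SARA_U + NIKHAHIT
case1_b = KO_KAI + NIKHAHIT + SARA_U

# Case 2: the compatibility decomposition of SARA AM lands after MAI EK.
case2_a = KO_KAI + MAI_EK + SARA_AM
case2_b = KO_KAI + NIKHAHIT + MAI_EK + SARA_AA
case2_c = unicodedata.normalize("NFKD", case2_a)

if __name__ == "__main__":
    for ch in (SARA_U, MAI_EK, NIKHAHIT):
        print(unicodedata.name(ch), unicodedata.combining(ch))
    for form in FORMS:
        print(form, "case 1 equal:", equal_under(form, case1_a, case1_b))
    print("NFKD of 2(a):", " ".join(f"{ord(c):04X}" for c in case2_c))
    print("matches 2(b):", case2_c == case2_b)
```

Under these assumptions no normalization form equates 1(a) with 1(b), and the NFKD form of 2(a) comes out as sequence (c), with MAI EK before NIKHAHIT, not as (b).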
Indic scripts, visual-order vs phonetic-order
Hello,

I'm wondering about the practice of using visual order vs phonetic order in Indic writing on a typewriter vs on a computer vs in handwriting. Are they all the same?

I have also heard that there are two input-method styles for Indic, visual-order and phonetic-order. Is that true? And which is more popular?

-- 
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html
Re: Thai character names
[EMAIL PROTECTED] wrote:
> Another interesting point is that two of these four letters are now
> considered obsolete: kho khuat and kho khon. I have heard an explanation --
> but don't know if it is true -- that the King decided to deprecate them
> when typewriters were being adapted for Thai because there were two too
> many characters that could be fit onto the limitations of the imported
> mechanisms.

I heard that before, from a source related to Thai IT standardization. But I have recently found that this may not be true.

A book on the history of Thai characters says that on 29-May-1942 the prime minister, Por. Pi-Boon-Song-Kalm, removed 13 consonants and 5 vowels. After his government, people went back to the old system, but the two characters kho khuat and kho khon never came back.

IMO, this is more likely what actually happened, because Thai typewriters have some keys to spare, and these are assigned to other things such as the combination of a tone mark and a vowel. So I think that by the time typewriters were being adapted to Thai, those two letters may already have been lost.

The current keyboard standard adds those two letters, and more, by removing keys that are considered redundant.

-- 
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html
Re: Fixed position combining classes (Was: Combining class for Thai characters)
Hi :) Thank you for the invaluable reply and sorry for my confusing English. I'll try to be as clear as possible in the future. I'm not good at English, especially at using the apropriate level (polite/aggressive) of language for particular meaning. I'm learning about Unicode and love it every much. The problem is that I only have experiences with processing Thai. So all of my comments are actually questions. Please add "correct me if I'm wrong" to all of them. I'll throw in related data from the Unicode website/book to make it clear for others in the discussions, which you can see in the Ccs. Please use Reply All so everyone will get it. [EMAIL PROTECTED] wrote: > On 05/21/2002 10:07:32 AM Samphan Raruenrom wrote: >>Why the above-attached vowel signs/marks all have combining class 0? > I'm not positive on the history, but here's my take: As you mention, there > is a sequencing constraint in WTT. In an earlier version of the Unicode > standard (prior to 2.1) all of the Thai characters of category Mn had > fixed-position classes. I'm guessing that that was influenced by a notion > of there needing to be a specific order, as in WTT. This is what I've guessed too. >>So (correct me if I'm wrong) the notion of invalid sequence in Unicode >>is script-specific. > Yes, but be careful of misinterpreting combining classes as saying > anything about what is or isn't a valid sequence -- they say > absolutely nothing in that regard. I see. I misunderstood that. > It didn't really accomplish anything to have all the different fixed > position classes, though. If anything, it created some complications, > which I won't elaborate on. Your answer leads me to the version 2.0.14 of UnicodeData, quoted. 
UnicodeData-2.0.14.txt : 0E31;THAI CHARACTER MAI HAN-AKAT;Mn;98 : 0E34;THAI CHARACTER SARA I;Mn;99 : 0E35;THAI CHARACTER SARA II;Mn;100 : 0E36;THAI CHARACTER SARA UE;Mn;101 : 0E37;THAI CHARACTER SARA UEE;Mn;102 : 0E38;THAI CHARACTER SARA U;Mn;103 : 0E39;THAI CHARACTER SARA UU;Mn;104 : 0E3A;THAI CHARACTER PHINTHU;Mn;105 : 0E47;THAI CHARACTER MAITAIKHU;Mn;106 : 0E48;THAI CHARACTER MAI EK;Mn;107 : 0E49;THAI CHARACTER MAI THO;Mn;108 : 0E4A;THAI CHARACTER MAI TRI;Mn;109 : 0E4B;THAI CHARACTER MAI CHATTAWA;Mn;110 : 0E4C;THAI CHARACTER THANTHAKHAT;Mn;111 : 0E4D;THAI CHARACTER NIKHAHIT;Mn;112 : 0E4E;THAI CHARACTER YAMAKKAN;Mn;128 I agree that they should be simplified. All of the Mn are simply assigned distinct increasing values (note that none is 0). > At any rate, between 2.0 and 3.0, a lot of fixed-position > classes, both for Thai and for other scripts, were simplified. In so > doing, many were set to 0. http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html#Modification History : Unicode 2.1.8 : Changes to combining class values. Most Indic fixed position class : non-spacing marks were changed to combining class 0. This fixes some : inconsistencies in how canonical reordering would apply to Indic : scripts, including Tibetan. Indic interacting top/bottom fixed position : classes were merged into single (non-zero) classes as part of this : change. Tibetan subjoined consonants are changed from combining class 6 : to combining class 0. Thai pinthu (U+0E3A) moved to combining class 9. : Moved two Devanagari stress marks into generic above and below combining : classes (U+0951, U+0952). Let's talk about the idea behind combining classes. From "The Unicode Standard 3.0" and information from you, it's my impression that : (1) The reason for having combining classes came from the different ways possible to encode the same character. The same character must always compare eqaul no matter how it is encoded, using precomposed characters or through composition. 
(2) The criterion for assigning combining classes is that the string before and after normalization must be rendered the same. Text that looks the same must always compare equal, regardless of the order of (non-interacting) marks in the memory representation. For example,

  BASE + ABOVE_MARK + BELOW_MARK = BASE + BELOW_MARK + ABOVE_MARK

At least for Indic (which includes Thai), the criteria before 2.1 seemed to ensure just (1) and entirely disregarded typographically interacting marks. This could be accomplished without combining classes at all; simply sorting the marks by code point would do. To ensure (2), interacting marks must be assigned the same (non-zero) combining class, as said in the modification history (requoted).

Note: Unlike other classes, the relation between the different fixed position classes is not clear. All I know is that classes 10..199 are called fixed position classes. I can't find any detail on that. Do you have any?

: Indic interacting top/bottom fixed position classes were merged into
: single (*non_zero*) classes as par
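Point (2) above can be checked with a short Python sketch using the standard `unicodedata` module. The Latin marks are my own choice of illustration (they are not from the message); the Thai pair is the SARA II / SARA UU example discussed in this thread.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, class 230 (above)
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, class 220 (below)
SARA_II = "\u0e35"    # Thai above-vowel, class 0
SARA_UU = "\u0e39"    # Thai below-vowel, class 103


def nfd(text: str) -> str:
    """Canonical decomposition followed by canonical reordering."""
    return unicodedata.normalize("NFD", text)


# Non-interacting marks with distinct non-zero classes are sorted by class,
# so both input orders end up as the same string.
above_first = "a" + ACUTE + DOT_BELOW
below_first = "a" + DOT_BELOW + ACUTE

# A class-0 mark blocks reordering, so the two Thai orders stay distinct.
thai_above_first = "\u0e01" + SARA_II + SARA_UU
thai_below_first = "\u0e01" + SARA_UU + SARA_II

if __name__ == "__main__":
    print("Latin equal after NFD:", nfd(above_first) == nfd(below_first))
    print("Thai equal after NFD: ", nfd(thai_above_first) == nfd(thai_below_first))
```

The contrast between the two pairs is what the argument turns on: reordering only happens among marks whose combining class is non-zero.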
Logical_Order_Exception actually means Phonetic_Order_Exception ?
8<->8
The definition of this newly introduced property in Unicode 3.2 :-
http://www.unicode.org/unicode/reports/tr28/#database

Logical_Order_Exception: There are a small number of characters (in the Thai and Lao scripts) that do not use logical order. These characters require special handling in most processing.

The definition of Logical Order :-
The Unicode Standard 3.0 : Section 2.2 Unicode Design Principles

Logical Order: For "ALL" scripts, Unicode text is stored in 'logical order' in the memory representation, roughly corresponding to the order in which text is typed in via the keyboard. ... For the most part, logical order corresponds to 'phonetic order'. The only current exceptions are the Thai and Lao scripts, which employ visual ordering; in these two scripts, users traditionally type in visual order rather than phonetic order.
8<->8

ALL scripts in Unicode are stored in 'logical order'. For the most part, logical order corresponds to 'phonetic order'. The only exceptions are Thai and Lao. Do you think that Logical_Order_Exception should actually be called Phonetic_Order_Exception?

8<- References --->8
The following are the only Logical_Order_Exception characters in Unicode 3.2 :-
http://www.unicode.org/Public/3.2-Update/PropList-3.2.0.txt

0E40..0E44; Logical_Order_Exception # Lo [5] THAI CHARACTER SARA E .. THAI CHARACTER SARA AI MAIMALAI
0EC0..0EC4; Logical_Order_Exception # Lo [5] LAO VOWEL SIGN E .. LAO VOWEL SIGN AI

-- 
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html
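To make the "special handling" concrete, here is a rough Python sketch of what a process such as sorting might do with the two ranges listed above. The function names are hypothetical, and the swap rule (exchange a leading vowel with the single character that follows it) is a deliberate simplification, not a complete Thai or Lao collation algorithm.

```python
# The two Logical_Order_Exception ranges from PropList-3.2.0.txt quoted above.
LOGICAL_ORDER_EXCEPTION = ((0x0E40, 0x0E44), (0x0EC0, 0x0EC4))


def is_logical_order_exception(ch: str) -> bool:
    """True for the Thai and Lao leading vowels stored in visual order."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LOGICAL_ORDER_EXCEPTION)


def to_phonetic_order(text: str) -> str:
    """Swap each leading vowel with the character after it (simplified rule)."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if (is_logical_order_exception(chars[i])
                and not is_logical_order_exception(chars[i + 1])):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


if __name__ == "__main__":
    stored = "\u0e40\u0e01"  # SARA E + KO KAI, as typed and stored (visual order)
    swapped = to_phonetic_order(stored)
    print(" ".join(f"{ord(c):04X}" for c in swapped))  # consonant now comes first
```

A real implementation would also have to deal with consonant clusters and with text that contains no following consonant; the sketch only shows where the visual-order storage forces extra work.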
Combining class for Thai characters
it;;; 0E4E;THAI CHARACTER YAMAKKAN;Mn;0;NSM;N;THAI YAMAKKAN 0E4F;THAI CHARACTER FONGMAN;Po;0;L;N;THAI FONGMAN >88< Regards, Samphan Raruenrom Information Research and Development Division National Electronics and Computer Technology Center, Thailand. http://www.nectec.or.th/home/index.html
Re: Thai word list
Werner LEMBERG wrote:
> I'm searching a large word list for Thai which is freely available,
> i.e., either under a license similar to GPL (resp. compatible to the
> GPL) or in the public domain.
> Do you know whether such a file is available?

This is the standard public domain (3+ words) word list called RIWord, from NECTEC (www.nectec.or.th):

http://www.links.nectec.or.th/itech/download.html
-> ftp://www.links.nectec.or.th/pub/thaidb/riwords.txt.gz

Note: you need to filter out words containing a hyphen and words containing a space.

Can you tell me what you want it for? Word breaking? Spell checking? I may be able to help you in these areas.

See http://developer.thai.net/libinthai/ - an open-source word-break library
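The filtering mentioned in the note could look roughly like this in Python. It is only a sketch under assumptions: the file is taken to hold one entry per line, the TIS-620 encoding is a guess on my part, and both function names are made up for illustration.

```python
import gzip
from typing import Iterable, List


def filter_word_list(lines: Iterable[str]) -> List[str]:
    """Keep entries that contain neither a hyphen nor an internal space."""
    words = []
    for line in lines:
        word = line.strip()
        if not word:
            continue  # skip blank lines
        if "-" in word or " " in word:
            continue  # word-with-hyphen or word-with-space
        words.append(word)
    return words


def load_word_list(path: str, encoding: str = "tis_620") -> List[str]:
    """Read a gzip-compressed list; the default encoding is an assumption."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return filter_word_list(handle)


if __name__ == "__main__":
    sample = ["\u0e01\u0e32", "\u0e01\u0e32-\u0e01\u0e32", "\u0e01\u0e32 \u0e01\u0e32", ""]
    print(filter_word_list(sample))
```

Calling `load_word_list("riwords.txt.gz")` on a downloaded copy would then give the cleaned list, provided the encoding guess holds.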