Re: New version of TR29:

2002-08-16 Thread Samphan Raruenrom

Mark Davis wrote:
> There is a new version of Unicode Technical Report #29: Text Boundaries on
> <http://www.unicode.org/reports/tr29/>, covering grapheme-cluster, word and
> sentence boundaries. There are significant modifications to this version;
> for a summary, see <http://www.unicode.org/reports/tr29/#Modifications>.
> This is a draft version, not a final version. There are a number of open
> issues remaining. Feedback is welcome
> Feedback that is received before the UTC meeting (starting August 20) can be
> made available for the discussion of TR29 at that meeting.

FYI:
There is an open issue regarding grapheme-cluster boundaries in Thai.

* SARA AM as an Other_Grapheme_Extend?

Should "0E33;THAI CHARACTER SARA AM" be a Grapheme_Extend
character or not?

By Unicode definition, SARA AM is an Lo, not a combining
character. But many Thai applications (MS Office, Windows,
OpenOffice.org) treat SARA AM like a combining character (unlike SARA
AA), i.e. the cursor always jumps over it. Whether this is right or not is
controversial, but the fact is that Windows users are used to it.

My question is: if it is preferable for Thai to treat
SARA AM as part of the previous grapheme cluster, is it possible for
the UTC to consider adding SARA AM as an Other_Grapheme_Extend?
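
To make the question concrete, here is a minimal Python sketch of the two behaviours. It is not the TR29 algorithm: it only attaches nonspacing marks (category Mn) to the preceding character, and the helper name clusters and its extra_extend parameter are made up for illustration.

```python
import unicodedata

SARA_AM = "\u0e33"

def clusters(text, extra_extend=frozenset()):
    """Very simplified clustering: a character joins the previous cluster
    if it is a nonspacing mark (Mn) or is listed in extra_extend."""
    out = []
    for ch in text:
        extends = unicodedata.category(ch) == "Mn" or ch in extra_extend
        if out and extends:
            out[-1] += ch
        else:
            out.append(ch)
    return out

# NO NU + MAI THO + SARA AM (the Thai word for "water")
word = "\u0e19\u0e49\u0e33"

as_lo = clusters(word)                    # SARA AM treated as a plain Lo
as_extend = clusters(word, {SARA_AM})     # SARA AM treated as extending

print(len(as_lo), len(as_extend))
```

With SARA AM left as an ordinary letter the cursor would stop twice in this word; with SARA AM treated as extending it would stop once, which is the behaviour Windows users are used to.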


---
I also notice that Grapheme_Link has been removed from the grapheme-cluster
definition. This is appropriate for Thai because PHINTHU should not
cause two grapheme clusters to be linked together.

-- 
Feel free to disclose the contents of this message.

Regards,
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html





Re: logical order (and input method)

2002-07-27 Thread Samphan Raruenrom

Kenneth Whistler wrote:
> The "Indic model" is largely based on an abstraction of 
> the phonology of the language the script is used to write
> The "Thai model" is a typewriter-derived variant of the
> Indic model that rules out reordrant or surroundrant characters,
> because of the limitations of typewriter technology

Just out of curiosity: I'm Thai and used to the Thai model, so I'm
wondering how other Brahmi-derived scripts are
1) typed on a typewriter
2) typed on a computer keyboard
3) hand-written

That is, do they all use the same (logical) order?






logical order (and Thai)

2002-07-27 Thread Samphan Raruenrom

Kenneth Whistler wrote:
> Ummm. Logical order, visual order, aural order, phonemic order,
> linear order... We are in danger of losing track of the ground we
> stand on.

Totally agree.

> Logical order versus visual order, in the Unicode Standard,
> refers to the relationship between backing store order and
> display order. The main issue is for bidirectional text.

Fortunately, this is stated clearly enough in the Unicode book.

> There is a separate issue which has to do with alternative
> models of Brahmi-derived scripts.
> The "Indic model" ...
> The "Thai model" ...
> Note, however, that *both* of these models inherently imply non-linear
> mappings at some level. In the Indic model, the mapping from
> phonology to backing store is straightforward, but the mapping
> from backing store to display (i.e., the "rendering") will
> have local direction reversals and/or 1-2 character-to-glyph
> mappings, in the case of reordrant or surroundrant vowels.
> The Thai model displaces the mapping complexity to the
> mapping from phonology to backing store, while simplifying the
> rendering.

But this is not. It would be easier to avoid these confusions if
the above description about "non-linear mapping" of Brahmi-derived
scripts was written clearly in Chapter 2 of the book in the
section about logical order.

> Given this picture, it should now be easier to see why Thai
> rendering is easier than Devanagari, but Thai sorting
> (which runs afoul of the mismatch between phonology and
> backing store order) is more problematical. It is simply
> a tradeoff of which level of processing gets the complexity.

Does this mean that there's nothing illogical or less preferred
about the Thai model?
If so, please also consider the following question (slightly
rephrased):


-------- Original Message --------
Subject: Logical_Order_Exception actually means Phonetic_Order_Exception ?
Date: Sat, 01 Jun 2002 12:00:09 +0700
From: Samphan Raruenrom <[EMAIL PROTECTED]>
Organization: NECTEC
To: Unicode Public List <[EMAIL PROTECTED]>
CC: Thai IT Standards Newsgroup 
<[EMAIL PROTECTED]>,   Virach Sornlertlamvanich 
<[EMAIL PROTECTED]>,   Trin Tansetthi <[EMAIL PROTECTED]>, 
Suwit Srivilairith <[EMAIL PROTECTED]>


It's said (below) that ALL scripts in Unicode are stored in 'logical
order'.  And for the most part, logical order corresponds to 'phonetic
order'. And the only exceptions are Thai and Lao.
Do you think that Logical_Order_Exception should actually be called
Phonetic_Order_Exception?


8<- References --->8
The definition of this newly introduced property in Unicode 3.2 :-
 http://www.unicode.org/unicode/reports/tr28/#database

Logical_Order_Exception:
There are a small number of characters (in the Thai and Lao scripts)
that do not use logical order. These characters require special 
handling in most processing.

The definition of Logical Order :-
 The Unicode Standard 3.0 : Section 2.2 Unicode Design Principles

Logical Order:
For "ALL" scripts, Unicode text is stored in 'logical order' in the
memory representation, roughly corresponding to the order in which
text is typed in via the keyboard.
...
For the most part, logical order corresponds to 'phonetic order'.
The only current exceptions are the Thai and Lao scripts, which
employ visual ordering; in these two scripts, users traditionally
type in visual order rather than phonetic order.

The following are the only Logical_Order_Exception characters in Unicode 3.2 :-
http://www.unicode.org/Public/3.2-Update/PropList-3.2.0.txt

0E40..0E44; Logical_Order_Exception # Lo   [5] THAI CHARACTER SARA E
.. THAI CHARACTER SARA AI MAIMALAI
0EC0..0EC4; Logical_Order_Exception # Lo   [5] LAO VOWEL SIGN E
.. LAO VOWEL SIGN AI
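
As an illustration of what "special handling" can mean, here is a rough Python sketch built on the two ranges quoted above. The names LOE_RANGES, is_logical_order_exception and to_phonetic_order are hypothetical, and the swap is a deliberate simplification: it moves each leading vowel behind exactly one following character and ignores consonant clusters.

```python
# Ranges taken from the PropList-3.2.0 lines quoted above.
LOE_RANGES = [(0x0E40, 0x0E44), (0x0EC0, 0x0EC4)]

def is_logical_order_exception(ch):
    """True if ch is one of the Thai/Lao vowels stored in visual order."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LOE_RANGES)

def to_phonetic_order(text):
    """Swap each leading vowel with the single character after it
    (simplified: real words may need the vowel moved past a cluster)."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if is_logical_order_exception(chars[i]):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)

# SARA AI MAIMALAI + THO THAHAN + YO YAK, stored in visual order
stored = "\u0e44\u0e17\u0e22"
phonetic = to_phonetic_order(stored)
print([f"{ord(c):04X}" for c in phonetic])
```

A reordering of this kind is the sort of step a sorting routine might apply before comparing strings, since the backing store does not follow phonetic order for these characters.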

-- 
Feel free to forward or quote to any individual or public.







Re: Is UniCode's Thai character representation is acceptable by TISI or not?

2002-07-16 Thread Samphan Raruenrom

Dear Mark,

Thanks for the informative reply. :)

Mark Davis wrote:
 > Some comments below.
 > - Original Message -
 > From: "Samphan Raruenrom" <[EMAIL PROTECTED]>
 > To: "Asmus Freytag" <[EMAIL PROTECTED]>
 > Cc: "Sreedhar M" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; "Rick McGowan" 
 ><[EMAIL PROTECTED]>
 > Sent: Tuesday, July 16, 2002 07:22
 > Subject: Re: Is UniCode's Thai character representation is acceptable by TISI or not?
 >>Asmus Freytag wrote:
 >>>At 12:06 PM 7/16/02 +0700, Samphan Raruenrom wrote:
 >>Problems from Unicode properties
 >>- error in combining class of vowel signs make normalization worthless
 >>   in some cases. This is important if you want to compare strings.
 > Meaning: the normalized forms of two strings are not equal in cases
 > where Thais would consider them equal, right?

Definitely.

 >>- decomposition of SARA AM add more problem to normalization
 > I don't recall seeing that note; I'll look forward to your report.

Please see my discussion with khun Peter Constable quoted below.

 >>- some properties make grapheme clusters for Thai
 >>   incompatible with what Thais expect, e.g. PHINTHU as
 >>   virama, SARA AM not a combining character
 > In the last UTC, action was taken that is not yet in the draft TR on
 > boundaries. In particular, this affects Thai.

Glad to hear that :)

 >>Inaccuracy in the Unicode book
 >>- backspace 'always' use the same (grapheme cluster) character boundary
 >>   as Del and left/right arrow. Actually Thai use backspace to delete single
 >>   character not the whole cluster. So character boundary for backspace should
  >>   be locale specific.
 > This text will be overriden by the TR.

Great!

 >>- in Thai, zero width space is said to be able to expand in full-justified
 >>   paragraph. Actually it is always zero width.
 > There may be some misunderstanding here. What is meant is: if you had
 > the sequence ABCD, and between the B and the C was a zero-width space,
 > AND you were inter-character spacing for justification, you would not
 > expect to see:
 > A  BC  D
 > Instead, you would expect to see
 > ABCD
 > That is, the zero-width space does not prevent the characters from
 > using inter-character spacing.

Sorry for misunderstanding that. A short explanation/example like this in
the book (chapter 9) would help a lot.

 >>These are things you have to know after learning the Unicode standard
 >>if you plan to work with the Thai language, to 'code around' the problems
 >>and make the result acceptable to Thai people.
 >>I plan to write a formal report on the issue, not to change the standard,
 >>but to note what is wrong and what has to be coded around. So people
 >>who like to work with the Thai language (like you) will know the right thing
 >>to do and not repeat the same mistakes as some software.



-------- Original Message --------
Subject: Re: Fixed position combining classes
Date: Thu, 06 Jun 2002 21:53:35 +0700
From: Samphan Raruenrom <[EMAIL PROTECTED]>
Organization: NECTEC
To: [EMAIL PROTECTED]
CC: Arthit Suriyawongkul <[EMAIL PROTECTED]>,Suwit Srivilairith 
<[EMAIL PROTECTED]>, Thai IT Standards Newsgroup 
<[EMAIL PROTECTED]>,   Trin Tansetthi <[EMAIL PROTECTED]>,   
 Unicode Public List <[EMAIL PROTECTED]>,  Virach Sornlertlamvanich 
<[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]>

[EMAIL PROTECTED] wrote:
 > Now, the problem with the sequences above is that they are visually
 > indistinct, meaning that they could not possibly be used by users for a
 > semantically-relevant distinction. From the user's perspective, they are
 > identical. Moreover, it would fit a user's expectations to have string
 > comparisons to equate them (e.g. a search for < 0e35, 0e39 > should find a
 > match if the data contains < 0e39, 0e35 >). They are both
 > canonically-ordered sequences, however, since U+0E35 has a combining class
 > of 0. The result is that string comparisons that rely on normalisation into
 > any one of the existing Unicode normalisation forms (NFD, NFC, NFKD, NFKC)
 > will fail to consider these as equal.

Let's talk about something that really happens in Thai.

1)

0E01;THAI CHARACTER KO KAI;Lo;0
0E38;THAI CHARACTER SARA U;Mn;103
0E4D;THAI CHARACTER NIKHAHIT;Mn;0

The sequences (which happen in Pali transcription)

(a) KO KAI + SARA U + NIKHAHIT
(b) KO KAI + NIKHAHIT + SARA U

They look the same but do not compare equal, because the combining class
of NIKHAHIT happens to be 0, so both are already in normalized form.

2)

0E32;THAI CHARACTER SARA AA;Lo;0
0E48;THAI CHARACTER MAI EK;Mn;107
0E33;THAI CHARACTER

Re: Is UniCode's Thai character representation is acceptable by TISI or not?

2002-07-16 Thread Samphan Raruenrom

Asmus Freytag wrote:
> At 12:06 PM 7/16/02 +0700, Samphan Raruenrom wrote:
>> There are some mistakes in Unicode char.
>> properties for Thai char. and you have to "code around" them.
> And the mistakes are?

I've discussed a few of them here on this list. I'll write
a more formal report on the issue later. Here are some headings:

Problems from Unicode properties
- errors in the combining classes of vowel signs make normalization worthless
   in some cases. This is important if you want to compare strings.
- the decomposition of SARA AM adds more problems to normalization
- some properties make grapheme clusters for Thai
   incompatible with what Thais expect, e.g. PHINTHU as
   virama, SARA AM not a combining character

Inaccuracy in the Unicode book
- backspace 'always' uses the same (grapheme cluster) character boundary
   as Del and left/right arrow. Actually Thais use backspace to delete a single
   character, not the whole cluster. So the character boundary for backspace
   should be locale-specific.
- in Thai, zero width space is said to be able to expand in a full-justified
   paragraph. Actually it is always zero width.

These are things you have to know after learning the Unicode standard
if you plan to work with the Thai language, to 'code around' the problems
and make the result acceptable to Thai people.
I plan to write a formal report on the issue, not to change the standard,
but to note what is wrong and what has to be coded around. So people
who like to work with the Thai language (like you) will know the right thing
to do and not repeat the same mistakes as some software.






Re: Is UniCode's Thai character representation is acceptable by TISI or not?

2002-07-15 Thread Samphan Raruenrom

Sreedhar M wrote:
> Thank U for Your kind response. Please let me know whether
> Unicode's Thai character representation is acceptable by TISI or not? It is
> very essential to our project.

Yes. TISI took part in defining the representation of Thai characters in ISO 10646
(and, indirectly, Unicode). Unicode has a backward-compatibility goal, so
it takes the whole Thai block of TIS-620 over directly :-
unicode = tis620 - 0xa0 + 0x0e00
This is perfect and eases the transition of code. We can modify our code
just a little bit to make it work with both TIS-620 and Unicode (see
libinthai, a Thai word-break library, as an example).
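
The mapping above can be sketched in a few lines of Python. This is only an illustration, not how libinthai does it: the function name tis620_to_unicode is made up, and the byte ranges 0xA1..0xDA and 0xDF..0xFB are my assumption about which TIS-620 positions are assigned.

```python
def tis620_to_unicode(data: bytes) -> str:
    """Decode TIS-620 bytes with the offset rule
    unicode = tis620 - 0xa0 + 0x0e00 for the Thai half."""
    out = []
    for b in data:
        if b < 0x80:
            # ASCII half is unchanged
            out.append(chr(b))
        elif 0xA1 <= b <= 0xDA or 0xDF <= b <= 0xFB:
            # Thai half: fixed offset into the Unicode Thai block
            out.append(chr(b - 0xA0 + 0x0E00))
        else:
            raise ValueError(f"byte 0x{b:02X} is not assigned (assumed range)")
    return "".join(out)

# 0xA1 -> U+0E01 KO KAI, 0xD2 -> U+0E32 SARA AA
decoded = tis620_to_unicode(b"A\xa1\xd2")
print([f"{ord(c):04X}" for c in decoded])
```

Because the offset is constant, code that indexes tables by TIS-620 byte value can usually be adapted by subtracting 0x0E00 instead of 0xA0.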

However, there are still some problems that go beyond the assignment of code
points, namely the char. properties. There are some mistakes in the Unicode char.
properties for Thai char. and you have to "code around" them.





Re: What is TISI character Code?

2002-07-12 Thread Samphan Raruenrom

Sreedhar.M wrote:
> I would like to make my application Thai language compatible. In
> that way I heard the term TISI character code. That's why I want to know
> about the TISI character code. Please let me know if anybody has an idea
> regarding this.

TISI is the name of the standards organization in Thailand, the Thai Industrial
Standards Institute. The character set name is TIS-620. It's an 8-bit character
set which is an extension of 7-bit ASCII for Thai characters. See :-

http://www.nectec.or.th/it-standards/






Re: Fixed position combining classes

2002-06-06 Thread Samphan Raruenrom

[EMAIL PROTECTED] wrote:
> On 06/02/2002 05:40:05 AM Samphan Raruenrom wrote:
>>>My opinion is that they should have been simplified, but that setting the
>>>bulk of them to 0 was a mistake and creates some significant problems
>>>(which go a step beyond the questions you raise here).
>>Can you elaborate on this?
> Given the characters
> : 0E35;THAI CHARACTER SARA II;Mn;0
> : 0E39;THAI CHARACTER SARA UU;Mn;103
> consider the sequences
> < 0e35, 0e39 > vs. < 0e39, 0e35 >
> I'm guessing your first reaction will be to say that these cannot co-occur.

No, not at all :) I have already learned from you to be more open-minded
about this Unicode kind of thing.

> That is true for the Thai language, but may not be true for other languages
> written with Thai script.

I've read a book on the history of Thai characters and found that many
vowels changed position through history. So this issue is more
understandable to me now.

> Now, the problem with the sequences above is that they are visually
> indistinct, meaning that they could not possibly be used by users for a
> semantically-relevant distinction. From the user's perspective, they are
> identical. Moreover, it would fit a user's expectations to have string
> comparisons to equate them (e.g. a search for < 0e35, 0e39 > should find a
> match if the data contains < 0e39, 0e35 >). They are both
> canonically-ordered sequences, however, since U+0E35 has a combining class
> of 0. The result is that string comparisons that rely on normalisation into
> any one of the existing Unicode normalisation forms (NFD, NFC, NFKD, NFKC)
> will fail to consider these as equal.

Let's talk about something that really happens in Thai.

1)

0E01;THAI CHARACTER KO KAI;Lo;0
0E38;THAI CHARACTER SARA U;Mn;103
0E4D;THAI CHARACTER NIKHAHIT;Mn;0

The sequences (which happen in Pali transcription)

(a) KO KAI + SARA U + NIKHAHIT
(b) KO KAI + NIKHAHIT + SARA U

They look the same but do not compare equal, because the combining class
of NIKHAHIT happens to be 0, so both are already in normalized form.
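
This can be checked with Python's standard unicodedata module. The snippet below is only a sketch; the variable names are mine, and the result depends on the Unicode data shipped with the Python version in use.

```python
import unicodedata

KO_KAI, SARA_U, NIKHAHIT = "\u0e01", "\u0e38", "\u0e4d"

seq_a = KO_KAI + SARA_U + NIKHAHIT   # sequence (a)
seq_b = KO_KAI + NIKHAHIT + SARA_U   # sequence (b)

# Canonical combining class of each character in (a)
classes = [unicodedata.combining(c) for c in seq_a]

# Does any normalization form make (a) and (b) compare equal?
equal_after = {
    form: unicodedata.normalize(form, seq_a) == unicodedata.normalize(form, seq_b)
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}

print(classes)
print(equal_after)
```

Since NIKHAHIT has class 0, it acts as a barrier for canonical reordering, so I would expect none of the four forms to equate the two sequences.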

2)

0E32;THAI CHARACTER SARA AA;Lo;0
0E48;THAI CHARACTER MAI EK;Mn;107
0E33;THAI CHARACTER SARA AM;Lo;0;L; "NIKHAHIT" "SARA AA"

There are two ways to represent the word KO KAI + MAI EK + SARA AM

(a) KO KAI + MAI EK + SARA AM
(b) KO KAI + NIKHAHIT + MAI EK + SARA AA

(b) must be in this sequence to get the intended look for
the word (not that this is the valid sequence for Thai/WTT).
That is, the mai-ek is on top of the nikhahit.

The problem is with the NFKD/NFKC of (a), which is

(c) KO KAI + MAI EK + NIKHAHIT + SARA AA

which will be rendered with the nikhahit on top of the mai-ek.
This is not the same as (a), and is not the intended look.
So the string changes its shape after
normalization. Is this a violation of any principle?

The problem also comes from the fact that the combining class of
NIKHAHIT is 0, which makes reordering of (c) impossible.
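
A small Python check of the NFKD result, again using the standard unicodedata module (the variable names are only for illustration, and the outcome depends on the Unicode data bundled with Python):

```python
import unicodedata

# Sequence (a): KO KAI + MAI EK + SARA AM
word_a = "\u0e01\u0e48\u0e33"

# Compatibility decomposition splits SARA AM into NIKHAHIT + SARA AA
nfkd = unicodedata.normalize("NFKD", word_a)

for ch in nfkd:
    print(f"{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
```

MAI EK (class 107) stays in front of NIKHAHIT (class 0) because a class-0 character is never moved by canonical reordering, which gives sequence (c) rather than (b).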






Indic scripts, visual-order vs phonetic-order

2002-06-05 Thread Samphan Raruenrom

Hello,

I'm wondering about the practice of using visual order vs phonetic order
in Indic writing on a typewriter vs on a computer vs by hand. Are they
all the same?

I have also heard that there are two input-method styles for Indic,
visual-order and phonetic-order. Is that true? And which is more popular?






Re: Thai character names

2002-06-05 Thread Samphan Raruenrom

[EMAIL PROTECTED] wrote:
 > Another interesting point is that two of these four letters are now
 > considered obsolete: kho khuat and kho khon. I have heard an explanation --
 > but don't know if it is true -- that the King decided to deprecate them
 > when typewriters were being adapted for Thai because there were two too
 > many characters that could be fit onto the limitations of the imported
 > mechanisms.

I heard that before, from a source related to Thai IT
standardization. But I've recently found that this may
not be true. A book on the history of Thai characters says that
on 29-May-1942 the prime minister, Por. Pi-Boon-Song-Kalm, removed
13 consonants and 5 vowels. After his government, people went
back to the old system, but the two characters kho khuat and
kho khon never came back.

IMO, this is more likely what actually happened, because the Thai
typewriters have some keys to spare, and these are assigned to other
things such as the combination of a tone mark and a vowel.
So I think that by the time typewriters were being adapted to
Thai, those two letters may already have been lost. The current
keyboard standard adds those two letters and more by removing keys
that are considered redundant.







Re: Fixed position combining classes (Was: Combining class for Thai characters)

2002-06-02 Thread Samphan Raruenrom

Hi :)

Thank you for the invaluable reply, and sorry for my confusing English.
I'll try to be as clear as possible in the future. I'm not good at
English, especially at using the appropriate level (polite/aggressive)
of language for a particular meaning.
I'm learning about Unicode and love it very much. The problem is that
I only have experience with processing Thai. So all of my comments
are actually questions. Please add "correct me if I'm wrong" to all of
them.
I'll throw in related data from the Unicode website/book to make things
clear for the others in the discussion, whom you can see in the Ccs.
Please use Reply All so everyone will get it.


[EMAIL PROTECTED] wrote:
> On 05/21/2002 10:07:32 AM Samphan Raruenrom wrote:
>>Why do the above-attached vowel signs/marks all have combining class 0?
> I'm not positive on the history, but here's my take: As you mention, there 
> is a sequencing constraint in WTT. In an earlier version of the Unicode 
> standard (prior to 2.1) all of the Thai characters of category Mn had 
> fixed-position classes. I'm guessing that that was influenced by a notion 
> of there needing to be a specific order, as in WTT. 

This is what I've guessed too.

 >>So (correct me if I'm wrong) the notion of invalid sequence in Unicode
 >>is script-specific.
 > Yes, but be careful of misinterpreting combining classes as saying
 > anything about what is or isn't a valid sequence -- they say
 > absolutely nothing in that regard.

I see. I misunderstood that.

 > It didn't really accomplish anything to have all the different fixed
 > position classes, though. If anything, it created some complications,
 > which I won't elaborate on.

Your answer led me to version 2.0.14 of UnicodeData, quoted below.

UnicodeData-2.0.14.txt
: 0E31;THAI CHARACTER MAI HAN-AKAT;Mn;98
: 0E34;THAI CHARACTER SARA I;Mn;99
: 0E35;THAI CHARACTER SARA II;Mn;100
: 0E36;THAI CHARACTER SARA UE;Mn;101
: 0E37;THAI CHARACTER SARA UEE;Mn;102
: 0E38;THAI CHARACTER SARA U;Mn;103
: 0E39;THAI CHARACTER SARA UU;Mn;104
: 0E3A;THAI CHARACTER PHINTHU;Mn;105
: 0E47;THAI CHARACTER MAITAIKHU;Mn;106
: 0E48;THAI CHARACTER MAI EK;Mn;107
: 0E49;THAI CHARACTER MAI THO;Mn;108
: 0E4A;THAI CHARACTER MAI TRI;Mn;109
: 0E4B;THAI CHARACTER MAI CHATTAWA;Mn;110
: 0E4C;THAI CHARACTER THANTHAKHAT;Mn;111
: 0E4D;THAI CHARACTER NIKHAHIT;Mn;112
: 0E4E;THAI CHARACTER YAMAKKAN;Mn;128

I agree that they should be simplified. All of the Mn are simply
assigned distinct increasing values (note that none is 0).

> At any rate, between 2.0 and 3.0, a lot of fixed-position 
> classes, both for Thai and for other scripts, were simplified. In so 
> doing, many were set to 0.

http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html#Modification History
: Unicode 2.1.8
: Changes to combining class values. Most Indic fixed position class
: non-spacing marks were changed to combining class 0. This fixes some
: inconsistencies in how canonical reordering would apply to Indic
: scripts, including Tibetan. Indic interacting top/bottom fixed position
: classes were merged into single (non-zero) classes as part of this
: change. Tibetan subjoined consonants are changed from combining class 6
: to combining class 0. Thai pinthu (U+0E3A) moved to combining class 9.
: Moved two Devanagari stress marks into generic above and below combining
: classes (U+0951, U+0952).

Let's talk about the idea behind combining classes. From "The Unicode
Standard 3.0" and information from you, my impression is that:
(1) The reason for having combining classes is that there are different
possible ways to encode the same character. The same character must always
compare equal no matter how it is encoded, whether with a precomposed
character or through composition.
(2) The criterion for assigning combining classes is that the string
before and after normalization must be rendered the same. Text that
looks the same must always compare equal, regardless of the order of
(non-interacting) marks in the memory representation. For example,
BASE + ABOVE_MARK + BELOW_MARK = BASE + BELOW_MARK + ABOVE_MARK
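
As a small check of the equation in (2), here is a Python sketch using the standard unicodedata module. I chose Latin marks for the example (COMBINING ACUTE ACCENT above, COMBINING DOT BELOW below); that choice and the variable names are mine, not from the standard's text.

```python
import unicodedata

ABOVE_MARK = "\u0301"   # COMBINING ACUTE ACCENT
BELOW_MARK = "\u0323"   # COMBINING DOT BELOW

s1 = "a" + ABOVE_MARK + BELOW_MARK
s2 = "a" + BELOW_MARK + ABOVE_MARK

# The two marks have different non-zero classes, so canonical
# reordering sorts them into one fixed order.
mark_classes = (unicodedata.combining(ABOVE_MARK), unicodedata.combining(BELOW_MARK))

n1 = unicodedata.normalize("NFD", s1)
n2 = unicodedata.normalize("NFD", s2)

print(mark_classes, n1 == n2)
```

The raw strings differ, but after normalization they are identical, with the lower-class mark first.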

At least for Indic (which includes Thai), the criteria before 2.1
seemed to ensure just (1), entirely disregarding typographically
interacting marks. This could be accomplished without combining classes at
all: simply sorting the marks by code point would do.
To ensure (2), interacting marks must be assigned the same (non-zero)
combining class, as said in the modification history (requoted).

   Note: Unlike other classes, the relation between the different
fixed position classes is not clear. All I know is that
classes 10..199 are called fixed position classes.
I can't find any detail on that. Do you have any?

: Indic interacting top/bottom fixed position classes were merged into
: single (*non_zero*) classes as par

Logical_Order_Exception actually means Phonetic_Order_Exception ?

2002-05-31 Thread Samphan Raruenrom

8<->8
The definition of this newly introduced property in Unicode 3.2 :-
 http://www.unicode.org/unicode/reports/tr28/#database

Logical_Order_Exception:
There are a small number of characters (in the Thai and Lao scripts)
that do not use logical order. These characters require special handling
in most processing.

The definition of Logical Order :-
 The Unicode Standard 3.0 : Section 2.2 Unicode Design Principles

Logical Order:
For "ALL" scripts, Unicode text is stored in 'logical order' in the
memory representation, roughly corresponding to the order in which
text is typed in via the keyboard.
...
For the most part, logical order corresponds to 'phonetic order'.
The only current exceptions are the Thai and Lao scripts, which
employ visual ordering; in these two scripts, users traditionally
type in visual order rather than phonetic order.

8<->8


ALL scripts in Unicode are stored in 'logical order'.  For the most
part, logical order corresponds to 'phonetic order'. The only
exceptions are Thai and Lao.
Do you think that Logical_Order_Exception should actually be called
Phonetic_Order_Exception?

8<- References --->8

The following are the only Logical_Order_Exception characters in Unicode 3.2 :-
http://www.unicode.org/Public/3.2-Update/PropList-3.2.0.txt

0E40..0E44; Logical_Order_Exception # Lo   [5] THAI CHARACTER SARA E
.. THAI CHARACTER SARA AI MAIMALAI
0EC0..0EC4; Logical_Order_Exception # Lo   [5] LAO VOWEL SIGN E
.. LAO VOWEL SIGN AI






Combining class for Thai characters

2002-05-21 Thread Samphan Raruenrom
it;;;
0E4E;THAI CHARACTER YAMAKKAN;Mn;0;NSM;N;THAI YAMAKKAN
0E4F;THAI CHARACTER FONGMAN;Po;0;L;N;THAI FONGMAN
  >88<






Re: Thai word list

2002-04-17 Thread Samphan Raruenrom

Werner LEMBERG wrote:
> I'm searching a large word list for Thai which is freely available,
> i.e., either under a license similar to GPL (resp. compatible to the
> GPL) or in the public domain.
> Do you know whether such a file is available?

This is the standard public domain (3+ words) word list
called RIWord from NECTEC (www.nectec.or.th)

http://www.links.nectec.or.th/itech/download.html
-> ftp://www.links.nectec.or.th/pub/thaidb/riwords.txt.gz

Note.
You need to filter out word-with-hyphen and word-with-space.

Can you tell me what you want it for? Word-breaking? Spell-checking?
I may be able to help you in these areas. See
http://developer.thai.net/libinthai/ - an open-source word-break library