Re: announcing thaiword.el?

Miles Bader Tue, 29 Mar 2005 00:36:14 -0800

On Mon, 28 Mar 2005 09:47:09 +0900 (JST), Kenichi Handa <[EMAIL PROTECTED]> 
wrote:
> To handle the regular expression "\\b" and "\\B" correctly
> for Thai, we need a bigger change in regex.c.  For the
> moment, I have no idea how to do that.


Current extensions to "word syntax", using `word-separating-categories'
etc., seem to do the correct thing with regexps.[*]  Perhaps some
extension to that mechanism would work.

For instance, what if entries in `word-separating-categories' could have an
optional predicate function -- in addition to the current (CAT1 . CAT2)
format, allow (CAT1 CAT2 PREDICATE-FUN), and only consider the entry to
match if PREDICATE-FUN (called with some appropriate args) also returns true?

Then for a case like Thai, where you want to do more complicated tests
to establish word boundaries inside sequences of non-delimited text,
one could use a "degenerate" entry in `word-separating-categories' with
CAT1 and CAT2 the same, but with a predicate attached to do the more
complicated test.  I suppose that would slow down word matching whenever
the predicate is called, but that would only happen for text where such
an entry applies.
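
To make the idea concrete, here is a rough sketch in Lisp of how a single
entry might be interpreted.  It is only a model: Emacs does not accept the
three-element form, the real test would live in C, and the `my-' names and
the choice of passing a buffer position to the predicate are assumptions
made for illustration.

```elisp
;; Hypothetical model of the proposed entry format; not existing Emacs
;; behavior.  All `my-' names are invented for this sketch.

(defun my-char-has-category-p (ch cat)
  "Return non-nil if character CH belongs to category CAT."
  ;; A category set is a bool-vector indexed by category mnemonic.
  (aref (char-category-set ch) cat))

(defun my-entry-separates-p (entry c1 c2 pos)
  "Return non-nil if ENTRY puts a word boundary between C1 and C2.
ENTRY is either the current form (CAT1 . CAT2) or the proposed form
\(CAT1 CAT2 PREDICATE-FUN).  POS is the buffer position between the
two characters; passing it to the predicate is an assumption."
  (let* ((cat1 (car entry))
         (rest (cdr entry))
         ;; (CAT1 . CAT2) has an atom in its cdr; the proposed
         ;; form has the list (CAT2 PREDICATE-FUN) there.
         (cat2 (if (consp rest) (car rest) rest))
         (pred (and (consp rest) (cadr rest))))
    (and (my-char-has-category-p c1 cat1)
         (my-char-has-category-p c2 cat2)
         ;; Old-style entries have no predicate and match as before.
         (or (null pred) (funcall pred pos)))))
```

With this shape, existing (CAT1 . CAT2) entries keep their meaning, and
the predicate is only called once both category tests have passed, so
the extra cost is limited to text the entry is meant for.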

-Miles

[*] I was surprised that this is true, and I don't understand why from
    my quick look at regex.c :-/ ... But my simple tests seem to show
    that it does really work.  E.g., I can add '(?C . ?C) to
    `word-separating-categories', and then a regexp search will suddenly
    start considering every single kanji character as a standalone word.
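
    The test was roughly along these lines.  This is a sketch, not the
    exact code I ran; `my-count-single-kanji-words' is a made-up helper
    name.

```elisp
;; Declare a word boundary between any two adjacent characters that
;; are both in category ?C.
(add-to-list 'word-separating-categories '(?C . ?C))

(defun my-count-single-kanji-words (string)
  "Count places in STRING where one ?C character forms a whole word."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (let ((count 0))
      ;; \< and \> are word start and end; \cC matches one character
      ;; in category ?C.
      (while (re-search-forward "\\<\\cC\\>" nil t)
        (setq count (1+ count)))
      count)))
```

    Calling the helper on a run of kanji before and after adding the
    entry should show the difference in how the run is split.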
-- 
Do not taunt Happy Fun Ball.


_______________________________________________
Emacs-devel mailing list
Emacs-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-devel
