On Mon, 28 Mar 2005 09:47:09 +0900 (JST), Kenichi Handa <[EMAIL PROTECTED]> wrote:
> To handle the regular expression "\\b" and "\\B" correctly
> for Thai, we need a bigger change in regex.c.  For the
> moment, I have no idea how to do that.

Current extensions to "word syntax", using `word-separating-categories'
etc., seem to do the correct thing with regexps.[*]  Perhaps some
extension to that mechanism would work.

For instance, what if entries in `word-separating-categories' could
have an optional predicate function?  In addition to the current
(CAT1 . CAT2) format, allow (CAT1 CAT2 PREDICATE-FUN), and only
consider the entry to match if PREDICATE-FUN (called with some
appropriate args) also returns true.

Then for a case like Thai, where you want to do more complicated tests
to establish word boundaries inside sequences of non-delimited text,
you could use a "degenerate" entry in `word-separating-categories'
with both CAT1 and CAT2 the same, but also with a predicate attached
to do the more complicated test.

I suppose that would slow down word matching when the predicate is
called, but it would only happen for text where that is appropriate.

-Miles

[*] I was surprised that this is true, and I don't understand why from
my quick look at regex.c :-/ ...  But my simple tests seem to show
that it really does work.  E.g., I can add '(?C . ?C) to
`word-separating-categories', and then a regexp search will suddenly
start considering every single kanji character as a standalone word.

-- 
Do not taunt Happy Fun Ball.
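
As a rough illustration of the (CAT1 CAT2 PREDICATE-FUN) idea above,
here is a minimal, self-contained C sketch of how such a lookup might
behave.  It is not taken from regex.c or any other Emacs source:
`struct sep_entry', `word_boundary_between' and `toy_break_pred' are
hypothetical names, category sets are modelled as plain strings of
category letters, and the predicate is only a toy stand-in for a real
Thai word-break test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate type: decide whether a word boundary is
   acceptable at offset POS of TEXT (between TEXT[POS-1] and TEXT[POS]).  */
typedef bool (*boundary_pred) (const char *text, size_t pos);

/* Hypothetical model of one entry: (CAT1 . CAT2) when PRED is NULL,
   (CAT1 CAT2 PREDICATE-FUN) otherwise.  */
struct sep_entry
{
  char cat1;           /* category required of the character before POS */
  char cat2;           /* category required of the character at POS */
  boundary_pred pred;  /* optional extra test; NULL means "always match" */
};

/* Toy stand-in for a real segmenter (assumption, for illustration only):
   allow a break at even offsets.  */
bool
toy_break_pred (const char *text, size_t pos)
{
  (void) text;
  return pos % 2 == 0;
}

/* Return true if some entry separates the two characters whose category
   sets are CATS1 and CATS2 (strings of category letters).  The predicate,
   if any, runs only after both categories have matched, so entries
   without a predicate cost the same as before.  */
bool
word_boundary_between (const struct sep_entry *entries, size_t n,
                       const char *cats1, const char *cats2,
                       const char *text, size_t pos)
{
  assert (entries != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    {
      const struct sep_entry *e = &entries[i];
      if (strchr (cats1, e->cat1) == NULL
          || strchr (cats2, e->cat2) == NULL)
        continue;               /* categories do not match this entry */
      if (e->pred == NULL || e->pred (text, pos))
        return true;            /* plain entry, or predicate agreed */
    }
  return false;
}
```

With the entries { 'C', 'C', NULL } and { 't', 't', toy_break_pred }
(the letter 't' is chosen here just for illustration), two adjacent
'C' characters always get a boundary, while two adjacent 't'
characters get one only where the predicate agrees.  The extra cost is
paid only for entries that carry a predicate.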