I'm willing to include this patch in 1.3 final. Does anyone see problems with it or otherwise oppose it?
Doug
John McNally wrote:
I'd certainly like to see a resolution to the sigram/CJK question and proposal I posted a few days ago. It might not be a high-priority issue, but I think that, if there is agreement, it is a simple fix.
I'm sure I'm discussing stuff that is well known in this community, but I will give some background just in case. There are three main ways to create tokens out of text: by character, by multi-character sequence (n-gram), and by word. Words are generally considered the best, though for CJK languages using words means using a dictionary, since delimiters such as whitespace are not usually used, and that increases complexity quite a bit.
An n-gram index usually has better precision than a character-based index, but the index is much larger. There is a bigram analyzer posted as an enhancement in Bugzilla.
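To make the character versus n-gram distinction concrete, here is a minimal sketch in plain Java. It is not Lucene code and not the bigram analyzer from Bugzilla; the class name CjkNgrams and the method ngrams are made up for illustration, and it assumes a Java 8 or later runtime.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class CjkNgrams {
    // Split text into overlapping n-grams, counted in Unicode code points.
    static List<String> ngrams(String text, int n) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= cps.length; i++) {
            out.add(new String(cps, i, n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Four Han characters, written as escapes to avoid encoding trouble.
        String text = "\u4e2d\u6587\u68c0\u7d22";
        System.out.println(ngrams(text, 1)); // 4 single-character tokens
        System.out.println(ngrams(text, 2)); // 3 overlapping bigrams
    }
}
```

With n = 1 every character becomes a key, so there are few distinct keys and each has a long list. With n = 2 there are far more distinct keys, which is where the larger index comes from.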
A character-based index leads to long lists for each key but, that inefficiency aside, such indexes are easy to implement and have been shown to be useful for CJK; one can even use phrase matching to get word matches. There was a patch made which uses the term sigram, which I interpret to mean character-based indexing. It appears flawed, however. It treats all consecutive CJK characters as a single token, which, when the text contains no non-CJK characters, is the same as whole-document matching. As this is almost the same behavior that was available prior to the patch, I think I am right in thinking there is a bug.
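The difference between matching a run of CJK characters and matching a single one can be sketched with java.util.regex. This is only a stand-in for the JavaCC grammar, not the StandardTokenizer itself: the class name SigramSketch is made up, and the Han script class is assumed as a rough substitute for whatever ranges the grammar's CJK token actually covers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of Lucene.
class SigramSketch {
    // Stand-in for a rule of the form (<CJK>)+ : one token per run.
    static final Pattern RUN = Pattern.compile("\\p{IsHan}+");
    // Stand-in for a rule of the form (<CJK>) : one token per character.
    static final Pattern SINGLE = Pattern.compile("\\p{IsHan}");

    static List<String> tokens(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Four Han characters, the word "Lucene", then two Han characters.
        String text = "\u5168\u6587\u68c0\u7d22 Lucene \u4e2d\u6587";
        System.out.println(tokens(RUN, text));    // 2 tokens, one per run
        System.out.println(tokens(SINGLE, text)); // 6 tokens, one per character
    }
}
```

With the run pattern, a query for a word inside the first run would find nothing, because the only indexed key is the whole run. With the single-character pattern, the same word can be found as a phrase of adjacent characters.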
The patch could be small:
--- StandardTokenizer.jj-orig 2003-12-19 16:56:31.000000000 -0800
+++ StandardTokenizer.jj 2003-12-19 16:54:43.000000000 -0800
@@ -125,7 +125,7 @@
(<LETTER>|<DIGIT>)*
>
-| < SIGRAM: (<CJK>)+ >
+| < SIGRAM: (<CJK>) >
| < #ALPHA: (<LETTER>)+>
| < #LETTER: // unicode letters
[
I would think that removing SIGRAM and only using CJK as the token would be better, but I don't have a setup to test these changes.
Any chance this can be addressed?
john mcnally
On Fri, 2003-12-19 at 13:31, Doug Cutting wrote:
I'm thinking of making a 1.3 final release in the next few days.
Any objections?
Doug
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]