[jira] Updated: (LUCENE-973) Token of "" returns in CJK

Steven Rowe (JIRA) Wed, 30 Jul 2008 11:40:22 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Steven Rowe updated LUCENE-973:
-------------------------------


Hi Koji,

The test class in your patch is a nice addition.

bq. There is no problem in CJKAnalyzer. 

The only reason that CJKAnalyzer doesn't have this problem is that the empty 
string is one of the stopwords it filters out from CJKTokenizer's output!
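
To illustrate the masking effect, here is a minimal, self-contained sketch. It does not use Lucene's classes: {{StopSetSketch}}, {{filter}}, and the sample stop words are hypothetical names and values chosen for illustration. The only assumption carried over from the issue is that the stop set contains the empty string, so an empty token produced upstream never reaches the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopSetSketch {
    // Hypothetical stop set; the relevant detail is that it contains "".
    static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("a", "the", ""));

    // Keeps only the tokens that are not in the stop set.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Simulated tokenizer output with an empty token in the middle.
        List<String> raw = Arrays.asList("ab", "", "c1");
        System.out.println("raw:      " + raw);
        System.out.println("filtered: " + filter(raw));
    }
}
```

Without such a filter in the chain, the empty token is passed through unchanged, which matches the report that the problem shows up when the tokenizer is used by itself.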

The following part of your patch appears to address a problem that you haven't 
covered in your comments. Is that so?  If it is separate from the 
empty-string issue, can you describe the effect of this change?

{code:java}
@@ -175,8 +175,9 @@
                             length = 0;
                             preIsTokened = false;
 
-                            break;
+                            continue;
                         } else {
+                            tokenType = "double";
                             break;
                         }
                     }
{code}


The other part of your patch reads:

{code:java}
@@ -236,8 +237,13 @@
             }
         }
 
-        return new Token(new String(buffer, 0, length), start, start + length,
+        String tokenString = new String(buffer, 0, length) ;
+        if( dataLen == -1 && "".equals(tokenString)) {
+          return null ;
+        } else {
+          return new Token(tokenString, start, start + length,
                          tokenType
                         );
+        }
{code}

Wouldn't it be simpler and clearer to test {{length}} for zero instead of 
constructing a String and comparing it with the empty string?

{code:java}
if (length > 0) {
  String tokenString = new String(buffer, 0, length);
  return new Token(tokenString, start, start + length, tokenType);
} else {
  return null;
}
{code}
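
The equivalence behind the emptiness test can be checked in isolation. The sketch below is independent of Lucene; {{EmptyCheckSketch}} and its method names are hypothetical. It only shows that {{new String(buffer, 0, length)}} is empty exactly when {{length}} is zero, so the String does not need to be built just to make that decision.

```java
class EmptyCheckSketch {
    // The check as written in the patch: build the String, then compare it with "".
    static boolean emptyByString(char[] buffer, int length) {
        return "".equals(new String(buffer, 0, length));
    }

    // The proposed check: look at the length only.
    static boolean emptyByLength(int length) {
        return length == 0;
    }

    public static void main(String[] args) {
        char[] buffer = {'a', 'b', 'c'};
        // Compare both checks for every valid length of the sample buffer.
        for (int length = 0; length <= buffer.length; length++) {
            if (emptyByString(buffer, length) != emptyByLength(length)) {
                throw new AssertionError("checks disagree at length " + length);
            }
        }
        System.out.println("both checks agree for lengths 0.." + buffer.length);
    }
}
```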


> Token of  "" returns in CJK
> ---------------------------
>
>                 Key: LUCENE-973
>                 URL: https://issues.apache.org/jira/browse/LUCENE-973
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.3
>            Reporter: Toru Matsuzawa
>         Attachments: CJKTokenizer20070807.patch, with-patch.jpg, 
> without-patch.jpg
>
>
> An empty string ("") is returned as a Token at the boundary between a 
> double-byte character and a single-byte character. 
> CJKAnalyzer is not affected. 
> The problem appears when CJKTokenizer is used on its own (for example, 
> with Solr).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


