RE: Unicode Tokenizer problem with Registered Trademark Search

Steven A Rowe Wed, 02 Apr 2008 14:51:02 -0700

Hi Bruce,

On 04/02/2008 at 4:58 PM, [EMAIL PROTECTED] wrote:
> I am having a problem when searching for certain Unicode
> characters, such as the Registered Trademark. That's the
> Unicode character 00AE. It's also a problem searching for a
> Japanese Yen symbol (Unicode character 00A5).
> 
> I'm using the Lucene 2.0.0 jar file, and we used to use
> Lucene 1.4.2 jar file, where this used to work OK. But Lucene
> 2.0.0 doesn't work the same way.


I don't see anything that would have caused such a change - below is a colored 
side-by-side diff of StandardTokenizer.jj at revisions 150560 and 409716, 
corresponding to the lucene_1_4_2 and lucene_2_0_0 tags, respectively:

<http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_0_0/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj?r1=150560&r2=409716&diff_format=h>

(Note that the JavaCC-targeted StandardTokenizer.jj was replaced at release 
2.3.0 by the JFlex-targeted StandardTokenizerImpl.jflex for performance reasons - 
see <http://issues.apache.org/jira/browse/LUCENE-966>.)

> I see that the registered trademark is in the Lucene index
> file, so that's good. The problem comes when I try to search
> for these characters.
>
> I see that my query starts off OK, as this:
> 
> ( (Locale:en) AND ( productName:(Digital¥^95) ) )    (if you
> cannot see the Japanese Yen symbol, it comes directly after "Digital")
> 
> Note: the "^95" is just a boost factor, and is OK.
> 
> I'm using StandardAnalyzer and StandardTokenizer to create a
> new QueryParser , and after I call the "parse" method of the
> QueryParser, my query becomes this:
> 
>  +Locale:en +productName:digital^95.0
> 
> Notice that the Japanese Yen symbol is gone! I think it's
> because the StandardTokenizer.jj file doesn't handle this
> character, and so it throws it away.
> 
> Is there any way to use a different Analyzer and/or
> Tokenizer, rather than building my own?
> 
> And if I had created my Lucene indexes with the
> StandardAnalyzer, must I use the StandardAnalyzer and
> StandardTokenizer to search the index?

In order for the Yen and Registered Trademark symbols to appear in the index, 
you must have used a different analyzer for indexing than the one you're using 
for querying.  This can lead to problems, as you have discovered.
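
To illustrate the mismatch, here is a minimal, self-contained sketch.  It does 
not use Lucene at all: the two methods are toy stand-ins (my assumptions, not 
the real StandardTokenizer grammar) for an analyzer that keeps only letters and 
digits and one that splits only on whitespace.  The class and method names are 
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class AnalyzerMismatchSketch {

    // Toy stand-in for a StandardAnalyzer-like chain (an assumption, not
    // Lucene's actual grammar): keep runs of letters/digits, lowercase them,
    // and silently drop every other character, including U+00A5 and U+00AE.
    static List<String> lettersAndDigits(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    // Toy stand-in for a whitespace-only analyzer: split on whitespace and
    // keep symbols and case exactly as they appear in the input.
    static List<String> whitespaceOnly(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String productName = "Digital\u00A5";
        // Index side (symbol kept) versus query side (symbol dropped).
        System.out.println("index term(s): " + whitespaceOnly(productName));
        System.out.println("query term(s): " + lettersAndDigits(productName));
    }
}
```

In this sketch the index side stores the term with the Yen symbol attached, 
while the query side asks for plain "digital", so the two never meet.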

The short answer is: you should use the same analyzer.

The longer answer is that you should use "compatible" analyzers.  
"Compatibility" means that the terms produced by the query-time analyzer have 
corresponding index terms.  Of course, this condition is satisfied by using the 
same analyzer at both index- and query-time.  An example of compatible, but 
different, analyzers is index- or query-time synonym injection.
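
As a rough sketch of what "compatible" means for the synonym case, the toy code 
below (again plain Java, no Lucene) injects synonyms at index time only and 
then checks that every query-time term has a corresponding index term.  The 
synonym table, the class, and the method names are all hypothetical, and 
checking a single text against a single query is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class CompatibleAnalyzersSketch {

    // Hypothetical synonym table used only on the index side.
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("tv", Arrays.asList("television"));
    }

    // Query-time analysis: lowercase and split on whitespace, nothing else.
    static List<String> queryTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Index-time analysis: same as query time, plus injected synonyms.
    static List<String> indexTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : queryTerms(text)) {
            terms.add(t);
            terms.addAll(SYNONYMS.getOrDefault(t, Collections.<String>emptyList()));
        }
        return terms;
    }

    // True when every term the query analyzer produces exists among the
    // terms the index analyzer produced for the indexed text.
    static boolean compatible(String indexedText, String queryText) {
        return new HashSet<>(indexTerms(indexedText)).containsAll(queryTerms(queryText));
    }

    public static void main(String[] args) {
        System.out.println(compatible("Digital TV", "digital television"));
        System.out.println(compatible("Digital TV", "digital\u00A5"));
    }
}
```

The two analyzers differ, yet the first query still finds its terms because the 
index side is a superset; the second query fails because no index term carries 
the Yen symbol.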

I don't know why you weren't seeing this problem with Lucene 1.4.2, but is it 
possible that the 1.4.2-created index did *not* have these two symbols?  If 
that were true, then you would get the hits you're looking for, though you 
might get some others that you don't want.

Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
