Hi Bruce,

On 04/02/2008 at 4:58 PM, [EMAIL PROTECTED] wrote:
> I am having a problem when searching for certain Unicode
> characters, such as the Registered Trademark. That's the
> Unicode character 00AE. It's also a problem searching for a
> Japanese Yen symbol (Unicode character 00A5).
>
> I'm using the Lucene 2.0.0 jar file, and we used to use
> Lucene 1.4.2 jar file, where this used to work OK. But Lucene
> 2.0.0 doesn't work the same way.

I don't see anything that would have caused such a change. Below is a colored side-by-side diff of StandardTokenizer.jj at revisions 150560 and 409716, corresponding to the lucene_1_4_2 and lucene_2_0_0 tags, respectively:

<http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_0_0/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj?r1=150560&r2=409716&diff_format=h>

(Note that the JavaCC-targeted StandardTokenizer.jj was replaced at release 2.3.0 by the JFlex-targeted StandardTokenizerImpl.jflex for performance reasons; see <http://issues.apache.org/jira/browse/LUCENE-966>.)

> I see that the registered trademark is in the Lucene index
> file, so that's good. The problem comes when I try to search
> for these characters.
>
> I see that my query starts off OK, as this:
>
> ( (Locale:en) AND ( productName:(Digital¥^95) ) ) (if you
> cannot see the Japanese Yen symbol, it comes directly after "Digital")
>
> Note: the "^95" is just a boost factor, and is OK.
>
> I'm using StandardAnalyzer and StandardTokenizer to create a
> new QueryParser, and after I call the "parse" method of the
> QueryParser, my query becomes this:
>
> +Locale:en +productName:digital^95.0
>
> Notice that the Japanese Yen symbol is gone! I think it's
> because the StandardTokenizer.jj file doesn't handle this
> character, and so it throws it away.
>
> Is there any way to use a different Analyzer and/or
> Tokenizer, rather than building my own?
>
> And if I had created my Lucene indexes with the
> StandardAnalyzer, must I use the StandardAnalyzer and
> StandardTokenizer to search the index?

In order for the Yen and Registered Trademark symbols to appear in the index, you must have used a different analyzer for indexing than the one you're using for querying. This can lead to problems, as you have discovered.

The short answer is: you should use the same analyzer. The longer answer is that you should use "compatible" analyzers.
"Compatibility" means that the terms produced by the query-time analyzer have corresponding index terms. Of course, this condition is satisfied by using the same analyzer at both index- and query-time. An example of compatibile, but different, analyzers is index- or query-time synonym injection. I don't know why you weren't seeing this problem with Lucene 1.4.2, but is it possible that the 1.4.2-created index did *not* have these two symbols? If that were true, then you would get the hits you're looking for, though you might get some others that you don't want. Steve --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]