I am having a problem searching for certain Unicode characters, such as the registered trademark symbol (®, Unicode character 00AE). The same problem occurs when searching for the Japanese yen symbol (¥, Unicode character 00A5).
I'm using the Lucene 2.0.0 jar file. We previously used the Lucene 1.4.2 jar file, where this worked, but Lucene 2.0.0 doesn't behave the same way. I can see that the registered trademark symbol is in the Lucene index file, so indexing looks fine. The problem comes when I try to search for these characters.

My query starts off OK, like this:

    ( (Locale:en) AND ( productName:(Digital¥^95) ) )

(If you cannot see the Japanese yen symbol, it comes directly after "Digital".) Note: the "^95" is just a boost factor and is not the problem.

I'm using StandardAnalyzer and StandardTokenizer to create a new QueryParser. After I call the "parse" method of the QueryParser, my query becomes this:

    +Locale:en +productName:digital^95.0

Notice that the Japanese yen symbol is gone! I think this is because the StandardTokenizer.jj file doesn't handle this character, so the tokenizer throws it away.

Is there any way to use a different Analyzer and/or Tokenizer, rather than building my own? And if I created my Lucene indexes with the StandardAnalyzer, must I use the StandardAnalyzer and StandardTokenizer to search the index?

Thanks.
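
To illustrate the suspected behavior, here is a minimal sketch in plain Java with no Lucene dependency. It assumes, for illustration only, a tokenizer that keeps runs of letters and digits and discards everything else, and compares it with a split on whitespace alone. The class and method names (TokenizerSketch, alnumTokens, whitespaceTokens) are hypothetical, and the first method is a simplified stand-in, not the actual StandardTokenizer.jj grammar.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {

    // Simplified stand-in for a grammar-based tokenizer: a token is a run of
    // letters or digits, lower-cased; every other character is discarded.
    // This is NOT Lucene's StandardTokenizer grammar, only an illustration.
    static List<String> alnumTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-alphanumeric character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Splits on whitespace only, so symbols such as U+00A5 and U+00AE
    // stay attached to the surrounding text and case is preserved.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String term = "Digital\u00A5"; // "Digital" followed by the yen symbol
        System.out.println(alnumTokens(term));      // yen symbol is dropped
        System.out.println(whitespaceTokens(term)); // yen symbol is kept
    }
}
```

Java classifies both U+00A5 and U+00AE as symbols rather than letters or digits, so the first method drops them, which matches what the parsed query shows. Lucene also ships a WhitespaceAnalyzer that splits only on whitespace and does not lower-case. Whether it suits this index is untested here, and terms produced at query time generally need to match the terms produced at index time, so switching analyzers would likely mean re-indexing with the same analyzer.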