Re: Lucene Analyzer that can handle C++ vs C#

Chris Lu Fri, 11 Dec 2009 15:58:16 -0800

What we did in DBSight is provide a reserved list of words for every Lucene Analyzer.

This way you can handle words with special characters, like C++ and C#.

Common analyzers are usually not suitable for these special words.
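
To illustrate the reserved-word idea, here is a minimal sketch in plain Java. It is not DBSight's code and does not use the Lucene API; the class name ReservedWordTokenizer, the method tokenize, and the fallback rules are all assumptions made for this example. Each whitespace-separated chunk is trimmed of surrounding punctuation until it either matches a reserved word or ends in a letter or digit, so "C++." at the end of a sentence still yields the token "c++".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene or DBSight.
class ReservedWordTokenizer {

    // Splits text into lowercase tokens, keeping reserved words (e.g. "c++", "c#") intact.
    static List<String> tokenize(String text, Set<String> reserved) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue;
            }
            String candidate = chunk.toLowerCase(Locale.ROOT);

            // Drop leading punctuation such as "(" unless the chunk is already reserved.
            while (!candidate.isEmpty() && !reserved.contains(candidate)
                    && !Character.isLetterOrDigit(candidate.charAt(0))) {
                candidate = candidate.substring(1);
            }
            // Drop trailing punctuation such as "." or "," until a reserved word appears
            // or the chunk ends in a letter or digit.
            while (!candidate.isEmpty() && !reserved.contains(candidate)
                    && !Character.isLetterOrDigit(candidate.charAt(candidate.length() - 1))) {
                candidate = candidate.substring(0, candidate.length() - 1);
            }

            if (reserved.contains(candidate)) {
                tokens.add(candidate);
                continue;
            }
            // Fallback for ordinary words: split on anything that is not a letter or digit.
            for (String part : candidate.split("[^\\p{L}\\p{N}]+")) {
                if (!part.isEmpty()) {
                    tokens.add(part);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> reserved = Set.of("c++", "c#");
        System.out.println(tokenize("I know C++. Also C#, and (C++) and C.", reserved));
    }
}
```

In a real Lucene setup the same check would presumably live inside a custom Tokenizer or TokenFilter, applied at both index time and query time so that "C++" is tokenized the same way on both sides. The sketch also has known gaps: a reserved word that itself starts with punctuation and is followed by more punctuation would not be recognized.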

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: 
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 
Million Euro funding!


On 12/11/2009 9:09 AM, maxSchlein wrote:

Can someone please point me in the right direction?

We are creating an application that needs to be able to search on C++ and get
back docs that have C++ in them.  The StandardAnalyzer does not seem to index
the "+", so a search for "C++" will bring back docs that contain C++, C,
C#, etc.  The WhitespaceAnalyzer will index the "+", but if we have the
term "C++." (that is, C++ at the end of a sentence), it will index
"C++.", so a search for "C++" will not return the doc.  I have heard of maybe
using a CustomAnalyzer; however, it seems like there would actually need to be a
CustomFilter/CustomTokenizer. I looked at:
      - StandardAnalyzer.java
      - StandardFilter.java
      - StandardTokenizer.java
      - StandardTokenizerImpl.java
      - StandardTokenizerImpl.jflex

I would guess that the StandardTokenizer is where the changes would need to
be made to allow the "+" character, but I am unclear as to how.

Any and all help is greatly appreciated.

Going through all the documents and replacing "+" with the word "plus" is not
really an option for us.

