Re: Custom Analyzer Help please

Grant Ingersoll Wed, 28 Mar 2007 08:23:52 -0800

OK, gotcha. I now see what you mean. StandardAnalyzer uses the StandardTokenizer, whereas StopAnalyzer uses the LowerCaseTokenizer, which divides text at non-letters. What you most likely will need to do is create a Tokenizer that outputs the original token, and outputs the parts of it based on the LowerCaseTokenizer. Have a look at the TokenStream API. Essentially, you need to implement the next() method for your new Tokenizer. You probably could just have your tokenizer wrap the other two, by using StandardTokenizer to get your first-level tokens, then, given a Token, run it through the LowerCaseTokenizer to see if it has any values for next(), which can be added to the stream.
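
To make the "wrap the other two" idea concrete, here is a rough sketch in plain Java. It deliberately does not use the Lucene classes: firstLevel() is only a crude whitespace-based stand-in for StandardTokenizer plus lowercasing, and letterParts() imitates the split-at-non-letters behaviour described for LowerCaseTokenizer. The class and method names are made up for illustration. A real implementation would do the same thing inside next() of a custom Tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain-Java stand-ins, not Lucene's actual tokenizers.
final class UnionTokenizerSketch {

    // First-level tokens: split on whitespace, lowercase, and drop tokens
    // that contain no letter or digit (a crude stand-in for StandardTokenizer).
    static List<String> firstLevel(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty() && t.chars().anyMatch(Character::isLetterOrDigit)) {
                out.add(t.toLowerCase());
            }
        }
        return out;
    }

    // Second-level tokens: divide at non-letters and lowercase,
    // mimicking what the thread describes for LowerCaseTokenizer.
    static List<String> letterParts(String token) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    // Union: emit each original token, then its parts, but only when the
    // second-level split produced something different from the original.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String token : firstLevel(text)) {
            out.add(token);
            List<String> parts = letterParts(token);
            boolean unchanged = parts.size() == 1 && parts.get(0).equals(token);
            if (!unchanged) {
                out.addAll(parts);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [hello, xy&z, xy, z, corporation]
        System.out.println(tokenize("Hello XY&Z Corporation"));
    }
}
```

Note that the sketch skips the parts when they are identical to the original token, so plain words are not emitted twice. Whether you want that de-duplication, and how positions should be handled for the extra tokens, is a design choice for your own Tokenizer.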

Once you have your Tokenizer working, you can wrap it in your new Analyzer and use the other filters as you see fit.

If you have "Lucene In Action", have a look at Chapter 4 for more details on how Tokenizers and TokenFilters work.


HTH,
Grant


On Mar 28, 2007, at 11:18 AM, TimF wrote:

Grant,
Thanks for your reply and the pointer to the custom code sample. I will be checking into that today. I did delve into the src for the OOTB analyzers and was aware of what they did. Still, the StandardAnalyzer does not do what
I want. The real difference between my needs and the results of the
StandardAnalyzer is that what I want is the union of the StandardAnalyzer
and the StopAnalyzer. If you refer back to my original example...

An example of the data might be as follows:
   Hello XY&Z Corporation - [EMAIL PROTECTED]
I would like the following terms to come out of the analyzer:
 [hello]  [xy&z]  [corporation] [EMAIL PROTECTED] [com]  //this is the
StandardAnalyzer output
as well as
  [xy] [z]  [abc] [example]
I figured that creating a custom analyzer is the only way to do that, but unfortunately I am not that familiar with how the analyzers "really" work
internally (I am more of a mathematician than a lexicon).

If you have any other thoughts or ideas I would love to hear.
Thanks,
Tim

Grant Ingersoll-6 wrote:
So, I think the answer is that StandardAnalyzer already has what you
state you want.  Is it, perhaps, that certain stopwords that you are
interested in are not currently being stopped?


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


