OK, gotcha. I now see what you mean. StandardAnalyzer uses the StandardTokenizer, whereas StopAnalyzer uses the LowerCaseTokenizer, which divides text at non-letters and lowercases it. What you will most likely need to do is create a Tokenizer that outputs the original token and then outputs the parts of it that the LowerCaseTokenizer would produce. Have a look at the TokenStream API. Essentially, you need to implement the next() method for your new Tokenizer. You could probably just have your Tokenizer wrap the other two: use StandardTokenizer to get your first-level tokens, then, given a Token, run its text through the LowerCaseTokenizer and add whatever its next() returns to the stream.
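
To make the idea concrete, here is a rough sketch in plain Java with no Lucene dependency. It only illustrates the "emit the whole token, then its letter-only parts" logic. The class and method names (UnionTokenSketch, letterParts, expand) are made up for illustration, and whitespace splitting is just a crude stand-in for StandardTokenizer, which is far smarter about e-mail addresses, company names, and punctuation. A real implementation would put this logic behind next() in a Tokenizer instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not Lucene code and not a Lucene API.
final class UnionTokenSketch {

    // Second level: split at non-letters and lowercase each piece,
    // mimicking what the thread describes for LowerCaseTokenizer.
    static List<String> letterParts(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                parts.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    // First level: whitespace split (an assumed stand-in for StandardTokenizer).
    // Each first-level token is emitted lowercased, followed by its letter-only
    // parts when those differ from the whole token.
    static List<String> expand(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            List<String> parts = letterParts(raw);
            if (parts.isEmpty()) {
                continue; // tokens with no letters (e.g. "-") are dropped in this sketch
            }
            String whole = raw.toLowerCase();
            out.add(whole);
            if (parts.size() > 1 || !parts.get(0).equals(whole)) {
                out.addAll(parts);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("Hello XY&Z Corporation"));
    }
}
```

With the input "Hello XY&Z Corporation" this yields hello, xy&z, xy, z, corporation, which is the kind of union you describe. Note that the sketch drops purely numeric tokens and keeps trailing punctuation on the whole token, both of which StandardTokenizer handles differently.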

Once you have your Tokenizer working, you can wrap it in your new Analyzer and use the other filters as you see fit.

If you have "Lucene In Action", have a look at Chapter 4 for more details on how Tokenizers and TokenFilters work.

HTH,
Grant


On Mar 28, 2007, at 11:18 AM, TimF wrote:


Grant,
Thanks for your reply and the pointer to the custom code sample. I will be checking into that today. I did delve into the source for the out-of-the-box analyzers and was aware of what they did. Still, the StandardAnalyzer does not do what I want. The real difference between my needs and the results of the StandardAnalyzer is that what I want is the union of the StandardAnalyzer and the StopAnalyzer output. If you refer back to my original example...

An example of the data might be as follows:
   Hello XY&Z Corporation - [EMAIL PROTECTED]
I would like the following terms to come out of the analyzer:
 [hello]  [xy&z]  [corporation] [EMAIL PROTECTED] [com]  //this is the StandardAnalyzer output
as well as
  [xy] [z]  [abc] [example]

I figured that creating a custom analyzer is the only way to do that, but unfortunately I am not that familiar with how the analyzers "really" work internally (I am more of a mathematician than a lexicographer).

If you have any other thoughts or ideas, I would love to hear them.
Thanks,
Tim

Grant Ingersoll-6 wrote:

So, I think the answer is that StandardAnalyzer already has what you
state you want.  Is it, perhaps, that certain stopwords that you are
interested in are not currently being stopped?


--
View this message in context: http://www.nabble.com/Custom-Analyzer-Help-please-tf3469904.html#a9716016
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


