Let's take a quick step back and see if that helps. Why do you feel you need the StandardAnalyzer to solve your problem? What else are you gaining from it? Would you be better served by a WhitespaceTokenizer?

That being said, hacking up the grammar isn't as bad as you might think. There are actually two examples of the grammar in Lucene: one is the StandardTokenizer and the other is the WikipediaTokenizer. They are similar, and comparing the two side by side may make the grammar easier to follow.
On Dec 9, 2008, at 10:14 AM, Greg Shackles wrote:
Hey everyone,
I'm running into a problem where some punctuation that I would actually want to keep gets thrown out because it doesn't get tokenized. By far the most common case for this is the ampersand, but it does happen with others as well.

My concern isn't so much that I need to be able to enforce that punctuation in the search, but more that I need to know it was there when I get the results. I am attaching important word data to the payload of each token, so if a "word" was just an ampersand, it disappears.

I took a quick look at the StandardAnalyzer classes and it looks like it would be a pain to try to modify that directly (I don't have much experience with grammars/parsers). A couple options come to mind, but I wanted to make sure there wasn't a better, more elegant solution before I did something that felt a little hacky:
1) Add a couple fields to the payload saying whether the previous/next word is a single punctuation mark, and which it is. Then the search can insert the punctuation in the results. The downside to this would be losing the metadata that would have gone into the payload for that punctuation mark.
2) Do some sort of string replacement logic during indexing and searching to change it into something that will get made into a token, but should not appear naturally on its own in the text. I usually shy away from solutions like this, but sometimes they prove useful.
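For what it's worth, option 2 could be as small as a pre-analysis pass over the raw text, applied identically at index and query time. The placeholder token and the regex below are only illustrative guesses, not anything Lucene provides:

import java.util.regex.Pattern;

public class PunctuationEscaper {
    // Hypothetical placeholder; must never occur naturally in the corpus
    private static final String AMP_TOKEN = "xampersandx";
    // Match an ampersand that stands alone, i.e. bounded by whitespace or string edges
    private static final Pattern LONE_AMP = Pattern.compile("(^|\\s)&(?=\\s|$)");

    // Apply to the raw text before both indexing and query parsing.
    public static String escape(String text) {
        return LONE_AMP.matcher(text).replaceAll("$1" + AMP_TOKEN);
    }

    // Apply the reverse mapping to text shown back in the results.
    public static String unescape(String text) {
        return text.replace(AMP_TOKEN, "&");
    }
}

Anything that displays stored text would need the reverse mapping, and the placeholder has to be something the analyzer keeps intact and that genuinely never appears in the data.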
Has anyone done anything like this? I don't want to lose most of StandardAnalyzer's punctuation logic, but mainly I want to tokenize punctuation if it appears by itself (surrounded by whitespace).
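For reference, a quick way to see the behavior being described is to dump the tokens each analyzer produces for a string containing a standalone ampersand. This sketch assumes the 2.x-era TokenStream API (Token next() / termText()) and a made-up sample string:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class CompareTokenizers {
    public static void main(String[] args) throws Exception {
        String text = "Barnes & Noble";   // made-up sample with a standalone ampersand
        dump(new StandardAnalyzer(), text);    // typically drops the lone "&"
        dump(new WhitespaceAnalyzer(), text);  // keeps "&" as its own token
    }

    private static void dump(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        System.out.print(analyzer.getClass().getSimpleName() + ":");
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.print(" [" + t.termText() + "]");
        }
        System.out.println();
    }
}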
Thanks!
- Greg
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ