Let's take a quick step back and see if it helps. Why do you feel you need the StandardAnalyzer to solve your problem? What else are you gaining from it? Would you be better served by a WhitespaceTokenizer?
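
As a rough illustration of the whitespace-only alternative, here is a minimal plain-Java sketch. It is not the Lucene API; the class and method names are made up for this example, and it only approximates what a WhitespaceTokenizer does by splitting on runs of whitespace. The point is that a standalone ampersand survives as its own token.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for whitespace-only tokenization (hypothetical names,
// not Lucene classes).
final class WhitespaceSplitSketch {

    // Splits on runs of whitespace and keeps every non-empty piece,
    // so a standalone "&" becomes its own token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Barnes & Noble, Inc."));
    }
}
```

The trade-off is visible in the same input: punctuation attached to a word (the comma after "Noble") stays attached, and nothing is lowercased, so you would have to add back whatever StandardAnalyzer behavior you still need.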

That being said, hacking up the grammar isn't as bad as you might think. There are actually two examples of the "grammar" in Lucene: one is for the StandardTokenizer and the other is for the WikipediaTokenizer. They are similar, but comparing the two may make it easier to see how such a grammar is put together.


On Dec 9, 2008, at 10:14 AM, Greg Shackles wrote:

Hey everyone,

I'm running into a problem where some punctuation that I actually want to keep gets thrown out because it doesn't get tokenized. By far the most common case is the ampersand, but it happens with others as well. My concern isn't so much that I need to enforce that punctuation in the search, but that I need to know it was there when I get the results. I am attaching important word data to the payload of each token, so if a "word" was just an ampersand, it disappears. I took a quick look at the StandardAnalyzer classes, and it looks like it would be a pain to try to modify them directly (I don't have much experience with grammars/parsers). A couple of options come to mind, but I wanted to make sure there wasn't a better, more elegant solution before I did something that felt a little hacky:

1) Add a couple of fields to the payload saying whether the previous/next word is a single punctuation mark, and which one it is. Then the search can insert the punctuation in the results. The downside would be losing the metadata that would have gone into the payload for that punctuation mark.

2) Do some sort of string replacement logic during indexing and searching to change it into something that will get made into a token, but should not appear naturally on its own in the text. I usually shy away from solutions like this, but sometimes they prove useful.
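
Option 2 could be sketched roughly as follows in plain Java, before the text ever reaches the analyzer. Everything here is an assumption for illustration: the class name, the placeholder words, and the choice of lowercase letters-only placeholders (picked so that a letter-based tokenizer with lowercasing should leave them intact as single tokens). Only a single character surrounded by whitespace is rewritten, so "R&D" is left for the analyzer to handle as usual.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre/post-processing helper; none of these names come from Lucene.
final class PunctuationPlaceholderSketch {

    // Made-up placeholder words that should not occur naturally in the text.
    private static final Map<String, String> PLACEHOLDERS = new LinkedHashMap<>();
    static {
        PLACEHOLDERS.put("&", "zzampersandzz");
        PLACEHOLDERS.put("+", "zzpluszz");
    }

    // One non-whitespace character bounded by whitespace or the string edges.
    private static final Pattern STANDALONE = Pattern.compile("(?<!\\S)(\\S)(?!\\S)");

    // Run on the raw text at index time, and on the query text at search time.
    static String encode(String text) {
        Matcher m = STANDALONE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String placeholder = PLACEHOLDERS.get(m.group(1));
            // Characters without a placeholder (e.g. a lone letter) are kept as-is.
            String replacement = placeholder != null ? placeholder : m.group(1);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Run on each token when building results, to restore the original mark.
    static String decodeToken(String token) {
        for (Map.Entry<String, String> e : PLACEHOLDERS.entrySet()) {
            if (e.getValue().equals(token)) {
                return e.getKey();
            }
        }
        return token;
    }
}
```

Because the placeholder becomes an ordinary token, it can carry its own payload, which is the metadata option 1 would lose. The cost is that the same encoding has to be applied consistently on both the indexing and the search side.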

Has anyone done anything like this? I don't want to lose most of StandardAnalyzer's punctuation logic, but mainly I want to tokenize punctuation if it appears by itself (surrounded by whitespace). Thanks!

- Greg

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ










