Let's take a quick step back and see if that helps. Why do you feel you need the StandardAnalyzer to solve your problem? What else are you gaining from it? Would you be better served by a WhitespaceTokenizer?

That being said, hacking up the grammar isn't as bad as you might think. There are actually two examples of the grammar in Lucene: one is the StandardTokenizer and the other is the WikipediaTokenizer. They are similar, and comparing the two side by side may make the grammar easier to follow.
On Dec 9, 2008, at 10:14 AM, Greg Shackles wrote:
Hey everyone,
I'm running into a problem where some punctuation that I would actually want to keep gets thrown out because it doesn't get tokenized. By far the most common case for this is the ampersand, but it does happen with others as well.

My concern isn't so much that I need to be able to enforce that punctuation in the search, but more that I need to know it was there when I get the results. I am attaching important word data to the payload of each token, so if a "word" was just an ampersand, it disappears.

I took a quick look at the StandardAnalyzer classes and it looks like it would be a pain to try to modify that directly (I don't have much experience with grammars/parsers). A couple options come to mind, but I wanted to make sure there wasn't a better, more elegant solution before I did something that felt a little hacky:
1) Add a couple fields to the payload saying whether the previous/next word is a single punctuation mark, and which it is. Then the search can insert the punctuation in the results. The downside to this would be losing the metadata that would have gone into the payload for that punctuation mark.
2) Do some sort of string replacement logic during indexing and searching to change it into something that will get made into a token, but should not appear naturally on its own in the text. I usually shy away from solutions like this, but sometimes they prove useful.
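For what it's worth, option 2 could be as small as a pre-analysis pass over the raw text, applied identically at index and query time. The placeholder token and the regex below are only illustrative guesses, not anything Lucene provides:

import java.util.regex.Pattern;

public class PunctuationEscaper {
    // Hypothetical placeholder; must never occur naturally in the corpus
    private static final String AMP_TOKEN = "xampersandx";
    // Match an ampersand that stands alone, i.e. bounded by whitespace or string edges
    private static final Pattern LONE_AMP = Pattern.compile("(^|\\s)&(?=\\s|$)");

    // Apply to the raw text before both indexing and query parsing.
    public static String escape(String text) {
        return LONE_AMP.matcher(text).replaceAll("$1" + AMP_TOKEN);
    }

    // Apply the reverse mapping to text shown back in the results.
    public static String unescape(String text) {
        return text.replace(AMP_TOKEN, "&");
    }
}

Anything that displays stored text would need the reverse mapping, and the placeholder has to be something the analyzer keeps intact and that genuinely never appears in the data.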
Has anyone done anything like this? I don't want to lose most of StandardAnalyzer's punctuation logic, but mainly I want to tokenize punctuation if it appears by itself (surrounded by whitespace).
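For reference, a quick way to see the behavior being described is to dump the tokens each analyzer produces for a string containing a standalone ampersand. This sketch assumes the 2.x-era TokenStream API (Token next() / termText()) and a made-up sample string:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class CompareTokenizers {
    public static void main(String[] args) throws Exception {
        String text = "Barnes & Noble";   // made-up sample with a standalone ampersand
        dump(new StandardAnalyzer(), text);    // typically drops the lone "&"
        dump(new WhitespaceAnalyzer(), text);  // keeps "&" as its own token
    }

    private static void dump(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        System.out.print(analyzer.getClass().getSimpleName() + ":");
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.print(" [" + t.termText() + "]");
        }
        System.out.println();
    }
}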
Thanks!
- Greg
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ