Hey everyone,

I'm running into a problem where some punctuation that I would actually want
to keep gets thrown out because they don't get tokenized.  By far the most
common case for this is ampersand, but it does happen with others as well.
My concern isn't even so much in that I need to be able to enforce that
punctuation in the search, but more that I need to know it was there when I
get the results.  I am attaching important word data to the payload of each
token, so if a "word" was just an ampersand, it disappears.  I took a quick
look at the StandardAnalyzer classes and it looks like it would be a pain to
try and modify that directly (I don't have much experience in
grammar/parsers).  A couple options come to mind, but I wanted to make sure
there wasn't a better, more elegant solution before I did something that
felt a little hacky:

1) Add a couple fields to the payload saying whether the previous/next word
is a single punctuation mark, and which it is.  Then the search can insert
the punctuation in the results.  The downside to this would be losing the
metadata that would have gone into the payload for that punctuation mark.

2) Do some sort of string replacement logic during indexing and searching to
change it into something that will get made into a token, but should not
appear naturally on its own in the text.  I usually shy away from solutions
like this, but sometimes they prove useful.

Has anyone done anything like this?  I don't want to lose most of
StandardAnalyzer's punctuation logic, but mainly I want to tokenize
punctuation if it appears by itself (surrounded by whitespace).  Thanks!

- Greg

Reply via email to