Hey everyone,

I'm running into a problem where some punctuation that I want to keep gets thrown out because it is never tokenized. By far the most common case is the ampersand, but it happens with other characters as well. My concern isn't so much that I need to enforce that punctuation in the search; it's that I need to know it was there when I get the results. I am attaching important word data to the payload of each token, so if a "word" was just an ampersand, it disappears.

I took a quick look at the StandardAnalyzer classes, and it looks like it would be a pain to modify them directly (I don't have much experience with grammars/parsers). A couple of options come to mind, but I wanted to make sure there wasn't a better, more elegant solution before I did something that felt a little hacky:
1) Add a couple of fields to the payload saying whether the previous/next word is a single punctuation mark, and which one it is. The search can then insert the punctuation into the results. The downside is losing the metadata that would have gone into the payload for that punctuation mark.

2) Do some sort of string replacement during indexing and searching to change the punctuation into something that will be made into a token but should not appear naturally on its own in the text. I usually shy away from solutions like this, but sometimes they prove useful.

Has anyone done anything like this? I don't want to lose most of StandardAnalyzer's punctuation logic; mainly I want to tokenize punctuation when it appears by itself (surrounded by whitespace).

Thanks!
- Greg
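
P.S. To make option 2 concrete, here's a rough sketch of what I have in mind. It's plain Java with no Lucene calls. The class name, the "zzpunct<hex>zz" placeholder format, and the regexes are all made up for illustration, and I'm assuming the analyzer would keep a placeholder like that as a single lowercase token, which I haven't verified.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PunctuationPlaceholders {
    // One ASCII punctuation character with whitespace (or the start/end
    // of the string) on both sides, e.g. the "&" in "AT & T".
    private static final Pattern STANDALONE =
            Pattern.compile("(?<!\\S)\\p{Punct}(?!\\S)");

    // Hypothetical placeholder format: "zzpunct" + hex char code + "zz".
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\bzzpunct([0-9a-f]+)zz\\b");

    // Run on the text before indexing (and on the query before searching).
    public static String encode(String text) {
        Matcher m = STANDALONE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = "zzpunct" + Integer.toHexString(m.group().charAt(0)) + "zz";
            m.appendReplacement(out, word);
        }
        m.appendTail(out);
        return out.toString();
    }

    // Run on result text to turn placeholders back into punctuation.
    public static String decode(String text) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps "$" and "\" from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this, encode("AT & T") gives "AT zzpunct26zz T", while "R&D" is left alone because the ampersand isn't surrounded by whitespace. The placeholder would then get its own token and payload like any other word.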