You could look into modifying the standard tokenizer lexer code to
handle punctuation (there is a patch in the issue tracker for the old
JavaCC grammar to handle punctuation), and there is also the GATE NLP
project, which has a fairly nice sentence splitter you might find
useful. Add a large position increment between your sentences and
keep the slop (the distance you allow for a hit) of your phrase
searches smaller than that gap, so a hit can never span two sentences.
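To make the position increment idea concrete, here is a rough sketch in plain Java. It does not use the Lucene API; it only simulates token positions and a simplified, in-order form of phrase slop. The class name SentenceGapSketch, the SENTENCE_GAP value and the rule "any trailing punctuation starts a gap" are my own assumptions for illustration. In Lucene itself you would set the increment in a custom TokenFilter through PositionIncrementAttribute and set the slop on a PhraseQuery, and a real sentence splitter would handle cases like "Dr. Smith" better than this regex does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation only: mimics position increments and a simplified slop check.
class SentenceGapSketch {
    // Hypothetical gap size; it must be larger than any slop you query with.
    static final int SENTENCE_GAP = 100;

    final List<String> terms = new ArrayList<>();
    final List<Integer> positions = new ArrayList<>();

    public SentenceGapSketch(String text) {
        int pos = -1;
        int increment = 1;
        for (String raw : text.trim().split("\\s+")) {
            // Strip trailing punctuation so "Smith." is indexed as "Smith".
            String term = raw.replaceAll("[.,;:!?]+$", "");
            if (term.isEmpty()) {
                // Empty or punctuation-only token: index nothing, but open a gap.
                increment = 1 + SENTENCE_GAP;
                continue;
            }
            pos += increment;
            terms.add(term);
            positions.add(pos);
            // If punctuation was stripped, the NEXT term gets the large increment.
            increment = term.length() < raw.length() ? 1 + SENTENCE_GAP : 1;
        }
    }

    // True if the terms occur in order and the extra distance between the
    // first and last term is at most slop (simpler than Lucene's real slop).
    public boolean matches(int slop, String... phrase) {
        int n = phrase.length;
        if (n == 0) {
            return false;
        }
        List<String> wanted = Arrays.asList(phrase);
        for (int i = 0; i + n <= terms.size(); i++) {
            if (!terms.subList(i, i + n).equals(wanted)) {
                continue;
            }
            int extra = positions.get(i + n - 1) - positions.get(i) - (n - 1);
            if (extra <= slop) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SentenceGapSketch a = new SentenceGapSketch("I met John Smith. He left.");
        SentenceGapSketch b = new SentenceGapSketch("Ask John. Smith went home.");
        System.out.println(a.matches(0, "John", "Smith")); // true
        System.out.println(b.matches(0, "John", "Smith")); // false
        System.out.println(b.matches(5, "John", "Smith")); // false, gap exceeds slop
    }
}
```

With this layout a single query for "John Smith" covers "John Smith." and "John Smith," as well, while "John. Smith" stays out of reach as long as the slop is below the gap.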
I hope this helps.
karl
On 30 Sep 2009, at 05:54, Max Lynch wrote:
I would like my searches to match "John Smith" when John Smith is in a
document, but not separated with punctuation. For example, when I was
using StandardAnalyzer, "John. Smith" was matching, which is wrong for
me. Right now I am using WhitespaceAnalyzer but instead searching for
"John Smith" "John Smith." "John Smith," etc., which seems like a dumb
thing to be doing. Can I separate the punctuation but keep the
analyzer aware of where the punctuation occurred in my matching term?
Thanks.