You could look into modifying the standard tokenizer lexer code to handle punctuation (there is a patch in the issue tracker for the old JavaCC grammar that handles punctuation), and there is also the GATE NLP project, which has a fairly nice sentence splitter you might find useful. Add a large position increment between your sentences and limit your searches by how much distance you allow for a hit.
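To illustrate the idea (this is a plain-Java sketch of the position-increment trick, not actual Lucene code — the sentence-splitting regex, the gap size, and all names here are made up for the example): tokens within a sentence get consecutive positions, each sentence boundary adds a large gap, and a phrase match that requires adjacent positions can then never match across sentences.

```java
import java.util.*;

// Sketch of the position-increment idea: consecutive positions inside
// a sentence, a large gap at each sentence boundary.
public class SentenceGapDemo {
    static final int SENTENCE_GAP = 100; // hypothetical gap size

    // Map each token to a position, inserting the gap at sentence boundaries.
    static Map<Integer, String> positions(String text) {
        Map<Integer, String> byPos = new LinkedHashMap<>();
        int pos = 0;
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            for (String tok : sentence.replaceAll("[^\\p{L}\\p{N} ]", "").split("\\s+")) {
                if (!tok.isEmpty()) byPos.put(pos++, tok.toLowerCase());
            }
            pos += SENTENCE_GAP; // gap between sentences
        }
        return byPos;
    }

    // A two-word "phrase" matches only if the terms sit on adjacent positions,
    // so the sentence gap makes cross-sentence matches impossible.
    static boolean phraseMatch(Map<Integer, String> byPos, String a, String b) {
        for (Map.Entry<Integer, String> e : byPos.entrySet()) {
            if (a.equals(e.getValue()) && b.equals(byPos.get(e.getKey() + 1))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Integer, String> doc1 = positions("I met John Smith yesterday.");
        Map<Integer, String> doc2 = positions("I met John. Smith arrived later.");
        System.out.println(phraseMatch(doc1, "john", "smith")); // true
        System.out.println(phraseMatch(doc2, "john", "smith")); // false
    }
}
```

In real Lucene you would do the equivalent inside a TokenFilter by setting a large position increment on the first token after a sentence boundary; a PhraseQuery with a slop smaller than the gap then cannot straddle sentences.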

I hope this helps.


       karl


On 30 Sep 2009, at 05:54, Max Lynch wrote:

I would like my searches to match "John Smith" when John Smith is in a
document, but not when the words are separated by punctuation. For example,
when I was using StandardAnalyzer, "John. Smith" was matching, which is
wrong for me. Right now I am using WhitespaceAnalyzer and instead searching
for "John Smith", "John Smith.", "John Smith,", etc., which seems like a
dumb thing to be doing. Can I separate out the punctuation but keep the
analyzer aware of where the punctuation occurred in my matching term?

Thanks.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org