You could look into modifying the standard tokenizer lexer code to handle punctuation (there is a patch in the issue tracker for the old JavaCC grammar that handles punctuation), and there is also the GATE NLP project, which has a fairly nice sentence splitter you might find useful. Add a large position increment between your sentences and limit your searches by how much distance you allow for a hit.
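To illustrate the idea (this is a plain-Java sketch of the position-increment trick, not actual Lucene code — the sentence-splitting regex, the gap size, and all names here are made up for the example): tokens within a sentence get consecutive positions, each sentence boundary adds a large gap, and a phrase match that requires adjacent positions can then never match across sentences.

```java
import java.util.*;

// Sketch of the position-increment idea: consecutive positions inside
// a sentence, a large gap at each sentence boundary.
public class SentenceGapDemo {
    static final int SENTENCE_GAP = 100; // hypothetical gap size

    // Map each token to a position, inserting the gap at sentence boundaries.
    static Map<Integer, String> positions(String text) {
        Map<Integer, String> byPos = new LinkedHashMap<>();
        int pos = 0;
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            for (String tok : sentence.replaceAll("[^\\p{L}\\p{N} ]", "").split("\\s+")) {
                if (!tok.isEmpty()) byPos.put(pos++, tok.toLowerCase());
            }
            pos += SENTENCE_GAP; // gap between sentences
        }
        return byPos;
    }

    // A two-word "phrase" matches only if the terms sit on adjacent positions,
    // so the sentence gap makes cross-sentence matches impossible.
    static boolean phraseMatch(Map<Integer, String> byPos, String a, String b) {
        for (Map.Entry<Integer, String> e : byPos.entrySet()) {
            if (a.equals(e.getValue()) && b.equals(byPos.get(e.getKey() + 1))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Integer, String> doc1 = positions("I met John Smith yesterday.");
        Map<Integer, String> doc2 = positions("I met John. Smith arrived later.");
        System.out.println(phraseMatch(doc1, "john", "smith")); // true
        System.out.println(phraseMatch(doc2, "john", "smith")); // false
    }
}
```

In real Lucene you would do the equivalent inside a TokenFilter by setting a large position increment on the first token after a sentence boundary; a PhraseQuery with a slop smaller than the gap then cannot straddle sentences.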

I hope this helps.


       karl


On 30 Sep 2009, at 05:54, Max Lynch wrote:

I would like my searches to match "John Smith" when John Smith is in a
document, but not when the words are separated by punctuation. For example,
when I was using StandardAnalyzer, "John. Smith" was matching, which is
wrong for me. Right now I am using WhitespaceAnalyzer and instead searching
for "John Smith", "John Smith.", "John Smith,", etc., which seems like a
dumb thing to be doing. Can I separate out the punctuation but keep the
analyzer aware of where the punctuation occurred in my matching term?

Thanks.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org