Hi Steve, I have to admit I also frequently find it useful to include punctuation as tokens (even if it's filtered out by subsequent token filters for indexing, it's useful to have for other NLP tasks). Do you think it'd be possible (read: relatively easy) to create an analyzer (or a modification of the standard one's lexer) so that punctuation is returned as a separate token type?
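
To make the idea concrete, here is a minimal sketch of "punctuation as a separate token type". It is not Lucene code: it assumes the JDK's java.text.BreakIterator as a stand-in for UAX#29 word boundaries (its rules are not identical to StandardTokenizer's JFlex grammar), and the names PunctuationAwareSegmenter, Token, WORD and PUNCTUATION are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PunctuationAwareSegmenter {

    /** Hypothetical token types; these are not Lucene's type names. */
    public enum Type { WORD, PUNCTUATION }

    /** One segment between two word boundaries, plus its type. */
    public static final class Token {
        public final String text;
        public final Type type;

        Token(String text, Type type) {
            this.text = text;
            this.type = type;
        }
    }

    /**
     * Splits the input at word boundaries. Segments containing a letter or
     * digit become WORD tokens; other non-whitespace segments become
     * PUNCTUATION tokens instead of being skipped; whitespace is dropped.
     */
    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        // Assumption: the JDK word iterator approximates UAX#29 boundaries.
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(input);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = input.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(new Token(segment, Type.WORD));
            } else if (!segment.codePoints().allMatch(Character::isWhitespace)) {
                tokens.add(new Token(segment, Type.PUNCTUATION));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one token per line as "<type> TAB <text>".
        for (Token t : tokenize("Hello, world!!!")) {
            System.out.println(t.type + "\t" + t.text);
        }
    }
}
```

A downstream filter could then drop PUNCTUATION tokens before indexing while other NLP consumers keep them. Whether the same split is easy to express inside the standard tokenizer's lexer is exactly the open question.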
Dawid

On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sar...@gmail.com> wrote:
> Hi Paul,
>
> StandardTokenizer implements the Word Boundaries rules in the Unicode Text
> Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode
> 6.1.0, which is the version supported by Lucene 4.1.0:
> <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>
> Only those sequences between boundaries that contain letters and/or digits
> are returned as tokens; all other sequences between boundaries are skipped
> over and not returned as tokens.
>
> Steve
>
> On Sep 30, 2014, at 3:54 PM, Paul Taylor <paul_t...@fastmail.fm> wrote:
>
>> Does StandardTokenizer remove punctuation (in Lucene 4.1)?
>>
>> I'm just trying to move back to StandardTokenizer from my own old custom
>> implementation because the newer version seems to have much better support
>> for Asian languages.
>>
>> However, this code fails on incrementToken(), implying that the !!! are
>> removed from the output, yet looking at the jflex classes I can't see anything
>> to indicate punctuation is removed. Is it removed, and if so can I remove it?
>>
>> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION,
>>         new StringReader("!!!"));
>> assertNotNull(tokenizer);
>> tokenizer.reset();
>> assertTrue(tokenizer.incrementToken());
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org