Paul,

Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it's the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release.
FYI, StandardTokenizer doesn't find word boundaries for Chinese, Japanese, Korean, Thai, and other languages that don't use whitespace to denote word boundaries, except those around punctuation. Note that Lucene 4.1 does have specialized tokenizers for Simplified Chinese and Japanese: the smartcn and kuromoji analysis modules, respectively.

It is possible to construct a tokenizer based on pure Java code - there are several examples of this in Lucene 4.1; see e.g. PatternTokenizer, and CharTokenizer and its subclasses WhitespaceTokenizer and LetterTokenizer. (A rough sketch of the PatternTokenizer route is appended below the quoted messages.)

Steve
www.lucidworks.com

On Oct 1, 2014, at 4:04 AM, Paul Taylor <paul_t...@fastmail.fm> wrote:

> On 01/10/2014 08:08, Dawid Weiss wrote:
>> Hi Steve,
>>
>> I have to admit I also find it frequently useful to include
>> punctuation as tokens (even if it's filtered out by subsequent token
>> filters for indexing, it's a useful to-have for other NLP tasks). Do
>> you think it'd be possible (read: relatively easy) to create an
>> analyzer (or a modification of the standard one's lexer) so that
>> punctuation is returned as a separate token type?
>>
>> Dawid
>>
>> On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sar...@gmail.com> wrote:
>>> Hi Paul,
>>>
>>> StandardTokenizer implements the Word Boundaries rules in the Unicode Text
>>> Segmentation Standard Annex UAX#29 - here's the relevant section for
>>> Unicode 6.1.0, which is the version supported by Lucene 4.1.0:
>>> <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>>>
>>> Only those sequences between boundaries that contain letters and/or digits
>>> are returned as tokens; all other sequences between boundaries are skipped
>>> over and not returned as tokens.
>>>
>>> Steve
>
> Yep, I need punctuation; in fact the only thing I usually want removed is
> whitespace, yet I would like to take advantage of the fact that the new
> tokenizer can recognise some word boundaries that are not based on whitespace
> in the case of some non-western languages. I have modified the tokenizer
> before but found it very difficult to understand; is it possible/advisable to
> construct a tokenizer just based on pure Java code rather than derived from a
> jflex definition?
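Here is a rough, untested sketch of the PatternTokenizer route, assuming the Lucene 4.x API (where Tokenizer constructors still take a Reader); the class name, the pattern, and the sample text are purely illustrative. It emits each run of letters/digits as a token and each punctuation or symbol character as its own token, so only whitespace is dropped. Note that the other pure-Java route, subclassing CharTokenizer and overriding isTokenChar(int), only decides which characters belong to tokens (the way WhitespaceTokenizer does), so it keeps punctuation attached to adjacent words rather than returning it as separate tokens.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KeepPunctuationDemo {
  public static void main(String[] args) throws IOException {
    // One token per run of letters/digits, plus one token per punctuation
    // or symbol character; whitespace never becomes a token.
    Pattern wordOrPunct = Pattern.compile("[\\p{L}\\p{N}]+|[\\p{P}\\p{S}]");

    Reader input = new StringReader("Hello, world! Don't split words.");
    // group = 0 returns every match of the pattern as a token
    // (group = -1 would split on the pattern instead).
    Tokenizer tok = new PatternTokenizer(input, wordOrPunct, 0);
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);

    tok.reset();
    while (tok.incrementToken()) {
      // Prints, one per line: Hello , world ! Don ' t split words .
      System.out.println(term.toString());
    }
    tok.end();
    tok.close();
  }
}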