On 01/10/2014 18:42, Steve Rowe wrote:
> Paul,
>
> Boilerplate upgrade recommendation: consider using the most recent Lucene
> release (4.10.1) - it’s the most stable, performant, and featureful release
> available, and many bugs have been fixed since the 4.1 release.
Yeah, sure. I did try this and hit a load of errors, but I certainly will do so.
> FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese,
> Korean, Thai, and other languages that don’t use whitespace to denote word
> boundaries, except those around punctuation.  Note that Lucene 4.1 does have
> specialized tokenizers for Simplified Chinese and Japanese: the smartcn and
> kuromoji analysis modules, respectively.
So for Chinese, Japanese, Korean, Thai etc. it's just identifying that the chars are from said language, and then we can do something clever with them in subsequent filters such as CJKBigramFilter, right? My big trouble is that my code is meant to deal with any language, and I don't know what language the text is in except by looking at the characters themselves, AND I also have to deal with stuff that contains symbols, funny punctuation etc.
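E.g. something like this chain is what I'm picturing - just an untested sketch against the 4.1 analysis modules, and the analyzer class name is my own invention:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKBigramFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Sketch: StandardTokenizer tags Han/Hiragana/Katakana/Hangul characters with
// their own token types, and CJKBigramFilter uses those types to join adjacent
// CJK characters into bigrams; non-CJK tokens pass through unchanged.
public class AnyLanguageAnalyzer extends Analyzer {
  private static final Version MATCH_VERSION = Version.LUCENE_41;

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(MATCH_VERSION, reader);
    TokenStream result = new CJKBigramFilter(source);
    result = new LowerCaseFilter(MATCH_VERSION, result);
    return new TokenStreamComponents(source, result);
  }
}

If I've understood correctly, that covers the CJK side without having to detect the language up front, though Thai would presumably still come through as undivided runs.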
> It is possible to construct a tokenizer just based on pure java code - there
> are several examples of this in Lucene 4.1, see e.g. PatternTokenizer, and
> CharTokenizer and its subclasses WhitespaceTokenizer and LetterTokenizer.

Ah yes, I discovered this today. What I would really like is a version of the JFlex-generated StandardTokenizer written in pure Java, making it easier to tweak, but I'm a little concerned that if I naively write it from scratch I may create something that doesn't perform very well.
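For what it's worth, this is roughly the kind of starting point I had in mind - an untested sketch of a CharTokenizer subclass (the class name is made up):

import java.io.Reader;

import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.Version;

// Pure-Java tokenizer: emits runs of letters and digits as tokens and splits
// on everything else. Much cruder than the JFlex-generated StandardTokenizer,
// but easy to read and tweak.
public class LetterOrDigitTokenizer extends CharTokenizer {

  public LetterOrDigitTokenizer(Version matchVersion, Reader input) {
    super(matchVersion, input);
  }

  @Override
  protected boolean isTokenChar(int c) {
    // Keep a code point inside a token if it is a letter or digit in any script.
    return Character.isLetterOrDigit(c);
  }
}

My understanding is that CharTokenizer does the buffering for you, so the subclass is mostly just character classification, which hopefully limits how badly a naive version can perform.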

Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
