Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

Steve Rowe Thu, 02 Oct 2014 07:03:05 -0700

Paul,

You should also check out ICUTokenizer/DefaultICUTokenizerConfig, which 
conforms to the UAX#29 Word Break rules, adds better handling for some 
languages, and also finds token boundaries when the writing system (aka 
script) changes.  This is intended to be extensible per script.
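
To make the script-change idea concrete, here is a minimal sketch in plain JDK 
Java.  It is not how ICUTokenizer is implemented; the class and method names 
(ScriptRuns, splitOnScriptChange) are made up for illustration, and treating 
COMMON and INHERITED characters as neutral is a simplifying assumption.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits text into runs wherever the Unicode script changes.
final class ScriptRuns {
    static List<String> splitOnScriptChange(String text) {
        List<String> runs = new ArrayList<>();
        UnicodeScript current = null; // script of the run in progress, if known
        int start = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            // Assumption: COMMON/INHERITED characters (spaces, digits, most
            // punctuation, combining marks) never start a new run on their own.
            boolean neutral = script == UnicodeScript.COMMON
                    || script == UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && script != current) {
                    runs.add(text.substring(start, i)); // close the previous run
                    start = i;
                }
                current = script;
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Latin "hello" followed directly by Cyrillic "mir": two runs.
        System.out.println(splitOnScriptChange("hello\u043c\u0438\u0440"));
    }
}
```

This only finds script runs; it does no word breaking inside a run.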


The root break iterator used by DefaultICUTokenizerConfig also ignores 
punctuation.  You can find its grammar at:

    lucene/analysis/icu/src/data/uax29/Default.rbbi
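
If you want to see the "ignores punctuation" effect without any Lucene or ICU 
dependency, the JDK's java.text.BreakIterator can stand in as a rough 
approximation.  This is a swapped-in technique, not the Default.rbbi grammar: 
the names WordSketch and words are hypothetical, and the rule "keep a segment 
only if it contains a letter or digit" is my assumption for the sketch.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: word segments from the JDK break iterator,
// dropping segments that are only punctuation or whitespace.
final class WordSketch {
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Assumption: a token must contain at least one letter or digit.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!")); // prints [Hello, world]
    }
}
```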

Steve

On Oct 1, 2014, at 4:22 PM, Paul Taylor <[email protected]> wrote:

> On 01/10/2014 18:42, Steve Rowe wrote:
>> Paul,
>> 
>> Boilerplate upgrade recommendation: consider using the most recent Lucene 
>> release (4.10.1) - it’s the most stable, performant, and featureful release 
>> available, and many bugs have been fixed since the 4.1 release.
> Yeah sure, I did try this and hit a load of errors but I certainly will do so.
>> FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese, 
>> Korean, Thai, and other languages that don’t use whitespace to denote word 
>> boundaries, except those around punctuation.  Note that Lucene 4.1 does have 
>> specialized tokenizers for Simplified Chinese and Japanese: the smartcn and 
>> kuromoji analysis modules, respectively.
> So for Chinese, Japanese, Korean, Thai etc. it's just identifying that the 
> chars are from said language, and then we can do something clever with it 
> with subsequent filters such as CJKBigramFilter, right?
> My big trouble is that my code is meant to deal with any language, and I 
> don't know what language it is in except by looking at the characters 
> themselves, AND I also have to deal with stuff that contains symbols, funny 
> punctuation, etc.
>> It is possible to construct a tokenizer based purely on Java code - there 
>> are several examples of this in Lucene 4.1, see e.g. PatternTokenizer, and 
>> CharTokenizer and its subclasses WhitespaceTokenizer and LetterTokenizer.
>> 
> Ah yes, I discovered this today. What I would really like is a version of 
> the JFlex StandardTokenizer but written in pure Java, making it easier to 
> tweak, but I'm a little concerned that if I naively write it from scratch I 
> may create something that doesn't perform very well.
> 
> Paul
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 
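
On the pure-Java tokenizers mentioned in the quoted text: CharTokenizer-style 
tokenizers boil down to a per-character "is this part of a token?" test.  The 
sketch below shows that idea with no Lucene dependency.  PredicateTokenizer 
and tokenize are hypothetical names, not Lucene API, and the code ignores 
concerns a real Tokenizer has to handle (Reader input, offsets, maximum token 
length).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper: emits maximal runs of code points accepted by a predicate.
final class PredicateTokenizer {
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start index of the token in progress, or -1 if none
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar.test(cp)) {
                if (start < 0) {
                    start = i; // first accepted character opens a token
                }
            } else if (start >= 0) {
                tokens.add(text.substring(start, i)); // rejected character closes it
                start = -1;
            }
            i += Character.charCount(cp);
        }
        if (start >= 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Letter-based splitting: prints [foo, bar, baz]
        System.out.println(tokenize("foo-bar baz42", Character::isLetter));
        // Whitespace-based splitting: prints [foo-bar, baz42]
        System.out.println(tokenize("foo-bar baz42", cp -> !Character.isWhitespace(cp)));
    }
}
```

A single pass over the code points like this is linear in the input length, so 
a hand-written tokenizer of this shape need not be slow; the hard part is 
getting the boundary rules right.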


