Paul,

Boilerplate upgrade recommendation: consider using the most recent Lucene 
release (4.10.1) - it’s the most stable, performant, and featureful release 
available, and many bugs have been fixed since the 4.1 release.

FYI, for Chinese, Japanese, Korean, Thai, and other languages that don’t use 
whitespace to denote word boundaries, StandardTokenizer finds no word 
boundaries other than those around punctuation.  Note that Lucene 4.1 does have 
specialized tokenizers for Simplified Chinese and Japanese: the smartcn and 
kuromoji analysis modules, respectively.
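
For example, here’s a minimal sketch of running the kuromoji analyzer over a 
string to see what tokens it produces (assuming the lucene-analyzers-kuromoji 
4.1 jar is on the classpath; the class name, field name, and sample text below 
are just placeholders):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class KuromojiDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new JapaneseAnalyzer(Version.LUCENE_41);
    // Tokenize a short Japanese string; kuromoji segments it into words
    // even though the input contains no whitespace.
    TokenStream ts = analyzer.tokenStream("body", new StringReader("関西国際空港"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());  // one segmented term per line
    }
    ts.end();
    ts.close();
  }
}

SmartChineseAnalyzer in the smartcn module can be used the same way.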

It is possible to construct a tokenizer in pure Java code - there are several 
examples of this in Lucene 4.1, e.g. PatternTokenizer, and CharTokenizer and 
its subclasses WhitespaceTokenizer and LetterTokenizer.
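
For instance, here’s a rough sketch against the 4.1 CharTokenizer API of a 
tokenizer that splits only on whitespace, so punctuation stays in the output 
(the class name is made up, and this is essentially what WhitespaceTokenizer 
already does):

import java.io.Reader;
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.Version;

public final class KeepPunctuationTokenizer extends CharTokenizer {

  public KeepPunctuationTokenizer(Version matchVersion, Reader in) {
    super(matchVersion, in);
  }

  @Override
  protected boolean isTokenChar(int c) {
    // Everything except whitespace is part of a token, so punctuation is
    // kept attached to the adjacent characters rather than discarded.
    return !Character.isWhitespace(c);
  }
}

Note that CharTokenizer can only decide which characters separate tokens, so it 
can’t emit punctuation as separate tokens on its own - for that you’d probably 
need a custom TokenFilter or a modified JFlex grammar.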

Steve
www.lucidworks.com

On Oct 1, 2014, at 4:04 AM, Paul Taylor <paul_t...@fastmail.fm> wrote:

> On 01/10/2014 08:08, Dawid Weiss wrote:
>> Hi Steve,
>> 
>> I have to admit I also find it frequently useful to include
>> punctuation as tokens (even if it's filtered out by subsequent token
>> filters for indexing, it's a useful to-have for other NLP tasks). Do
>> you think it'd be possible (read: relatively easy) to create an
>> analyzer (or a modification of the standard one's lexer) so that
>> punctuation is returned as a separate token type?
>> 
>> Dawid
>> 
>> 
>> On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sar...@gmail.com> wrote:
>>> Hi Paul,
>>> 
>>> StandardTokenizer implements the Word Boundaries rules in the Unicode Text 
>>> Segmentation Standard Annex UAX#29 - here’s the relevant section for 
>>> Unicode 6.1.0, which is the version supported by Lucene 4.1.0: 
>>> <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>>> 
>>> Only those sequences between boundaries that contain letters and/or digits 
>>> are returned as tokens; all other sequences between boundaries are skipped 
>>> over and not returned as tokens.
>>> 
>>> Steve
> Yep, I need punctuation; in fact the only thing I usually want removed is 
> whitespace. Still, I would like to take advantage of the fact that the new 
> tokenizer can recognise some word boundaries that are not based on whitespace 
> (in the case of some non-western languages). I have modified the tokenizer 
> before but found it very difficult to understand. Is it possible/advisable to 
> construct a tokenizer based on pure Java code rather than one derived from a 
> JFlex definition?
> 

