Paul,
You should also check out ICUTokenizer/DefaultICUTokenizerConfig, which adds
better handling for some languages on top of UAX#29 Word Break rules
conformance, and also finds token boundaries where the writing system (a.k.a.
script) changes. This is intended to be extensible per script.
The root br
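To illustrate the script-boundary idea Steve mentions: ICUTokenizer (in the lucene-analyzers-icu module) first divides text into runs of a single script before applying per-script break rules. The sketch below approximates only that run-splitting step using the JDK's Character.UnicodeScript; it is not ICUTokenizer's actual implementation, just an illustration of the concept.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

public class ScriptRuns {
    // Split text into runs that each use a single writing system.
    // COMMON/INHERITED characters (spaces, punctuation, combining marks)
    // stay attached to the preceding run in this sketch.
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        UnicodeScript current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            if (s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED) {
                if (current != null && s != current) {
                    runs.add(text.substring(start, i)); // script changed: close the run
                    start = i;
                }
                current = s;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) runs.add(text.substring(start));
        return runs;
    }

    public static void main(String[] args) {
        // Latin run followed by a Cyrillic run
        System.out.println(scriptRuns("Moscow Москва"));
    }
}
```

A real implementation would then hand each run to a breaker appropriate for that script, which is what "extensible per script" refers to.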
On 01/10/2014 18:42, Steve Rowe wrote:
Paul,
Boilerplate upgrade recommendation: consider using the most recent Lucene
release (4.10.1) - it’s the most stable, performant, and featureful release
available, and many bugs have been fixed since the 4.1 release.
Yeah sure, I did try this and hit a
Paul,
Boilerplate upgrade recommendation: consider using the most recent Lucene
release (4.10.1) - it’s the most stable, performant, and featureful release
available, and many bugs have been fixed since the 4.1 release.
FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese,
s are skipped over and not returned as tokens.
>>
>> Steve
>>
>> On Sep 30, 2014, at 3:54 PM, Paul Taylor wrote:
>>
>>> Does StandardTokenizer remove punctuation (in Lucene 4.1)
>>>
>>> I'm just trying to move back to Stan
On 01/10/2014 08:08, Dawid Weiss wrote:
Hi Steve,
I have to admit I also find it frequently useful to include
punctuation as tokens (even if it's filtered out by subsequent token
filters for indexing, it's a useful to-have for other NLP tasks). Do
you think it'd be possible (read: relatively eas
as tokens; all other sequences between boundaries are skipped over and not
> returned as tokens.
>
> Steve
>
> On Sep 30, 2014, at 3:54 PM, Paul Taylor wrote:
>
>> Does StandardTokenizer remove punctuation (in Lucene 4.1)
>>
>> I'm just trying to move back to Stan
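What Dawid is asking for can be sketched with the JDK's UAX#29-based java.text.BreakIterator: emit every non-whitespace segment, so punctuation survives as tokens for downstream filters to keep or drop. This is an illustration of the desired behavior, not an actual Lucene patch:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class KeepPunct {
    // Split on word boundaries but keep punctuation segments as tokens;
    // only pure-whitespace segments are dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String seg = text.substring(start, end);
            if (!seg.isBlank()) tokens.add(seg); // punctuation kept, whitespace skipped
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Wait, really?"));
        // "," and "?" come through as tokens
    }
}
```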
Only those sequences between boundaries that contain letters and/or digits are
returned as tokens; all other sequences between boundaries are skipped over and
not returned as tokens.
Steve
On Sep 30, 2014, at 3:54 PM, Paul Taylor wrote:
> Does StandardTokenizer remove punctuation (in Lucene 4.1)
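Steve's rule can be approximated outside Lucene with the JDK's UAX#29-based java.text.BreakIterator: split on word boundaries, then keep only segments containing a letter or digit. A minimal sketch (not StandardTokenizer itself, which is generated from JFlex grammar rules):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordTokens {
    // Keep only boundary-delimited segments that contain a letter or digit;
    // punctuation-only and whitespace-only segments are skipped, as Steve describes.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String seg = text.substring(start, end);
            if (seg.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(seg);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world! 42"));
        // "," and "!" are skipped; "Hello", "world", "42" are returned
    }
}
```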
Does StandardTokenizer remove punctuation (in Lucene 4.1)
I'm just trying to move back to StandardTokenizer from my own old custom
implementation because the newer version seems to have much better
support for Asian languages.
However this code excerpt fails on incrementToken(), implying that the
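A likely cause (an assumption here, since the excerpt itself is cut off): in Lucene 4.x a TokenStream must have reset() called before the first incrementToken(), and skipping it makes consumption fail. A minimal consume loop, assuming lucene-core and lucene-analyzers-common 4.1 on the classpath:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ConsumeTokens {
    public static void main(String[] args) throws Exception {
        StandardTokenizer ts =
            new StandardTokenizer(Version.LUCENE_41, new StringReader("Hello, world"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();                       // required before the first incrementToken()
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();                         // finalize offsets
        ts.close();
    }
}
```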