Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-02 Thread Steve Rowe
Paul, You should also check out ICUTokenizer/DefaultICUTokenizerConfig, which adds better handling for some languages to UAX#29 Word Break rules conformance, and also finds token boundaries when the writing system (aka script) changes. This is intended to be extensible per script. The root br

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Paul Taylor
On 01/10/2014 18:42, Steve Rowe wrote: Paul, Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release. Yeah sure, I did try this and hit a

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Steve Rowe
Paul, Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release. FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese,

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Michael McCandless
s are skipped >> over and not returned as tokens. >> >> Steve >> >> On Sep 30, 2014, at 3:54 PM, Paul Taylor wrote: >> >>> Does StandardTokenizer remove punctuation (in Lucene 4.1) >>> >>> I'm just trying to move back to Stan

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Paul Taylor
On 01/10/2014 08:08, Dawid Weiss wrote: Hi Steve, I have to admit I also find it frequently useful to include punctuation as tokens (even if it's filtered out by subsequent token filters for indexing, it's a useful to-have for other NLP tasks). Do you think it'd be possible (read: relatively eas

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Dawid Weiss
as tokens; all other sequences between boundaries are skipped > over and not returned as tokens. > > Steve > > On Sep 30, 2014, at 3:54 PM, Paul Taylor wrote: > >> Does StandardTokenizer remove punctuation (in Lucene 4.1) >> >> I'm just trying to move back to Stan

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Steve Rowe
Only those sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens. Steve On Sep 30, 2014, at 3:54 PM, Paul Taylor wrote: > Does StandardTokenizer remove punctuation (in L
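
To illustrate the rule Steve describes (find word boundaries, then return only the sequences that contain a letter or digit), here is a minimal sketch. It is an assumption-laden stand-in, not Lucene code: it uses the JDK's java.text.BreakIterator to find boundaries instead of StandardTokenizer's UAX#29 grammar, and the class and method names (WordBoundarySketch, tokens) are hypothetical. Boundary placement can differ from Lucene's in edge cases; only the filtering idea is the point.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene.
public class WordBoundarySketch {

    // Split text at word boundaries (JDK BreakIterator, used here as a
    // stand-in for Lucene's UAX#29 rules) and keep only the segments that
    // contain at least one letter or digit. Segments made up solely of
    // punctuation or whitespace are skipped, not returned.
    static List<String> tokens(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(text);
        List<String> result = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Punctuation and spaces fall between boundaries but are dropped.
        System.out.println(tokens("Hello, world!"));
    }
}
```

With this filter in place, punctuation never shows up as a token, which matches the behaviour asked about in the original question. Keeping punctuation would mean returning the skipped segments instead of dropping them.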

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Jack Krupansky
ucene 4.1) Does StandardTokenizer remove punctuation (in Lucene 4.1) I'm just trying to move back to StandardTokenizer from my own old custom implementation because the newer version seems to have much better support for Asian languages. However, this code excerpt fails on incrementToken(), implying tha

Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Paul Taylor
Does StandardTokenizer remove punctuation (in Lucene 4.1) I'm just trying to move back to StandardTokenizer from my own old custom implementation because the newer version seems to have much better support for Asian languages. However, this code excerpt fails on incrementToken(), implying that the