Paul,
You should also check out ICUTokenizer/DefaultICUTokenizerConfig, which adds
better handling for some languages on top of UAX#29 Word Break rules
conformance, and also finds token boundaries where the writing system (a.k.a.
script) changes. This is intended to be extensible per script.
The root br
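To illustrate the script-boundary idea Steve mentions: ICUTokenizer (in the lucene-analyzers-icu module) first divides text into runs of a single script before applying per-script break rules. The sketch below approximates only that run-splitting step using the JDK's Character.UnicodeScript; it is not ICUTokenizer's actual implementation, just an illustration of the concept.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

public class ScriptRuns {
    // Split text into runs that each use a single writing system.
    // COMMON/INHERITED characters (spaces, punctuation, combining marks)
    // stay attached to the preceding run in this sketch.
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        UnicodeScript current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            if (s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED) {
                if (current != null && s != current) {
                    runs.add(text.substring(start, i)); // script changed: close the run
                    start = i;
                }
                current = s;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) runs.add(text.substring(start));
        return runs;
    }

    public static void main(String[] args) {
        // Latin run followed by a Cyrillic run
        System.out.println(scriptRuns("Moscow Москва"));
    }
}
```

A real implementation would then hand each run to a breaker appropriate for that script, which is what "extensible per script" refers to.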
On 01/10/2014 18:42, Steve Rowe wrote:
Paul,
Boilerplate upgrade recommendation: consider using the most recent Lucene
release (4.10.1) - it’s the most stable, performant, and featureful release
available, and many bugs have been fixed since the 4.1 release.
Yeah sure, I did try this and hit a
Paul,
Boilerplate upgrade recommendation: consider using the most recent Lucene
release (4.10.1) - it’s the most stable, performant, and featureful release
available, and many bugs have been fixed since the 4.1 release.
FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese,
s are skipped over and not returned as tokens.
>>
>> Steve
>>
>> On Sep 30, 2014, at 3:54 PM, Paul Taylor wrote:
>>
>>> Does StandardTokenizer remove punctuation (in Lucene 4.1)
>>>
>>> I'm just trying to move back to Stan
On 01/10/2014 08:08, Dawid Weiss wrote:
Hi Steve,
I have to admit I also find it frequently useful to include
punctuation as tokens (even if it's filtered out by subsequent token
filters for indexing, it's a useful to-have for other NLP tasks). Do
you think it'd be possible (read: relatively eas
as tokens; all other sequences between boundaries are skipped over and not
> returned as tokens.
>
> Steve
>
> On Sep 30, 2014, at 3:54 PM, Paul Taylor wrote:
>
>> Does StandardTokenizer remove punctuation (in Lucene 4.1)
>>
>> I'm just trying to move back to Stan
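What Dawid is asking for can be sketched with the JDK's UAX#29-based java.text.BreakIterator: emit every non-whitespace segment, so punctuation survives as tokens for downstream filters to keep or drop. This is an illustration of the desired behavior, not an actual Lucene patch:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class KeepPunct {
    // Split on word boundaries but keep punctuation segments as tokens;
    // only pure-whitespace segments are dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String seg = text.substring(start, end);
            if (!seg.isBlank()) tokens.add(seg); // punctuation kept, whitespace skipped
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Wait, really?"));
        // "," and "?" come through as tokens
    }
}
```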
Only those sequences between boundaries that contain letters and/or digits are
returned as tokens; all other sequences between boundaries are skipped over and
not returned as tokens.
Steve
On Sep 30, 2014, at 3:54 PM, Paul Taylor wrote:
> Does StandardTokenizer remove punctuation (in Lucene 4.1)
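Steve's rule can be approximated outside Lucene with the JDK's UAX#29-based java.text.BreakIterator: split on word boundaries, then keep only segments containing a letter or digit. A minimal sketch (not StandardTokenizer itself, which is generated from JFlex grammar rules):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordTokens {
    // Keep only boundary-delimited segments that contain a letter or digit;
    // punctuation-only and whitespace-only segments are skipped, as Steve describes.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String seg = text.substring(start, end);
            if (seg.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(seg);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world! 42"));
        // "," and "!" are skipped; "Hello", "world", "42" are returned
    }
}
```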
Does StandardTokenizer remove punctuation (in Lucene 4.1)
I'm just trying to move back to StandardTokenizer from my own old custom
implementation because the newer version seems to have much better
support for Asian languages.
However this code excerpt fails on incrementToken(), implying that the
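A likely cause (an assumption here, since the excerpt itself is cut off): in Lucene 4.x a TokenStream must have reset() called before the first incrementToken(), and skipping it makes consumption fail. A minimal consume loop, assuming lucene-core and lucene-analyzers-common 4.1 on the classpath:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ConsumeTokens {
    public static void main(String[] args) throws Exception {
        StandardTokenizer ts =
            new StandardTokenizer(Version.LUCENE_41, new StringReader("Hello, world"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();                       // required before the first incrementToken()
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();                         // finalize offsets
        ts.close();
    }
}
```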