Hi Steve,

I have to admit I also frequently find it useful to include
punctuation as tokens (even if later token filters drop it before
indexing, it's handy to have for other NLP tasks). Do you think
it'd be possible (read: relatively easy) to create an analyzer
(or a modified version of the standard one's lexer) that returns
punctuation as a separate token type?
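
To make the idea concrete, here is a rough sketch that uses only the
JDK's java.text.BreakIterator, not Lucene or its JFlex grammar. The
JDK word iterator only approximates the UAX#29 word boundary rules, and
the names PunctuationTokens, Token, Type, tokenize and wordsOnly are
made up for illustration. Segments containing a letter or digit are
kept as words (the behaviour Steve describes below), segments that are
only whitespace are dropped, and everything else is kept as a
punctuation token.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PunctuationTokens {

    /** Hypothetical token types; these are not Lucene's type strings. */
    enum Type { WORD, PUNCTUATION }

    /** Hypothetical value class: a slice of the input plus its type. */
    static final class Token {
        final String text;
        final Type type;

        Token(String text, Type type) {
            this.text = text;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + ":" + text;
        }
    }

    /** Splits the input at word boundaries and labels each non-whitespace segment. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(input);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = input.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                // Contains a letter or digit: treat it as a word token.
                tokens.add(new Token(segment, Type.WORD));
            } else if (!segment.codePoints().allMatch(Character::isWhitespace)) {
                // No letters or digits and not pure whitespace: keep it as punctuation.
                tokens.add(new Token(segment, Type.PUNCTUATION));
            }
        }
        return tokens;
    }

    /** Mimics a downstream filter that drops punctuation before indexing. */
    static List<String> wordsOnly(String input) {
        List<String> result = new ArrayList<>();
        for (Token token : tokenize(input)) {
            if (token.type == Type.WORD) {
                result.add(token.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world!!!"));
        System.out.println(wordsOnly("!!!")); // prints [] because no segment has a letter or digit
    }
}
```

In Lucene itself the equivalent change would presumably live in the
tokenizer's grammar plus a type attribute on each token, so treat the
above purely as a picture of the desired output.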

Dawid


On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sar...@gmail.com> wrote:
> Hi Paul,
>
> StandardTokenizer implements the Word Boundaries rules in the Unicode Text 
> Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 
> 6.1.0, which is the version supported by Lucene 4.1.0: 
> <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>
> Only those sequences between boundaries that contain letters and/or digits 
> are returned as tokens; all other sequences between boundaries are skipped 
> over and not returned as tokens.
>
> Steve
>
> On Sep 30, 2014, at 3:54 PM, Paul Taylor <paul_t...@fastmail.fm> wrote:
>
>> Does StandardTokenizer remove punctuation (in Lucene 4.1)?
>>
>> I'm just trying to move back to StandardTokenizer from my own old custom
>> implementation because the newer version seems to have much better support
>> for Asian languages.
>>
>> However, this code excerpt fails on incrementToken(), implying that the !!! are
>> removed from the output, yet looking at the JFlex classes I can't see anything to
>> indicate that punctuation is removed. Is it removed, and if so can I remove it?
>>
>> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, 
>> new StringReader("!!!"));
>> assertNotNull(tokenizer);
>> tokenizer.reset();
>> assertTrue(tokenizer.incrementToken());
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
