On 01/10/2014 08:08, Dawid Weiss wrote:
Hi Steve,
I have to admit I also find it frequently useful to include
punctuation as tokens (even if it's filtered out by subsequent token
filters for indexing, it's useful to have for other NLP tasks). Do
you think it'd be possible (read: relatively easy) to create an
analyzer (or a modification of the standard one's lexer) so that
punctuation is returned as a separate token type?
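Just to make the idea concrete, here is a rough pure-Java sketch of a
tokenizer that emits punctuation with its own token type. Untested, written
against the 4.x Tokenizer/attribute APIs; offsets are omitted for brevity
and supplementary characters (surrogate pairs) are not handled:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Toy sketch: runs of letters/digits come out as "<WORD>" tokens, each
// punctuation character as a single-char "<PUNCT>" token; whitespace is
// skipped entirely.
public final class PunctAwareTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
  private int pushback = -2; // -2: nothing buffered; -1: EOF already seen

  public PunctAwareTokenizer(Reader in) {
    super(in);
  }

  private int read() throws IOException {
    if (pushback != -2) {
      int c = pushback;
      pushback = -2;
      return c;
    }
    return input.read();
  }

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    int c = read();
    while (c != -1 && Character.isWhitespace(c)) {
      c = read(); // skip whitespace between tokens
    }
    if (c == -1) {
      return false;
    }
    if (Character.isLetterOrDigit(c)) {
      while (c != -1 && Character.isLetterOrDigit(c)) {
        termAtt.append((char) c);
        c = read();
      }
      pushback = c; // the char that ended the run belongs to the next token
      typeAtt.setType("<WORD>");
    } else {
      termAtt.append((char) c); // one punctuation char per token
      typeAtt.setType("<PUNCT>");
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pushback = -2;
  }
}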
Dawid
On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sar...@gmail.com> wrote:
Hi Paul,
StandardTokenizer implements the Word Boundaries rules in the Unicode Text
Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0,
which is the version supported by Lucene 4.1.0:
<http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
Only those sequences between boundaries that contain letters and/or digits are
returned as tokens; all other sequences between boundaries are skipped over and
not returned as tokens.
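For example, a quick (untested) sketch against the 4.1 API shows the
punctuation being dropped:

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class ShowTokens {
  public static void main(String[] args) throws Exception {
    StandardTokenizer ts =
        new StandardTokenizer(Version.LUCENE_41, new StringReader("Hi, there: 3.14!"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    TypeAttribute type = ts.addAttribute(TypeAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // Expected output: "Hi" <ALPHANUM>, "there" <ALPHANUM>, "3.14" <NUM>;
      // the comma, colon and exclamation mark are never returned.
      System.out.println(term + "\t" + type.type());
    }
    ts.end();
    ts.close();
  }
}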
Steve
Yep, I need punctuation; in fact, the only thing I usually want removed is
whitespace, yet I would like to take advantage of the fact that the new
tokenizer can recognise some word boundaries that are not based on
whitespace (in the case of some non-Western languages). I have modified
the tokenizer before but found it very difficult to understand; is it
possible/advisable to construct a tokenizer based on pure Java
code rather than one derived from a JFlex definition?
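For the whitespace-only part, I'm guessing something as simple as this
pure-Java CharTokenizer subclass would work (untested, and of course it
keeps punctuation glued to adjacent words and doesn't give the UAX#29
boundaries I'm after):

import java.io.Reader;
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.Version;

// Every non-whitespace character is a token character, so tokens are
// simply whitespace-separated chunks and punctuation is preserved.
public final class NonWhitespaceTokenizer extends CharTokenizer {
  public NonWhitespaceTokenizer(Version matchVersion, Reader in) {
    super(matchVersion, in);
  }

  @Override
  protected boolean isTokenChar(int c) {
    return !Character.isWhitespace(c);
  }
}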
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org