[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

Robert Muir (JIRA) Mon, 11 Feb 2013 05:25:15 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575776#comment-13575776 ]


Robert Muir commented on LUCENE-4766:
-------------------------------------

The positions are used for searching, offsets for highlighting.

So you can (unfortunately) set the offsets to whatever you want; it won't affect 
searches. Instead, it will only cause problems for highlighting. An example of 
this is: https://issues.apache.org/jira/browse/SOLR-4137

For a tokenfilter, it doesn't make sense to change offsets, because a tokenizer 
has already broken the document into words and mapped them back to their 
original location in the document.

If a tokenfilter REALLY needs to change offsets, then it's a sign it's 
subclassing the wrong analysis type and should be a tokenizer, because it's 
trying to break the document into words, not just alter the existing 
tokenization :)
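
To make the positions-versus-offsets point concrete, here is a minimal sketch in plain Java. It is not Lucene's API and not the attached patch: `Token`, `OffsetSketch`, and `expand` are hypothetical names used only for illustration. It assumes a filter that splits a token into capture groups, stacks the extra tokens at the same position (position increment 0), and leaves the offsets exactly as the tokenizer set them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetSketch {
    // Hypothetical stand-in for a token's attributes; not a Lucene class.
    static final class Token {
        final String term;
        final int positionIncrement; // consulted when searching
        final int startOffset;       // consulted when highlighting
        final int endOffset;

        Token(String term, int positionIncrement, int startOffset, int endOffset) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Emits one token per non-empty capturing group. Every emitted token keeps
    // the offsets of the original token; only the term and position change.
    static List<Token> expand(Token original, Pattern pattern) {
        List<Token> out = new ArrayList<>();
        Matcher m = pattern.matcher(original.term);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                String s = m.group(g);
                if (s == null || s.isEmpty()) {
                    continue;
                }
                // The first token advances the position; the rest are stacked on it.
                int posInc = out.isEmpty() ? original.positionIncrement : 0;
                out.add(new Token(s, posInc, original.startOffset, original.endOffset));
            }
        }
        if (out.isEmpty()) {
            out.add(original); // nothing matched: pass the token through unchanged
        }
        return out;
    }
}
```

With this shape, a highlighter that reads the offsets would always mark the whole original word, whichever sub-token matched the query.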

                
> Pattern token filter which emits a token for every capturing group
> ------------------------------------------------------------------
>
>                 Key: LUCENE-4766
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4766
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.1
>            Reporter: Clinton Gormley
>            Assignee: Simon Willnauer
>            Priority: Minor
>              Labels: analysis, feature, lucene
>             Fix For: 4.2
>
>         Attachments: LUCENE-4766.patch, LUCENE-4766.patch
>
>
> The PatternTokenizer either functions by splitting on matches, or allows you 
> to specify a single capture group.  This is insufficient for my needs. Quite 
> often I want to capture multiple overlapping tokens in the same position.
> I've written a pattern token filter which accepts multiple patterns and emits 
> tokens for every capturing group that is matched in any pattern.
> Patterns are not anchored to the beginning and end of the string, so each 
> pattern can produce multiple matches.
> For instance, a pattern like:
> {code}
>     "(([a-z]+)(\d*))"
> {code}
> when matched against: 
> {code}
>     "abc123def456"
> {code}
> would produce the tokens:
> {code}
>     abc123, abc, 123, def456, def, 456
> {code}
> Multiple patterns can be applied, e.g. these patterns could be used for 
> camelCase analysis:
> {code}
>     "([A-Z]{2,})",
>     "(?<![A-Z])([A-Z][a-z]+)",
>     "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
>     "([0-9]+)"
> {code}
> When matched against the string "letsPartyLIKEits1999_dude", they would 
> produce the tokens:
> {code}
>     lets, Party, LIKE, its, 1999, dude
> {code}
> If no token is emitted, the original token is preserved. 
> If the preserveOriginal flag is true, it will output the full original token 
> (i.e. "letsPartyLIKEits1999_dude") in addition to any matching tokens (but in 
> this case, if a matching token is identical to the original, it will only 
> emit one copy of the full token).
> Multiple patterns are required to allow overlapping captures, but this also 
> means that each pattern is less dense and easier to understand.
> This is my first Java code, so apologies if I'm doing something stupid.
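
The behaviour described in the quoted issue can be approximated with nothing but `java.util.regex`. The sketch below is not the attached patch: `CaptureGroupSketch` and `tokens` are hypothetical names, and two details are assumptions the description does not spell out: captures are ordered by their start index in the input, and the original token comes first when `preserveOriginal` is set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupSketch {
    // One captured group together with where it started in the input.
    static final class Hit {
        final int start;
        final String text;
        Hit(int start, String text) { this.start = start; this.text = text; }
    }

    static List<String> tokens(String original, List<Pattern> patterns, boolean preserveOriginal) {
        List<Hit> hits = new ArrayList<>();
        for (Pattern p : patterns) {
            Matcher m = p.matcher(original);
            // Patterns are unanchored, so find() may succeed several times.
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    String s = m.group(g);
                    if (s != null && !s.isEmpty()) {
                        hits.add(new Hit(m.start(g), s));
                    }
                }
            }
        }
        // List.sort is stable, so groups that start at the same index keep group order.
        hits.sort(Comparator.comparingInt((Hit h) -> h.start));

        List<String> out = new ArrayList<>();
        if (preserveOriginal || hits.isEmpty()) {
            out.add(original); // keep the original if asked to, or if nothing matched
        }
        for (Hit h : hits) {
            // Avoid a second copy of the full token when preserveOriginal already added it.
            if (preserveOriginal && h.text.equals(original)) {
                continue;
            }
            out.add(h.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Pattern> camel = Arrays.asList(
            Pattern.compile("([A-Z]{2,})"),
            Pattern.compile("(?<![A-Z])([A-Z][a-z]+)"),
            Pattern.compile("(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)"),
            Pattern.compile("([0-9]+)"));
        // prints [lets, Party, LIKE, its, 1999, dude]
        System.out.println(tokens("letsPartyLIKEits1999_dude", camel, false));
    }
}
```

Sorting by start index is what makes the four camelCase patterns interleave into reading order here; iterating pattern by pattern without the sort would list "LIKE" first.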

