[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

Clinton Gormley (JIRA) Mon, 11 Feb 2013 06:33:16 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575811#comment-13575811
 ]


Clinton Gormley commented on LUCENE-4766:
-----------------------------------------

OK, so I should redo this as a tokenizer, and set positionLengths correctly.

One issue is that, because there are multiple patterns, the emitted tokens can
overlap, e.g.:

{code}
   "foobarbaz" -> foo, foobar, oba, bar, baz
{code}

in which case I think I would need to emit:

{code}
    positions:         1, 1, 2, 3, 5
    position lengths:  2, 4, 2, 2, 1
    start offsets:     0, 0, 0, 0, 0
    end offsets:       3, 6, 3, 3, 3
{code}

Is this correct? It's starting to look quite complex...
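
As a sanity check on the offsets only, here is a minimal sketch that prints the character offsets java.util.regex reports for those five tokens. The single-literal patterns and the class and method names are hypothetical (the comment does not say which patterns produced the tokens), and the sketch says nothing about which position or positionLength values Lucene expects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapOffsets {
    /** Returns "text:start-end" for every capture-group match, in pattern order. */
    public static List<String> offsets(String input, String... regexes) {
        List<String> out = new ArrayList<>();
        for (String regex : regexes) {
            Matcher m = Pattern.compile(regex).matcher(input);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    if (m.start(g) < 0) {
                        continue; // group did not participate in this match
                    }
                    out.add(m.group(g) + ":" + m.start(g) + "-" + m.end(g));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical patterns, one literal per token from the example above.
        // Prints [foo:0-3, foobar:0-6, oba:2-5, bar:3-6, baz:6-9]
        System.out.println(
            offsets("foobarbaz", "(foo)", "(foobar)", "(oba)", "(bar)", "(baz)"));
    }
}
```

Matcher.start(g) and Matcher.end(g) are relative to the string being matched, so a token filter would presumably need to add the incoming token's own start offset to them.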
                
> Pattern token filter which emits a token for every capturing group
> ------------------------------------------------------------------
>
>                 Key: LUCENE-4766
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4766
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.1
>            Reporter: Clinton Gormley
>            Assignee: Simon Willnauer
>            Priority: Minor
>              Labels: analysis, feature, lucene
>             Fix For: 4.2
>
>         Attachments: LUCENE-4766.patch, LUCENE-4766.patch
>
>
> The PatternTokenizer either splits on matches or lets you specify a single
> capture group.  This is insufficient for my needs. Quite often I want to
> capture multiple overlapping tokens in the same position.
> I've written a pattern token filter which accepts multiple patterns and emits 
> tokens for every capturing group that is matched in any pattern.
> Patterns are not anchored to the beginning and end of the string, so each 
> pattern can produce multiple matches.
> For instance, a pattern like:
> {code}
>     "(([a-z]+)(\d*))"
> {code}
> when matched against: 
> {code}
>     "abc123def456"
> {code}
> would produce the tokens:
> {code}
>     abc123, abc, 123, def456, def, 456
> {code}
> Multiple patterns can be applied, e.g. these patterns could be used for
> camelCase analysis:
> {code}
>     "([A-Z]{2,})",
>     "(?<![A-Z])([A-Z][a-z]+)",
>     "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
>     "([0-9]+)"
> {code}
> When matched against the string "letsPartyLIKEits1999_dude", they would 
> produce the tokens:
> {code}
>     lets, Party, LIKE, its, 1999, dude
> {code}
> If no pattern produces a token, the original token is preserved.
> If the preserveOriginal flag is true, the filter also outputs the full
> original token (i.e. "letsPartyLIKEits1999_dude") in addition to any matching
> tokens. In that case, if a matching token is identical to the original, only
> one copy of the full token is emitted.
> Multiple patterns are required to allow overlapping captures, and they also
> make each pattern less dense and easier to understand.
> This is my first Java code, so apologies if I'm doing something stupid.
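
The behaviour described in the quoted issue can be approximated outside Lucene with plain java.util.regex. The following is a minimal sketch, not the attached patch: the class and method names are hypothetical, and ordering tokens by start offset and skipping empty captures are assumptions made so that the output matches the two examples in the description.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupSketch {
    /** A captured substring together with its start offset in the input. */
    private static final class Capture {
        final String text;
        final int start;
        Capture(String text, int start) { this.text = text; this.start = start; }
    }

    /** Emits one token per capturing group matched by any of the patterns. */
    public static List<String> tokens(String input, boolean preserveOriginal,
                                      String... regexes) {
        List<Capture> captures = new ArrayList<>();
        for (String regex : regexes) {
            Matcher m = Pattern.compile(regex).matcher(input);
            while (m.find()) { // unanchored, so one pattern may match several times
                for (int g = 1; g <= m.groupCount(); g++) {
                    String text = m.group(g);
                    if (text == null || text.isEmpty()) {
                        continue; // assumption: skip unmatched or empty groups
                    }
                    captures.add(new Capture(text, m.start(g)));
                }
            }
        }
        // Assumption: order by start offset. List.sort is stable, so captures
        // that share a start offset keep their pattern and group order.
        captures.sort(Comparator.comparingInt(c -> c.start));

        List<String> out = new ArrayList<>();
        if (preserveOriginal || captures.isEmpty()) {
            out.add(input); // keep the original if asked to, or if nothing matched
        }
        for (Capture c : captures) {
            if (preserveOriginal && c.text.equals(input)) {
                continue; // the original was already emitted once
            }
            out.add(c.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [abc123, abc, 123, def456, def, 456]
        System.out.println(tokens("abc123def456", false, "(([a-z]+)(\\d*))"));
        // Prints [lets, Party, LIKE, its, 1999, dude]
        System.out.println(tokens("letsPartyLIKEits1999_dude", false,
            "([A-Z]{2,})",
            "(?<![A-Z])([A-Z][a-z]+)",
            "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
            "([0-9]+)"));
    }
}
```

Calling tokens with preserveOriginal set to true on the camelCase input puts the full string first, followed by the same six tokens. A real token filter would also have to set offsets and position increments, which this sketch leaves out.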
