[
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575739#comment-13575739
]
Adrien Grand commented on LUCENE-4766:
--------------------------------------
bq. I just wonder if we really should restrict our TF to not fix offsets? Kind
of an odd thing though. What should a tokenfilter like this do instead?
I think that in some cases it makes sense not to fix offsets. In the case
of the URL example ({{(https?://([a-zA-Z\-_0-9.]+))}}), I think it makes sense
to highlight the whole URL (including the leading http(s)://) even if the query
term is just {{www.mysite.com}}. On the other hand, it could be weird if the
goal was to split a long CamelCase token (letsPartyLIKEits1999_dude), but maybe
this should be done by a Tokenizer rather than a TokenFilter?
(No strong feeling here, I'd just like to see if we can find a way to commit
this patch without having to grow our TokenFilter exclusion list.)
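To make the offsets question concrete, here is a minimal sketch using plain
java.util.regex (it is not the patch's TokenFilter). The class name
UrlGroupOffsets, the method groupSpans and the sample sentence are made up for
illustration. It lists each capturing group of the URL pattern with its
character span: the outer group covers the whole URL, the inner group only the
host name. As I understand the discussion, a filter that does not fix offsets
would keep the whole-token span for both.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of LUCENE-4766.patch.
final class UrlGroupOffsets {

    // The URL pattern from the comment, written as a Java string literal.
    private static final Pattern URL =
            Pattern.compile("(https?://([a-zA-Z\\-_0-9.]+))");

    /** Returns one "text [start,end)" entry per capturing group of every match. */
    static List<String> groupSpans(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                out.add(m.group(g) + " [" + m.start(g) + "," + m.end(g) + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample input is an assumption; any text containing a URL works.
        for (String span : groupSpans("see http://www.mysite.com for details")) {
            System.out.println(span);
        }
    }
}
```

With corrected offsets a highlighter would mark only the narrower host-name
span for a query on {{www.mysite.com}}; with the original offsets it would
mark the whole URL.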
> Pattern token filter which emits a token for every capturing group
> ------------------------------------------------------------------
>
> Key: LUCENE-4766
> URL: https://issues.apache.org/jira/browse/LUCENE-4766
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 4.1
> Reporter: Clinton Gormley
> Assignee: Simon Willnauer
> Priority: Minor
> Labels: analysis, feature, lucene
> Fix For: 4.2
>
> Attachments: LUCENE-4766.patch, LUCENE-4766.patch
>
>
> The PatternTokenizer either functions by splitting on matches, or allows you
> to specify a single capture group. This is insufficient for my needs. Quite
> often I want to capture multiple overlapping tokens in the same position.
> I've written a pattern token filter which accepts multiple patterns and emits
> tokens for every capturing group that is matched in any pattern.
> Patterns are not anchored to the beginning and end of the string, so each
> pattern can produce multiple matches.
> For instance, a pattern like:
> {code}
> "(([a-z]+)(\d*))"
> {code}
> when matched against:
> {code}
> "abc123def456"
> {code}
> would produce the tokens:
> {code}
> abc123, abc, 123, def456, def, 456
> {code}
> Multiple patterns can be applied, e.g. these patterns could be used for
> camelCase analysis:
> {code}
> "([A-Z]{2,})",
> "(?<![A-Z])([A-Z][a-z]+)",
> "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
> "([0-9]+)"
> {code}
> When matched against the string "letsPartyLIKEits1999_dude", they would
> produce the tokens:
> {code}
> lets, Party, LIKE, its, 1999, dude
> {code}
> If no token is emitted, the original token is preserved.
> If the preserveOriginal flag is true, it will output the full original token
> (i.e. "letsPartyLIKEits1999_dude") in addition to any matching tokens (but in
> this case, if a matching token is identical to the original, it will only
> emit one copy of the full token).
> Multiple patterns are required to allow overlapping captures, but this also
> means that patterns are less dense and easier to understand.
> This is my first Java code, so apologies if I'm doing something stupid.
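The behaviour described in the issue can be approximated outside Lucene with a
short sketch. This is not the code from the attached patch: the class
CaptureGroupSketch and the method captureTokens are hypothetical names, it works
on plain strings rather than a TokenStream, and ordering the captures by start
offset is an assumption made so that the output matches the token lists quoted
above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "emit a token for every capturing group".
final class CaptureGroupSketch {

    /** A captured substring and the offset where it starts in the input. */
    private static final class Span {
        final int start;
        final String text;
        Span(int start, String text) { this.start = start; this.text = text; }
    }

    static List<String> captureTokens(String input, boolean preserveOriginal,
                                      String... regexes) {
        List<Span> spans = new ArrayList<>();
        for (String regex : regexes) {
            Matcher m = Pattern.compile(regex).matcher(input);
            while (m.find()) {  // patterns are unanchored, so collect every match
                for (int g = 1; g <= m.groupCount(); g++) {
                    String text = m.group(g);
                    // Skip groups that did not participate or matched nothing.
                    if (text == null || text.isEmpty()) continue;
                    // With preserveOriginal the full token is emitted once, below.
                    if (preserveOriginal && text.equals(input)) continue;
                    spans.add(new Span(m.start(g), text));
                }
            }
        }
        // Assumption: order by start offset; List.sort is stable, so captures
        // that start at the same offset keep their pattern and group order.
        spans.sort(Comparator.comparingInt(s -> s.start));

        List<String> tokens = new ArrayList<>();
        // Keep the original if requested, or if nothing was captured at all.
        if (preserveOriginal || spans.isEmpty()) tokens.add(input);
        for (Span s : spans) tokens.add(s.text);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(captureTokens("abc123def456", false, "(([a-z]+)(\\d*))"));
        System.out.println(captureTokens("letsPartyLIKEits1999_dude", false,
                "([A-Z]{2,})",
                "(?<![A-Z])([A-Z][a-z]+)",
                "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
                "([0-9]+)"));
    }
}
```

The two calls in main use the patterns and inputs from the description and
should reproduce the token lists given there. Passing true for
preserveOriginal puts the full input first, followed by the captures.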