Clinton Gormley created LUCENE-4766:
---------------------------------------
Summary: Pattern token filter which emits a token for every
capturing group
Key: LUCENE-4766
URL: https://issues.apache.org/jira/browse/LUCENE-4766
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Priority: Minor
Fix For: 4.2
The PatternTokenizer either splits on matches or lets you specify a single
capture group. This is insufficient for my needs: quite often I want to capture
multiple overlapping tokens at the same position.
I've written a pattern token filter which accepts multiple patterns and emits
tokens for every capturing group that is matched in any pattern.
Patterns are not anchored to the beginning and end of the string, so each
pattern can produce multiple matches.
For instance, a pattern like "(([a-z]+)(\d*))", when matched against
"abc123def456", would produce the tokens:
abc123, abc, 123, def456, def, 456
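
The behaviour described above can be approximated with plain java.util.regex,
outside of Lucene. This is a minimal sketch, not the code attached to this
issue: the class CaptureGroupSketch and the helper captureGroups are
hypothetical names, and skipping empty groups is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names are hypothetical, not from the patch.
class CaptureGroupSketch {

    // Returns the text of every non-empty capturing group, for every
    // unanchored match of the pattern in the input.
    static List<String> captureGroups(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {                        // find() scans forward, so one pattern can match many times
            for (int g = 1; g <= m.groupCount(); g++) {
                String text = m.group(g);         // null if the group did not participate in the match
                if (text != null && !text.isEmpty()) {
                    tokens.add(text);             // assumption: empty captures are not emitted
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(([a-z]+)(\\d*))");
        System.out.println(captureGroups(p, "abc123def456"));
        // prints [abc123, abc, 123, def456, def, 456]
    }
}
```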
Multiple patterns can be applied. For example, these patterns could be used for
camelCase analysis:
"([A-Z]{2,})",
"(?<![A-Z])([A-Z][a-z]+)",
"(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
"([0-9]+)"
When matched against the string "letsPartyLIKEits1999_dude", they would produce
the tokens:
lets, Party, LIKE, its, 1999, dude
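
The camelCase example can be reproduced with a small standalone sketch that
applies each pattern in turn and collects every captured group. The names
MultiPatternSketch, Capture and captureAll are hypothetical, and ordering the
output by start offset is an assumption made here so that the result follows
the order of the text; it is not necessarily how the attached filter orders
its tokens.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names are hypothetical, not from the patch.
class MultiPatternSketch {

    // A captured group together with its start offset in the input.
    static final class Capture {
        final int start;
        final String text;
        Capture(int start, String text) { this.start = start; this.text = text; }
    }

    // Applies every pattern to the input and returns all non-empty
    // captured groups, ordered by where they start (assumption).
    static List<String> captureAll(List<Pattern> patterns, String input) {
        List<Capture> captures = new ArrayList<>();
        for (Pattern pattern : patterns) {
            Matcher m = pattern.matcher(input);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    // start(g) is -1 when the group did not participate
                    if (m.start(g) >= 0 && m.end(g) > m.start(g)) {
                        captures.add(new Capture(m.start(g), m.group(g)));
                    }
                }
            }
        }
        // List.sort is stable, so captures with equal offsets keep discovery order.
        captures.sort(Comparator.comparingInt((Capture c) -> c.start));
        List<String> tokens = new ArrayList<>();
        for (Capture c : captures) {
            tokens.add(c.text);
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = List.of(
            Pattern.compile("([A-Z]{2,})"),
            Pattern.compile("(?<![A-Z])([A-Z][a-z]+)"),
            Pattern.compile("(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)"),
            Pattern.compile("([0-9]+)"));
        System.out.println(captureAll(patterns, "letsPartyLIKEits1999_dude"));
        // prints [lets, Party, LIKE, its, 1999, dude]
    }
}
```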
If no token is emitted, the original token is preserved.
If the preserveOriginal flag is true, the filter also outputs the full original
token (i.e. "letsPartyLIKEits1999_dude") in addition to any matching tokens. In
this case, if a matching token is identical to the original, only one copy of
the full token is emitted.
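
The fallback and preserveOriginal rules can be summarised in a few lines. This
is only a sketch of the rules as stated in this description: the class
PreserveOriginalSketch and the method emit are hypothetical, and placing the
original token before the captured tokens is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical, not from the patch.
class PreserveOriginalSketch {

    // Combines the original token with the tokens captured from it.
    static List<String> emit(String original, List<String> captured, boolean preserveOriginal) {
        List<String> out = new ArrayList<>();
        // The original is kept when requested, or when nothing was captured.
        boolean originalEmitted = preserveOriginal || captured.isEmpty();
        if (originalEmitted) {
            out.add(original);                    // assumption: original comes first
        }
        for (String token : captured) {
            if (originalEmitted && token.equals(original)) {
                continue;                         // avoid a second copy of the full token
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(emit("abc123", List.of("abc123", "abc", "123"), true));
        // prints [abc123, abc, 123]
        System.out.println(emit("!!!", List.of(), false));
        // prints [!!!]
    }
}
```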
Multiple patterns are required to allow overlapping captures, but they also
mean that each pattern is less dense and easier to understand.
This is my first Java code, so apologies if I'm doing something stupid.