[
https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174026#comment-13174026
]
Robert Muir commented on LUCENE-3663:
-------------------------------------
I actually think we should not remove tokens that aren't phone numbers.
Sometimes the text may simply contain other things besides phone numbers,
or the phone number detection/normalization may be imperfect, so it's better
not to throw tokens away. Instead, no normalization happens for them, like a
stemmer, which leaves words it cannot stem unchanged.
In general we can also assume the text is unstructured and might contain
other content. (This implies someone has a super-cool tokenizer that doesn't
split up dirty phone numbers, but we should leave that possibility open.)
Then I think the while loop could be removed: if the phone number
normalization succeeds, mark the token's type as phone; otherwise, in the
exception case, output the token unchanged.
Tokens that are not phone numbers can then easily be filtered out separately
with a subclass of FilteringTokenFilter.
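
The control flow suggested above could be sketched roughly as follows. This is only an illustration in plain Java with no Lucene dependency: the Token class, the "phone" and "word" type labels, and the toy normalize() method are hypothetical stand-ins, not the code in the attached PhoneFilter.java and not a real phone number parser.

```java
import java.util.ArrayList;
import java.util.List;

class PhoneNormalizeSketch {
    /** Hypothetical stand-in for a token's term text plus its type. */
    static final class Token {
        final String term;
        final String type;
        Token(String term, String type) { this.term = term; this.type = type; }
    }

    static final String PHONE_TYPE = "phone"; // assumed type label

    /**
     * Toy normalizer (an assumption for illustration): keeps ASCII digits,
     * tolerates spaces, parentheses, dashes, dots and '+', and prepends the
     * default country code when the input has no leading '+'.
     * Throws IllegalArgumentException when the input does not look like a phone number.
     */
    static String normalize(String raw, String defaultCountryCode) {
        boolean hasPlus = raw.trim().startsWith("+");
        StringBuilder digits = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (c >= '0' && c <= '9') {
                digits.append(c);
            } else if (" ()-.+".indexOf(c) < 0) {
                throw new IllegalArgumentException("not a phone number: " + raw);
            }
        }
        if (digits.length() < 7 || digits.length() > 15) {
            throw new IllegalArgumentException("implausible length: " + raw);
        }
        return hasPlus ? "+" + digits : "+" + defaultCountryCode + digits;
    }

    /** One output token per input token: no while loop, nothing is dropped. */
    static List<Token> filter(List<Token> in, String defaultCountryCode) {
        List<Token> out = new ArrayList<>(in.size());
        for (Token t : in) {
            try {
                // Success: emit the normalized form and mark the type as phone.
                out.add(new Token(normalize(t.term, defaultCountryCode), PHONE_TYPE));
            } catch (IllegalArgumentException e) {
                // Exception case: output the token unchanged.
                out.add(t);
            }
        }
        return out;
    }
}
```

In an actual Lucene TokenFilter the same try/catch would presumably sit inside incrementToken(), updating the CharTermAttribute and TypeAttribute of the current token instead of building a list. Because every token is passed through, a separate filter can later keep or drop tokens based on the type.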
> Add a phone number normalization TokenFilter
> --------------------------------------------
>
> Key: LUCENE-3663
> URL: https://issues.apache.org/jira/browse/LUCENE-3663
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Santiago M. Mola
> Priority: Minor
> Attachments: PhoneFilter.java
>
>
> Phone numbers can be found in the wild in an infinite variety of formats
> (e.g. with spaces, parentheses, dashes, with or without country code, with
> letters substituted for digits). So some Lucene applications can benefit
> from phone normalization with a TokenFilter that takes a phone number in any
> format and outputs it in a standard format, using a default country to guess
> the country code if it's not present.