Junte Zhang created LUCENE-8278:
-----------------------------------
Summary: UAX29URLEmailTokenizer is not detecting some tokens as URL tag
Key: LUCENE-8278
URL: https://issues.apache.org/jira/browse/LUCENE-8278
Project: Lucene - Core
Issue Type: Bug
Reporter: Junte Zhang
We use the UAX29URLEmailTokenizer so that our plugins can rely on its token
types.
However, I noticed that the tokenizer tags certain URLs as <ALPHANUM> instead
of <URL>.
Examples that are not detected as URLs:
* example.com is <ALPHANUM>
* example.net is <ALPHANUM>
But with a scheme prefix they are detected:
* https://example.com is <URL>
* https://example.net is <URL>
Examples without a scheme that work:
* example.ch is <URL>
* example.co.uk is <URL>
* example.nl is <URL>
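To reproduce the examples above, a minimal sketch along the following lines should print each token with its type. It assumes lucene-analyzers-common 6.x or 7.x on the classpath, where the tokenizer lives in org.apache.lucene.analysis.standard. The class name TokenTypeDump and the helper tokenTypes are made up for illustration and are not part of Lucene.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical helper class, not part of Lucene.
public class TokenTypeDump {

    // Returns one "term => type" string per token the tokenizer emits for the input.
    static List<String> tokenTypes(String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer()) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                out.add(term.toString() + " => " + type.type());
            }
            tokenizer.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // The inputs from this report; the output depends on the Lucene version in use.
        String[] inputs = {
            "example.com", "example.net",
            "https://example.com", "https://example.net",
            "example.ch", "example.co.uk", "example.nl"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + tokenTypes(input));
        }
    }
}
```

Each input is tokenized on its own, so the token sits at the very end of the input. It may be worth checking whether the result changes when the same token is followed by more text.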
I have searched JIRA and could not find an existing issue for this. I have
tested this on Lucene (Solr) 6.4.1 and 7.3.
Could someone confirm my findings and advise what I could do to (help) resolve
this issue?