[jira] Commented: (LUCENE-1438) StandardTokenizer splits host names with hyphens into multiple tokens

Shai Erera (JIRA) Tue, 16 Dec 2008 12:59:13 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657161#action_12657161 ]


Shai Erera commented on LUCENE-1438:
------------------------------------

These two cases are not so simple to tackle. Each is the result of several 
rules interacting.

1-800-flowers.com is split that way because of the NUM rule and the HOST rule. 
The HOST rule requires the token to start with some alphanumeric characters, 
followed by one or more ("." {ALPHANUM}) strings. The hyphens break that 
pattern, so this string is not detected as a HOST.
If we were to change the rule to recognize that string as a HOST, then we'd be 
wrong for strings like "file.pdf", which is clearly not a host. So I don't 
see how we can satisfy everyone.
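
To make the HOST rule described above concrete, here is a rough sketch that 
approximates it with java.util.regex. It is not the StandardTokenizer grammar: 
the class and method names are hypothetical, {ALPHANUM} is assumed to be ASCII 
letters and digits only, and the NUM rule and the tokenizer's rule priorities 
are not modelled at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HostRuleSketch {
    // Assumption: ASCII-only stand-in for {ALPHANUM}; the real grammar is broader.
    private static final String ALPHANUM = "[A-Za-z0-9]+";

    // HOST as described in the comment: ALPHANUM followed by one or more ("." ALPHANUM).
    static final Pattern HOST = Pattern.compile(ALPHANUM + "(?:\\." + ALPHANUM + ")+");

    // True only if the whole string fits the HOST shape.
    static boolean isSingleHost(String text) {
        return HOST.matcher(text).matches();
    }

    // Collects every non-overlapping HOST-shaped run found in the text, left to right.
    static List<String> hostLikeRuns(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = HOST.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        String[] samples = {"www.m-w.com", "1-800-flowers.com", "www.1-800-flowers.com"};
        for (String s : samples) {
            System.out.println(s + " -> single HOST? " + isSingleHost(s)
                    + ", HOST-shaped runs: " + hostLikeRuns(s));
        }
    }
}
```

Under these assumptions, none of the three strings matches as a whole, and 
"www.m-w.com" yields the two runs "www.m" and "w.com", which lines up with the 
split reported in the issue.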

The string www.1-800-flowers.com is split because "-" is not included in the 
HOST definition.

But, as was already stated in other threads about StandardTokenizer, it is 
just a default tokenizer that ships with Lucene. It is not meant to be *THE* 
tokenizer, and what makes sense for one user will not fit another.

Personally, I think that if you want to correctly identify hosts (or emails, or 
any other pattern), you should use a specially written annotator (it would be 
interesting to see a contribution, if there isn't one yet, that integrates 
UIMA with Analyzer), rather than rely on some rule that we can argue 
about for ages. Or ... simply copy most of StandardTokenizer's grammar and 
change the HOST rule. I believe that will give better results than trying 
to change the HOST definition now, since it depends on other definitions, like 
{NUM}, which can sometimes override other definitions.
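
As a hedged illustration of what a hyphen-tolerant HOST shape could look like 
in a private copy of the rules, the sketch below again uses java.util.regex 
rather than the actual grammar file. The class name, the pattern, and the 
ASCII-only label definition are all assumptions, and the interaction with 
{NUM} is deliberately left out.

```java
import java.util.regex.Pattern;

public class HyphenHostSketch {
    // Assumption: a label is ASCII alphanumeric runs joined by single hyphens,
    // so it cannot start or end with "-" and cannot contain "--".
    private static final String LABEL = "[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*";

    // Two or more labels separated by dots.
    static final Pattern HYPHEN_HOST = Pattern.compile(LABEL + "(?:\\." + LABEL + ")+");

    // True only if the whole string fits the hyphen-tolerant shape.
    static boolean looksLikeHost(String text) {
        return HYPHEN_HOST.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"www.m-w.com", "1-800-flowers.com", "www.1-800-flowers.com", "file.pdf"};
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeHost(s));
        }
    }
}
```

This pattern accepts all three host names from the discussion, but it also 
accepts "file.pdf", so a shape-based rule alone cannot tell a host from a file 
name.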

> StandardTokenizer splits host names with hyphens into multiple tokens
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-1438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1438
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Robert Newson
>
> StandardTokenizer does not recognize host names with hyphens as a single HOST 
> token. Specifically "www.m-w.com" is tokenized as "www.m" and "w.com", both 
> of "<HOST>" type.
> StandardTokenizer should instead output a single HOST token for 
> "www.m-w.com", since hyphens are a legitimate character in DNS host names.
> We've a local fix to the grammar file which also required us to significantly 
> simplify the NUM type to get the behavior we needed for host names.
> Here's a JUnit test for the desired behavior:
>       public void testWithHyphens() throws Exception {
>               final String host = "www.m-w.com";
>               final StandardTokenizer tokenizer = new StandardTokenizer(
>                               new StringReader(host));
>               final Token token = new Token();
>               tokenizer.next(token);
>               assertEquals("<HOST>", token.type());
>               assertEquals("www.m-w.com", token.term());
>       }
