Re: Incorrect tokenizing in the UAX29URLEmailAnalyzer analyzer?

Milind Wed, 23 Jul 2014 15:01:20 -0700

Thanks Steve, that helped.  I had forgotten about the URL part of the
analyzer, since I was using it for the email field.  I need to see if it's
possible to use different analyzers for different fields.  If so, I'll use
the UAX29URLEmailAnalyzer only for the email field and the
StandardAnalyzer for everything else.  I'm not sure that would work,
though, since I'm using the MultiFieldQueryParser, which takes a single
Analyzer.



On Wed, Jul 23, 2014 at 3:29 PM, Steve Rowe <[email protected]> wrote:

> Hi Milind,
>
> On Jul 23, 2014, at 1:49 PM, Milind <[email protected]> wrote:
>
> > The UAX29URLEmailAnalyzer analyzer in Lucene 4.4 is not working as I
> > expected.  Is this a bug in the analyzer or is this working as designed?
> >
> > If I use the UAX29URLEmailAnalyzer, it tokenizes the following strings as
> >    input=bwl-esl2.gbr.hp.com
> >    output=[bwl-esl2.gbr.hp.com]
>
> This is the correct tokenization of a valid domain name with token type
> <URL>: the hyphen ('-') is an allowed character in DNS names.  From RFC
> 1035, "Domain Names - Implementation and Specification" <
> http://www.ietf.org/rfc/rfc1035.txt>:
>
>     <domain> ::= <subdomain> | " "
>     <subdomain> ::= <label> | <subdomain> "." <label>
>     <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
>     <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
>     <let-dig-hyp> ::= <let-dig> | "-"
>     <let-dig> ::= <letter> | <digit>
>
>     <letter> ::= any one of the 52 alphabetic characters A through Z in
>     upper case and a through z in lower case
>
>     <digit> ::= any one of the ten digits 0 through 9
>
>     Note that while upper and lower case letters are allowed in domain
>     names, no significance is attached to the case.  That is, two names
>     with the same spelling but different case are to be treated as if
>     identical.
>
>     The labels must follow the rules for ARPANET host names.  They must
>     start with a letter, end with a letter or digit, and have as interior
>     characters only letters, digits, and hyphen.  There are also some
>     restrictions on the length.  Labels must be 63 characters or less.
>
> From the Lucene grammar UAX29URLEmailTokenizerImpl.jflex:
>
>     DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
>     DomainNameStrict = {DomainLabel} ("." {DomainLabel})* {ASCIITLD}
>     URIhostStrict = ("[" {IPv6Address} "]") | {IPv4Address} | {DomainNameStrict}
>     […]
>     {URIhostStrict} / [^-\w] { yybegin(YYINITIAL); return URL_TYPE; }
>
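
The DomainLabel rule quoted above can be tried out without Lucene. The following is only a rough sketch: it translates that one JFlex rule into a java.util.regex pattern and adds the 63-character limit from the RFC 1035 text. The class and method names (DomainLabelCheck, isValidLabel, allLabelsValid) are made up for illustration, and the sketch does not check the {ASCIITLD} part of DomainNameStrict or reproduce the tokenizer's actual behavior.

```java
import java.util.regex.Pattern;

public class DomainLabelCheck {
    // Java regex translation of the JFlex rule quoted above:
    //   DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
    private static final Pattern DOMAIN_LABEL =
        Pattern.compile("[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?");

    // Taken from the RFC 1035 text: labels must be 63 characters or less.
    // (This limit is an addition here; it is not part of the quoted JFlex rule.)
    private static final int MAX_LABEL_LENGTH = 63;

    // True if a single label matches the pattern and respects the length limit.
    public static boolean isValidLabel(String label) {
        return label.length() <= MAX_LABEL_LENGTH
            && DOMAIN_LABEL.matcher(label).matches();
    }

    // True if every dot-separated label of the host is valid.
    // The -1 limit keeps empty strings, so "a..b" and "a." are rejected.
    public static boolean allLabelsValid(String host) {
        for (String label : host.split("\\.", -1)) {
            if (!isValidLabel(label)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allLabelsValid("bwl-esl2.gbr.hp.com")); // interior hyphen
        System.out.println(allLabelsValid("-esl2.gbr"));           // leading hyphen
    }
}
```

Under this pattern a hyphen is accepted only in the interior of a label, which is why bwl-esl2 passes as a label while a name starting or ending with a hyphen does not.
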
> >    input=esl2.gbr
> >    output=[esl2.gb][r]
>
> This is a bug, which was fixed in Lucene 4.7 - see <
> https://issues.apache.org/jira/browse/LUCENE-5391>
>
> Steve


-- 
Regards
Milind
