Brilliant. Thanks!
On Wed, Jul 23, 2014 at 6:12 PM, Steve Rowe <sar...@gmail.com> wrote:

> See PerFieldAnalyzerWrapper, which is itself an Analyzer:
> <http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html>
>
> Steve
>
> On Jul 23, 2014, at 6:00 PM, Milind <mili...@gmail.com> wrote:
>
> > Thanks Steve, that helped. I had forgotten about the URL part of the
> > Analyzer since I was using it for the email field. I need to see if it's
> > possible to use different analyzers for different fields. If so, then I'll
> > use the UAX29URLEmailAnalyzer only for the email field and use
> > StandardAnalyzer for everything else. I'm not sure if that would work
> > though. Since I'm using the MultiFieldQueryParser and that takes in a
> > single Analyzer.
> >
> > On Wed, Jul 23, 2014 at 3:29 PM, Steve Rowe <sar...@gmail.com> wrote:
> >
> >> Hi Milind,
> >>
> >> On Jul 23, 2014, at 1:49 PM, Milind <mili...@gmail.com> wrote:
> >>
> >>> The UAX29URLEmailAnalyzer analyzer in Lucene 4.4 is not working as I
> >>> expected. Is this a bug in the analyzer or is this working as designed?
> >>>
> >>> If I use the UAX29URLEmailAnalyzer, it tokenizes the following strings as
> >>> input=bwl-esl2.gbr.hp.com
> >>> output=[bwl-esl2.gbr.hp.com]
> >>
> >> This is the correct tokenization of a valid domain name with token type
> >> <URL>: the hyphen ('-') is an allowed character in DNS names. From RFC
> >> 1035 Domain Implementation and Specification
> >> <http://www.ietf.org/rfc/rfc1035.txt>:
> >>
> >>   <domain> ::= <subdomain> | " "
> >>   <subdomain> ::= <label> | <subdomain> "." <label>
> >>   <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
> >>   <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
> >>   <let-dig-hyp> ::= <let-dig> | "-"
> >>   <let-dig> ::= <letter> | <digit>
> >>
> >>   <letter> ::= any one of the 52 alphabetic characters A through Z in
> >>   upper case and a through z in lower case
> >>
> >>   <digit> ::= any one of the ten digits 0 through 9
> >>
> >>   Note that while upper and lower case letters are allowed in domain
> >>   names, no significance is attached to the case. That is, two names
> >>   with the same spelling but different case are to be treated as if
> >>   identical.
> >>
> >>   The labels must follow the rules for ARPANET host names. They must
> >>   start with a letter, end with a letter or digit, and have as interior
> >>   characters only letters, digits, and hyphen. There are also some
> >>   restrictions on the length. Labels must be 63 characters or less.
> >>
> >> From the Lucene grammar UAX29URLEmailTokenizerImpl.jflex:
> >>
> >>   DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
> >>   DomainNameStrict = {DomainLabel} ("." {DomainLabel})* {ASCIITLD}
> >>   URIhostStrict = ("[" {IPv6Address} "]") | {IPv4Address} | {DomainNameStrict}
> >>   […]
> >>   {URIhostStrict} / [^-\w] { yybegin(YYINITIAL); return URL_TYPE; }
> >>
> >>> input=esl2.gbr
> >>> output=[esl2.gb][r]
> >>
> >> This is a bug, which was fixed in Lucene 4.7 - see
> >> <https://issues.apache.org/jira/browse/LUCENE-5391>
> >>
> >> Steve
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> > --
> > Regards
> > Milind

--
Regards
Milind
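
For anyone who wants to check a hostname by hand against the DomainLabel rule Steve quoted, here is a rough sketch in plain Java (no Lucene on the classpath). The class name DomainLabelCheck and both method names are made up for illustration and are not part of Lucene. It only tests each dot-separated label; it does not reproduce the {ASCIITLD} check or the trailing-context rule from the jflex grammar, so it is not a stand-in for the tokenizer. The 63-character cap is taken from the RFC text quoted above; the thread does not say whether the tokenizer enforces it.

```java
import java.util.regex.Pattern;

public class DomainLabelCheck {
    // Java regex transcription of the DomainLabel rule quoted in the thread:
    //   DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
    private static final Pattern DOMAIN_LABEL =
        Pattern.compile("[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?");

    // Assumption: apply the RFC 1035 limit of 63 characters per label.
    private static final int MAX_LABEL_LENGTH = 63;

    // True if a single label matches the pattern and fits the length limit.
    public static boolean isValidLabel(String label) {
        return label.length() <= MAX_LABEL_LENGTH
            && DOMAIN_LABEL.matcher(label).matches();
    }

    // True if every dot-separated label in the host is valid.
    public static boolean allLabelsValid(String host) {
        // A limit of -1 keeps trailing empty strings, so "a." is rejected.
        for (String label : host.split("\\.", -1)) {
            if (!isValidLabel(label)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] hosts = {"bwl-esl2.gbr.hp.com", "-esl2.gbr.hp.com", "esl2-.gbr"};
        for (String host : hosts) {
            System.out.println(host + " -> " + allLabelsValid(host));
        }
    }
}
```

An interior hyphen as in bwl-esl2 passes, while a label that starts or ends with a hyphen fails, which matches Steve's explanation of why bwl-esl2.gbr.hp.com comes out as a single token.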