>> input=esl2.gbr
>> output=[esl2.gb][r]
>
> This is a bug, which was fixed in Lucene 4.7 - see
> <https://issues.apache.org/jira/browse/LUCENE-5391>

BTW, I changed the POM dependency to 4.7.1, but I'm still seeing the same
output. I can't go beyond 4.7, since from 4.8 onwards Lucene seems to be
compiled against Java 7 and I'm still on Java 6. Hopefully, this will be a
non-issue with PerFieldAnalyzerWrapper. But I just wanted to point that out.


On Wed, Jul 23, 2014 at 7:34 PM, Milind <mili...@gmail.com> wrote:

> Brilliant. Thanks!
>
>
> On Wed, Jul 23, 2014 at 6:12 PM, Steve Rowe <sar...@gmail.com> wrote:
>
>> See PerFieldAnalyzerWrapper, which is itself an Analyzer:
>> <http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html>
>>
>> Steve
>>
>> On Jul 23, 2014, at 6:00 PM, Milind <mili...@gmail.com> wrote:
>>
>> > Thanks Steve, that helped. I had forgotten about the URL part of the
>> > Analyzer since I was using it for the email field. I need to see if it's
>> > possible to use different analyzers for different fields. If so, then I'll
>> > use the UAX29URLEmailAnalyzer only for the email field and use
>> > StandardAnalyzer for everything else. I'm not sure if that would work
>> > though, since I'm using the MultiFieldQueryParser and that takes in a
>> > single Analyzer.
>> >
>> >
>> > On Wed, Jul 23, 2014 at 3:29 PM, Steve Rowe <sar...@gmail.com> wrote:
>> >
>> >> Hi Milind,
>> >>
>> >> On Jul 23, 2014, at 1:49 PM, Milind <mili...@gmail.com> wrote:
>> >>
>> >>> The UAX29URLEmailAnalyzer analyzer in Lucene 4.4 is not working as I
>> >>> expected. Is this a bug in the analyzer or is this working as designed?
>> >>>
>> >>> If I use the UAX29URLEmailAnalyzer, it tokenizes the following strings as
>> >>> input=bwl-esl2.gbr.hp.com
>> >>> output=[bwl-esl2.gbr.hp.com]
>> >>
>> >> This is the correct tokenization of a valid domain name with token type
>> >> <URL>: the hyphen ('-') is an allowed character in DNS names. From RFC
>> >> 1035 Domain Implementation and Specification
>> >> <http://www.ietf.org/rfc/rfc1035.txt>:
>> >>
>> >>   <domain> ::= <subdomain> | " "
>> >>   <subdomain> ::= <label> | <subdomain> "." <label>
>> >>   <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
>> >>   <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
>> >>   <let-dig-hyp> ::= <let-dig> | "-"
>> >>   <let-dig> ::= <letter> | <digit>
>> >>
>> >>   <letter> ::= any one of the 52 alphabetic characters A through Z in
>> >>   upper case and a through z in lower case
>> >>
>> >>   <digit> ::= any one of the ten digits 0 through 9
>> >>
>> >>   Note that while upper and lower case letters are allowed in domain
>> >>   names, no significance is attached to the case. That is, two names
>> >>   with the same spelling but different case are to be treated as if
>> >>   identical.
>> >>
>> >>   The labels must follow the rules for ARPANET host names. They must
>> >>   start with a letter, end with a letter or digit, and have as interior
>> >>   characters only letters, digits, and hyphen. There are also some
>> >>   restrictions on the length. Labels must be 63 characters or less.
>> >>
>> >> From the Lucene grammar UAX29URLEmailTokenizerImpl.jflex:
>> >>
>> >>   DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
>> >>   DomainNameStrict = {DomainLabel} ("." {DomainLabel})* {ASCIITLD}
>> >>   URIhostStrict = ("[" {IPv6Address} "]") | {IPv4Address} | {DomainNameStrict}
>> >>   […]
>> >>   {URIhostStrict} / [^-\w] { yybegin(YYINITIAL); return URL_TYPE; }
>> >>
>> >>> input=esl2.gbr
>> >>> output=[esl2.gb][r]
>> >>
>> >> This is a bug, which was fixed in Lucene 4.7 - see
>> >> <https://issues.apache.org/jira/browse/LUCENE-5391>
>> >>
>> >> Steve
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>
>> >
>> >
>> > --
>> > Regards
>> > Milind
>>
>
>
> --
> Regards
> Milind
>

--
Regards
Milind
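
For anyone reading the two grammars quoted in the thread side by side, here is a minimal sketch (not Lucene code) that transcribes the RFC 1035 <label> rule and the jflex DomainLabel rule into java.util.regex patterns. The class name DomainLabelCheck and its method names are made up for illustration. The sketch only checks individual labels; it does not reproduce the tokenizer's TLD matching or lookahead, so it says nothing about the esl2.gbr case. It is kept Java 6 compatible.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene.
final class DomainLabelCheck {

    // RFC 1035 <label>: starts with a letter, ends with a letter or digit,
    // interior characters are letters, digits, or hyphens.
    private static final Pattern RFC1035_LABEL =
            Pattern.compile("[A-Za-z](?:[-A-Za-z0-9]*[A-Za-z0-9])?");

    // DomainLabel as quoted from UAX29URLEmailTokenizerImpl.jflex:
    // same shape, but the first character may also be a digit.
    private static final Pattern LUCENE_DOMAIN_LABEL =
            Pattern.compile("[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?");

    static boolean isRfc1035Label(String s) {
        // RFC 1035: labels must be 63 characters or less.
        return s.length() <= 63 && RFC1035_LABEL.matcher(s).matches();
    }

    static boolean isLuceneDomainLabel(String s) {
        return LUCENE_DOMAIN_LABEL.matcher(s).matches();
    }

    // True if every dot-separated label of the host matches the DomainLabel rule.
    // An empty label (e.g. from "a..b") fails the check.
    static boolean allLabelsMatchLucene(String host) {
        String[] labels = host.split("\\.", -1);
        for (int i = 0; i < labels.length; i++) {
            if (!isLuceneDomainLabel(labels[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allLabelsMatchLucene("bwl-esl2.gbr.hp.com"));
        System.out.println(isRfc1035Label("bwl-esl2"));
        System.out.println(isLuceneDomainLabel("-esl2"));
    }
}
```

The hyphenated label bwl-esl2 satisfies both rules, which is consistent with bwl-esl2.gbr.hp.com coming back as a single token. A label with a leading or trailing hyphen fails both.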