Thanks for the responses chaps, very informative, and most appreciated :-)
________________________________
From: Ian Lea <[email protected]>
To: [email protected]
Sent: Wednesday, October 24, 2012 4:19 PM
Subject: Re: StandardAnalyzer functionality change

If you want email addresses, UAX29URLEmailAnalyzer is another alternative.

--
Ian.


On Wed, Oct 24, 2012 at 3:56 PM, Jack Krupansky <[email protected]> wrote:
> Yes, by design. StandardAnalyzer implements "simple word boundaries" (the
> technical term is "Unicode text segmentation"), period. As the javadoc says,
> "As of Lucene version 3.1, this class implements the Word Break rules from
> the Unicode Text Segmentation algorithm, as specified in Unicode Standard
> Annex #29." That is a "standard".
>
> See:
> http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
> http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: kiwi clive
> Sent: Wednesday, October 24, 2012 6:42 AM
> To: [email protected]
> Subject: StandardAnalyzer functionality change
>
>
> Hi all,
>
> Sorry if I'm asking an age-old question, but we have migrated to Lucene 3.6.0
> and I see StandardAnalyzer has changed its behaviour, particularly when
> tokenizing email addresses. From reading the forums, I understand
> StandardAnalyzer was renamed to ClassicAnalyzer - is this the case?
>
>
> If I pass the string '[email protected]' through these analyzers, I get the
> following tokens:
>
> Using StandardAnalyzer(Version.LUCENE_23): --> [email protected] (one token)
>
> Using StandardAnalyzer(Version.LUCENE_36): --> user domain.com (two tokens)
> Using ClassicAnalyzer(Version.LUCENE_36): --> [email protected] (one token)
>
> StandardAnalyzer is normally a good compromise as a default analyzer, but the
> failure to keep an email address intact makes it less fit for purpose than
> it used to be. Is this a bug or is it by design?
> If by design, what is the reason for the change, and is ClassicAnalyzer now
> the de facto analyzer to use?
>
> Thanks,
> Clive
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
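
To illustrate the behaviour discussed in this thread, here is a rough sketch in plain Java that does not use Lucene at all. It only approximates the two tokenization styles with regular expressions: a word-break style that splits at '@' but keeps "example.com" together, and an email-aware style that keeps the whole address as one token. The class name TokenizerSketch, the pattern names, and the address user@example.com are made up for the illustration. The regexes are a simplification, not the actual UAX #29 rules or Lucene's tokenizer grammar (for instance, the real rules treat letters and digits differently around '.').

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: approximates the two behaviours with regexes.
// This is NOT Lucene code and not a full UAX #29 implementation.
class TokenizerSketch {
    // Runs of letters/digits; '.' or an apostrophe may join two such runs.
    // '@' is never part of a token, so an address is split there.
    private static final String WORD =
        "[\\p{L}\\p{N}]+(?:[.'][\\p{L}\\p{N}]+)*";

    // Simplified email shape: local part, '@', then a dotted domain.
    private static final String EMAIL =
        "[\\p{L}\\p{N}._+-]+@[\\p{L}\\p{N}-]+(?:\\.[\\p{L}\\p{N}-]+)+";

    // Word-break style: roughly what the thread reports for LUCENE_36.
    static final Pattern WORD_BREAK = Pattern.compile(WORD);

    // Email-aware style: try the email shape first, then fall back to words.
    static final Pattern EMAIL_AWARE = Pattern.compile(EMAIL + "|" + WORD);

    // Collects every match of the pattern, lowercased, in order of appearance.
    static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Mail user@example.com today";
        // prints [mail, user, example.com, today]
        System.out.println(tokenize(WORD_BREAK, text));
        // prints [mail, user@example.com, today]
        System.out.println(tokenize(EMAIL_AWARE, text));
    }
}
```

With real Lucene, the choice is between the analyzers named in the thread (StandardAnalyzer, ClassicAnalyzer, UAX29URLEmailAnalyzer) rather than hand-written patterns. The sketch is only meant to show why '@' ends a token under plain word-boundary rules while the dotted domain stays in one piece.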
