Re: StandardAnalyzer functionality change

Jack Krupansky Wed, 24 Oct 2012 07:57:14 -0700

Yes, by design. StandardAnalyzer implements "simple word boundaries" (the technical term is "Unicode text segmentation"), period. As the javadoc says, "As of Lucene version 3.1, this class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29." That is a "standard".


See:
http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html
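
To make the word-break behaviour concrete, here is a toy sketch of the two UAX #29 rules that matter for an email address: a '.' between two letters (or between two digits) does not break a word, while '@' is not a word character and always breaks. This is not Lucene's implementation, and it ignores most of the real rule set and the lowercasing that StandardAnalyzer applies. The class name WordBreakSketch, the method tokenize, and the sample address user@domain.com are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: a tiny subset of the UAX #29 word-break rules.
// NOT Lucene's StandardTokenizer; names here are hypothetical.
class WordBreakSketch {

    // Splits text into word tokens. Letters and digits form tokens; a '.'
    // is kept only when it sits between two letters or between two digits
    // (roughly rules WB6/WB7 and WB11/WB12). Everything else, including
    // '@', ends the current token and is discarded.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (c == '.' && current.length() > 0 && i + 1 < text.length()
                    && sameKind(current.charAt(current.length() - 1), text.charAt(i + 1))) {
                current.append(c); // '.' inside a word such as "domain.com"
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // boundary, e.g. at '@'
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // True when both characters are letters or both are digits.
    private static boolean sameKind(char a, char b) {
        return (Character.isLetter(a) && Character.isLetter(b))
                || (Character.isDigit(a) && Character.isDigit(b));
    }

    public static void main(String[] args) {
        // Hypothetical sample address; prints the resulting token list.
        System.out.println(tokenize("user@domain.com"));
    }
}
```

With these two rules alone, the sample address splits at '@' but not at '.', which is the same two-token result reported for StandardAnalyzer(Version.LUCENE_36) in the original message below. If email addresses must stay whole, ClassicAnalyzer keeps the old behaviour, and the standard package javadoc also lists UAX29URLEmailTokenizer, which may be worth a look.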


-- Jack Krupansky

-----Original Message-----
From: kiwi clive

Sent: Wednesday, October 24, 2012 6:42 AM
To: [email protected]
Subject: StandardAnalyzer functionality change

Hi all,

Sorry if I'm asking an age-old question, but we have migrated to Lucene 3.6.0 and I see StandardAnalyzer has changed its behaviour, particularly when tokenizing email addresses. From reading the forums, I understand the old StandardAnalyzer was renamed to ClassicAnalyzer. Is this the case?

If I pass the string '[email protected]' through these analyzers, I get the following tokens:


Using StandardAnalyzer(Version.LUCENE_23): --> [email protected] (one token)
Using StandardAnalyzer(Version.LUCENE_36): --> user domain.com (two tokens)
Using ClassicAnalyzer(Version.LUCENE_36): --> [email protected] (one token)

StandardAnalyzer is normally a good compromise as a default analyzer, but the failure to keep an email address intact makes it less fit for purpose than it used to be. Is this a bug or is it by design? If by design, what is the reason for the change, and is ClassicAnalyzer now the de facto analyzer to use?


Thanks,

Clive


