Re: StandardAnalyzer functionality change

Jack Krupansky Wed, 24 Oct 2012 13:06:40 -0700

s/work break/word break/

-- Jack Krupansky

-----Original Message----- From: Jack Krupansky
Sent: Wednesday, October 24, 2012 3:52 PM
To: [email protected] ; kiwi clive
Subject: Re: StandardAnalyzer functionality change

I didn't explicitly say it, but ClassicAnalyzer does do exactly what you
want it to do - work break plus email and URL, or StandardAnalyzer plus
email and URL.

-- Jack Krupansky

-----Original Message----- From: kiwi clive
Sent: Wednesday, October 24, 2012 1:27 PM
To: [email protected]
Subject: Re: StandardAnalyzer functionality change

Thanks for the responses chaps, very informative, and most appreciated :-)

________________________________
From: Ian Lea <[email protected]>
To: [email protected]
Sent: Wednesday, October 24, 2012 4:19 PM
Subject: Re: StandardAnalyzer functionality change

If you want email addresses, UAX29URLEmailAnalyzer is another alternative.


--
Ian.


On Wed, Oct 24, 2012 at 3:56 PM, Jack Krupansky <[email protected]>
wrote:

Yes, by design. StandardAnalyzer implements "simple word boundaries" (the
technical term is "Unicode text segmentation"), period. As the javadoc says,
"As of Lucene version 3.1, this class implements the Word Break rules from
the Unicode Text Segmentation algorithm, as specified in Unicode Standard
Annex #29." That is a "standard".

See:
http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html
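
To make the difference concrete, here is a rough, standard-library-only Java
sketch. It does not use Lucene and is not how StandardTokenizer or
ClassicTokenizer are implemented; the class name, the two regular expressions,
and the sample address user@domain.com are all assumptions made for
illustration. It only mimics the two behaviours discussed in this thread:
breaking at '@' under word-break-style rules versus keeping an email-shaped
string as one token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy approximation only: NOT Lucene code and NOT a full UAX #29 implementation.
class TokenizerSketch {
    // Runs of letters/digits, optionally joined by a single '.' or '\'' between runs.
    // '@' is never part of a word here, so it always causes a break.
    private static final String WORD = "[\\p{L}\\p{N}]+(?:[.'][\\p{L}\\p{N}]+)*";

    // Very loose email shape (hypothetical): local part, '@', dotted domain.
    private static final String EMAIL =
            "[\\p{L}\\p{N}._+-]+@[\\p{L}\\p{N}-]+(?:\\.[\\p{L}\\p{N}-]+)+";

    private static final Pattern WORD_ONLY = Pattern.compile(WORD);
    // Email alternative is tried first, so an address wins over its parts.
    private static final Pattern EMAIL_OR_WORD = Pattern.compile(EMAIL + "|" + WORD);

    private static List<String> tokens(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group().toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Word-break-style behaviour: splits the address at '@'.
    static List<String> wordBreakTokens(String text) {
        return tokens(WORD_ONLY, text);
    }

    // Email-aware behaviour: keeps an email-shaped string as a single token.
    static List<String> emailAwareTokens(String text) {
        return tokens(EMAIL_OR_WORD, text);
    }

    public static void main(String[] args) {
        String text = "mail user@domain.com today";
        System.out.println(wordBreakTokens(text));  // [mail, user, domain.com, today]
        System.out.println(emailAwareTokens(text)); // [mail, user@domain.com, today]
    }
}
```

With real Lucene analyzers the tokens would come from the analyzer's
TokenStream rather than from regular expressions, and stop-word filtering
would also apply.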

-- Jack Krupansky

-----Original Message----- From: kiwi clive
Sent: Wednesday, October 24, 2012 6:42 AM
To: [email protected]
Subject: StandardAnalyzer functionality change


Hi all,

Sorry if I'm asking an age old question but we have migrated to Lucene 3.6.0
and I see StandardAnalyzer has changed its behaviour, particularly when
tokenizing email addresses. From reading the forums, I understand
StandardAnalyzer was renamed to ClassicAnalyzer - is this the case?

If I pass the string '[email protected]' through these analyzers, I get the
following tokens:

Using StandardAnalyzer(Version.LUCENE_23):  -->  [email protected]  (one token)
Using StandardAnalyzer(Version.LUCENE_36):  -->  user domain.com    (two tokens)
Using ClassicAnalyzer(Version.LUCENE_36):   -->  [email protected]  (one token)

StandardAnalyzer is normally a good compromise as a default analyzer but the
failure to keep an email address intact makes it less fit for purpose than
it used to be. Is this a bug or is it by design? If by design, what is the
reason for the change and is ClassicAnalyzer now the de facto analyzer to
use?

Thanks,
Clive

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

