A good suggestion, but I'm using Lucene 3.0.2, and the highest Version value 
the StandardAnalyzer constructor accepts there is Version.LUCENE_30. Do you 
know when 3.1 is due?
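
As a stopgap on 3.0.x, the behaviour being asked for can be illustrated
without Lucene at all. The sketch below is plain Java, not Lucene's API: the
class name HyphenTokenSketch and the method countTokens are hypothetical names
made up for illustration. It splits on whitespace only and lowercases each
token, so the hyphenated name is counted as one word. In Lucene 3.0.x the
closest analogue would presumably be a custom Analyzer built from
WhitespaceTokenizer and LowerCaseFilter instead of StandardAnalyzer, but that
is an untested assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration only; this does not use or reproduce Lucene's API.
// HyphenTokenSketch and countTokens are hypothetical names.
class HyphenTokenSketch {

    // Splits on whitespace only, lowercases each token, and counts occurrences,
    // so a hyphenated name such as "Lawton-Browne" stays a single token.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> freq = new HashMap<String, Integer>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // an empty input yields one empty string from split
            }
            String token = raw.toLowerCase(Locale.ROOT);
            Integer old = freq.get(token);
            freq.put(token, old == null ? 1 : old + 1);
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = countTokens("lucene Lawton-Browne Lucene");
        // HashMap iteration order is unspecified, so the print order may vary.
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

For the sample text the map holds lucene with a count of 2 and lawton-browne
with a count of 1. Splitting on whitespace alone leaves punctuation such as
quotes or commas attached to the token, so real text would need extra cleanup.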

-----Original Message-----
From: Steven A Rowe [mailto:sar...@syr.edu] 
Sent: 24 Oct 2010 21:31
To: java-user@lucene.apache.org
Subject: RE: Use of hyphens in StandardAnalyzer

Hi Martin,

StandardTokenizer and StandardAnalyzer have been changed, as of the upcoming 
version 3.1 (the next release), to support the Unicode segmentation rules in 
UAX#29.  My (untested) guess is that your hyphenated word will be kept as a 
single token if you set the version to 3.1 or higher in the constructor.

Steve

> -----Original Message-----
> From: Martin O'Shea [mailto:app...@dsl.pipex.com]
> Sent: Sunday, October 24, 2010 3:59 PM
> To: java-user@lucene.apache.org
> Subject: Use of hyphens in StandardAnalyzer
> 
> Hello
> 
> I have a StandardAnalyzer working which retrieves words and frequencies
> from a single document using a TermVectorMapper which is populating a
> HashMap.
> 
> But if I use the following text as a field in my document, i.e.
> 
> addDoc(w, "lucene Lawton-Browne Lucene");
> 
> The word frequencies returned in the HashMap are:
> 
> browne 1
> lucene 2
> lawton 1
> 
> The problem is the words 'lawton' and 'browne'. If this is an actual
> 'double-barreled' name, can Lucene recognise it as 'Lawton-Browne' where
> the name is actually a single word?
> 
> I've tried combinations of:
> 
> addDoc(w, "lucene \"Lawton-Browne\" Lucene");
> 
> And single quotes but without success.
> 
> Thanks
> 
> Martin O'Shea.





