Hi Martin,

StandardTokenizer and StandardAnalyzer have been changed, as of the upcoming 
3.1 release, to support the Unicode text segmentation rules in UAX#29.  My 
(untested) guess is that your hyphenated word will be kept as a single token if 
you set the version to 3.1 or higher in the constructor.
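
If the version switch turns out not to help, one fallback is to break on 
whitespace only, so the hyphen never acts as a boundary.  Here is a rough 
sketch of that idea in plain Java, with no Lucene classes involved.  
HyphenTermCounter and countTerms are made-up names of mine, and the code only 
mimics what whitespace-only tokenizing plus lowercasing would produce for your 
sample text.  Within Lucene, I'd assume the closest stock analyzer is 
WhitespaceAnalyzer, which does not lowercase.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: it only imitates
// whitespace-only tokenizing followed by lowercasing.
class HyphenTermCounter {

    // Split on runs of whitespace, lowercase each piece, count occurrences.
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields one empty piece; skip it
            }
            String term = raw.toLowerCase(Locale.ROOT);
            Integer seen = counts.get(term);
            counts.put(term, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countTerms("lucene Lawton-Browne Lucene");
        System.out.println(counts.get("lucene"));        // prints 2
        System.out.println(counts.get("lawton-browne")); // prints 1
    }
}
```

The trade-off is that punctuation attached to a word (a trailing comma, say) 
stays attached too, so this only suits text where that is acceptable.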

Steve

> -----Original Message-----
> From: Martin O'Shea [mailto:app...@dsl.pipex.com]
> Sent: Sunday, October 24, 2010 3:59 PM
> To: java-user@lucene.apache.org
> Subject: Use of hyphens in StandardAnalyzer
> 
> Hello
> 
> I have a StandardAnalyzer working which retrieves words and frequencies
> from a single document using a TermVectorMapper which is populating a
> HashMap.
> 
> But if I use the following text as a field in my document, i.e.
> 
> addDoc(w, "lucene Lawton-Browne Lucene");
> 
> The word frequencies returned in the HashMap are:
> 
> browne 1
> lucene 2
> lawton 1
> 
> The problem is the words 'lawton' and 'browne'. If this is an actual
> 'double-barreled' name, can Lucene recognise it as 'Lawton-Browne' where
> the name is actually a single word?
> 
> I've tried combinations of:
> 
> addDoc(w, "lucene \"Lawton-Browne\" Lucene");
> 
> And single quotes but without success.
> 
> Thanks
> 
> Martin O'Shea.
