Hi Martin,

StandardTokenizer and StandardAnalyzer have been changed, as of the upcoming version 3.1 (the next release), to support the Unicode text segmentation rules in UAX#29. My (untested) guess is that your hyphenated word will be kept as a single token if you set the version to 3.1 or higher in the constructor.
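
In case the guess above does not pan out, here is a minimal plain-Java sketch of the counting you describe that keeps hyphen-joined words together. It does not use Lucene at all: the class name HyphenAwareTermCounter, the method countTerms, and the regular expression are my own assumptions for illustration, not anything StandardTokenizer does internally.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: counts lowercased terms and
// treats hyphen-joined runs of letters/digits as one term.
class HyphenAwareTermCounter {

    // A run of letters/digits, optionally followed by more runs that are
    // each joined by a single hyphen (e.g. "Lawton-Browne").
    private static final Pattern TERM =
            Pattern.compile("[\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)*");

    static Map<String, Integer> countTerms(String text) {
        // TreeMap keeps the terms sorted so the output is stable.
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = TERM.matcher(text);
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same sample text as in the original question.
        System.out.println(countTerms("lucene Lawton-Browne Lucene"));
        // prints {lawton-browne=1, lucene=2}
    }
}
```

Something along these lines could also be used to pre-check which terms you expect before comparing against what the analyzer actually emits.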

Steve

> -----Original Message-----
> From: Martin O'Shea [mailto:app...@dsl.pipex.com]
> Sent: Sunday, October 24, 2010 3:59 PM
> To: java-user@lucene.apache.org
> Subject: Use of hyphens in StandardAnalyzer
>
> Hello
>
> I have a StandardAnalyzer working which retrieves words and frequencies from
> a single document using a TermVectorMapper which is populating a HashMap.
>
> But if I use the following text as a field in my document, i.e.
>
> addDoc(w, "lucene Lawton-Browne Lucene");
>
> The word frequencies returned in the HashMap are:
>
> browne 1
> lucene 2
> lawton 1
>
> The problem is the words 'lawton' and 'browne'. If this is an actual
> 'double-barreled' name, can Lucene recognise it as 'Lawton-Browne' where the
> name is actually a single word?
>
> I've tried combinations of:
>
> addDoc(w, "lucene \"Lawton-Browne\" Lucene");
>
> And single quotes but without success.
>
> Thanks
>
> Martin O'Shea.