A good suggestion. But I'm using Lucene 3.0.2 and the constructor for a StandardAnalyzer has Version_30 as its highest value. Do you know when 3.1 is due?
-----Original Message----- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: 24 Oct 2010 21 31 To: java-user@lucene.apache.org Subject: RE: Use of hyphens in StandardAnalyzer Hi Martin, StandardTokenizer and -Analyzer have been changed, as of future version 3.1 (the next release) to support the Unicode segmentation rules in UAX#29. My (untested) guess is that your hyphenated word will be kept as a single token if you set the version to 3.1 or higher in the constructor. Steve > -----Original Message----- > From: Martin O'Shea [mailto:app...@dsl.pipex.com] > Sent: Sunday, October 24, 2010 3:59 PM > To: java-user@lucene.apache.org > Subject: Use of hyphens in StandardAnalyzer > > Hello > > > > I have a StandardAnalyzer working which retrieves words and frequencies > from > a single document using a TermVectorMapper which is populating a HashMap. > > > > But if I use the following text as a field in my document, i.e. > > > > addDoc(w, "lucene Lawton-Browne Lucene"); > > > > The word frequencies returned in the HashMap are: > > > > browne 1 > > lucene 2 > > lawton 1 > > > > The problem is the words 'lawton' and 'browne'. If this is an actual > 'double-barreled' name, can Lucene recognise it as 'Lawton-Browne' where > the > name is actually a single word? > > > > I've tried combinations of: > > > > addDoc(w, "lucene \"Lawton-Browne\" Lucene"); > > > > And single quotes but without success. > > > > Thanks > > > > Martin O'Shea. > > > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org