Add the common terms such as "University", "School", "Medicine", "Institute" etc. to stopwords list, so you are left with Stanford, "Palo Alto" etc.
Then use Ahmet's suggestion of using a booleanquery .setMinimumNumberShouldMatch() to (say) 75% of the query string length. Finally, if you wish to be very precise, you can loop through the hits collector and use a string comparison algorithm like Jaro-Winkler, Levenstein etc. for a second-level filter. On Tue, Mar 23, 2010 at 5:08 PM, Aaron Schon <aaron_sc...@yahoo.com> wrote: > hi all, I have been playing with Lucene for a while now, but stuck on a > perplexing issue. > > I have an index, with a field "Affiliation", some example values are: > > - "Stanford University School of Medicine, Palo Alto, CA USA", > - "Institute of Neurobiology, School of Medicine, Stanford University, Palo > Alto, CA", > - "School of Medicine, Harvard University, Boston MA", > - "Brigham & Women's, Harvard University School of Medicine, Boston, MA" > - "Harvard University, Cambridge MA" > > and so on... (the bottom-line being the affiliations are written in multiple > ways with no apparent consistency) > > I query the index on the affiliation field using say "School of Medicine, > Stanford University, Palo Alto, CA" (with QueryParser) to find all Stanford > related documents, I get a lot of false +ves, presumably because of the > presence of School of Medicine etc. etc. (note: I cannot use Phrase query > because of variability in the way affiliation is constructed) > > I have tried the following: > > 1. Use a SpanNearQuery by splitting the search phrase with a whitespace (here > I get no results!) > 2. Tried boosting (using ^) by splitting with the comma and boosting the last > parts such as "Palo Alto CA" with a much higher boost than the initial > phrases. Here I still get lots of false +ves. > > Any suggestions on how to approach this? Is SpanNear the way to go? Any other > ideas on why I get 0 results? > > Thanks in advance for helping a newbie. > > AS > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org