Our business needs to allow multiple values for a single field. We have an index of employers, and an employer often has several names that people use for it. For example, the company "Wal-mart" is referred to as:
1) Wal-mart
2) Wal-mart Stores
3) Walmart

I would like a search for any of these 3 terms to match the Wal-mart employer. I've tried two different approaches for this.

Approach 1: Create multiple values for the same field. So the document has these three fields:

1) name=Wal-mart
2) name=Wal-mart Stores
3) name=Walmart

The problem with this is that Lucene seems to treat the 3 different fields as one long field of "Wal-mart Wal-mart Stores Walmart". This is problematic because the term frequency is 2 when a user searches for "Wal-mart".

Approach 2: Create differently named fields for each value, so the document has these 3 fields:

1) name1=Wal-mart
2) name2=Wal-mart Stores
3) name3=Walmart

This fixes the issue above but introduces a different problem. The idf calculation is incorrect because idf is calculated per field. Most employers only have one name, or maybe 2. So the idf for the name3 field ends up being much higher, because there are fewer docs with a given term in the name3 field.

For now, I'm going with approach 2 but overriding IndexReader so that the IndexReader.docFreq(Term t) method always returns the doc frequency from the name1 field, even if the Term t is actually for name2 or name3, etc. But this doesn't feel like a clean solution.

Any suggestions on how to deal with this? Any ideas would be appreciated.

Ryan Aylward
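
A minimal sketch of the idf skew in approach 2 and of the docFreq workaround described above. It is a toy model in plain Java, not the Lucene API and not the actual IndexReader override: the class IdfSketch, the helper canonicalDocFreq, the "field:term" map, and all counts are made up for illustration. The idf formula is the classic one from Lucene's default similarity, 1 + ln(numDocs / (docFreq + 1)); other versions or custom similarities may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model, not the Lucene API: document frequencies live in a plain map
// keyed by "field:term". All names and numbers here are hypothetical.
class IdfSketch {

    // Classic Lucene default-similarity idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Hypothetical stand-in for the overridden docFreq: any field named
    // name1, name2, name3, ... is looked up as if the term were in name1.
    static int canonicalDocFreq(Map<String, Integer> docFreqs, String field, String text) {
        String lookupField = field.matches("name\\d+") ? "name1" : field;
        return docFreqs.getOrDefault(lookupField + ":" + text, 0);
    }

    public static void main(String[] args) {
        int numDocs = 10_000; // made-up index size
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("name1:stores", 200); // made-up: common in the primary name
        docFreqs.put("name3:stores", 3);   // made-up: few docs have a third name

        // Per-field idf: the rare name3 field makes the term look far more selective.
        double perField = idf(docFreqs.get("name3:stores"), numDocs);
        // Workaround: score name3 with the docFreq taken from name1.
        double canonical = idf(canonicalDocFreq(docFreqs, "name3", "stores"), numDocs);

        System.out.println("name3 idf, per-field docFreq: " + perField);
        System.out.println("name3 idf, name1 docFreq:     " + canonical);
    }
}
```

With the per-field count, the same term gets a noticeably higher idf in name3 than in name1, which is the skew described above. Borrowing the name1 count removes it, at the cost of ignoring how often the term really occurs in the alternate-name fields.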