Our business needs to support multiple values for a single field. For 
example, we have an index of employers, and an employer is often referred to 
by several names. The company "Wal-mart", for instance, is referred to as:

1)      Wal-mart

2)      Wal-mart Stores

3)      Walmart

I would like a search for any of these 3 terms to match the Wal-mart employer.

I've tried two different approaches for this.

Approach 1: Create multiple values for the same field. So the document has 
these three fields:

1)      name=Wal-mart

2)      name=Wal-mart Stores

3)      name=Walmart

The problem with this is that Lucene seems to treat the three values as one 
long field, "Wal-mart Wal-mart Stores Walmart". This is problematic because 
the term frequency is 2 when a user searches for "Wal-mart".
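
To make the term-frequency effect concrete, here is a minimal sketch in plain Java with no Lucene calls. It assumes a simplified tokenizer that lowercases and splits on whitespace, which is only a stand-in for whatever analyzer the index actually uses; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class MultiValuedTermFreq {

    // Hypothetical stand-in for an analyzer: lowercase, then split on whitespace.
    public static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Counts a term across all values of one field, as if the values
    // formed a single concatenated token stream.
    public static int termFreq(List<String> fieldValues, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        int freq = 0;
        for (String value : fieldValues) {
            for (String token : tokenize(value)) {
                if (token.equals(needle)) {
                    freq++;
                }
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Wal-mart", "Wal-mart Stores", "Walmart");
        System.out.println(termFreq(names, "Wal-mart")); // prints 2
        System.out.println(termFreq(names, "Walmart"));  // prints 1
    }
}
```

Under these assumptions, an employer with two name variants that both contain "Wal-mart" gets a term frequency of 2, while an employer with the single name "Wal-mart" would get 1.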

Approach 2: Create different named fields for each value so the document has 
these 3 fields:

1)      name1=Wal-mart

2)      name2=Wal-mart Stores

3)      name3=Walmart

This fixes the issue above but introduces a different problem. The idf 
calculation is skewed because idf is calculated per field. Most employers 
have only one name, or maybe two, so the name3 field's idf ends up being much 
higher because there are fewer docs with a given term in the name3 field.
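
A small worked example of the skew, as a sketch in plain Java. It assumes the classic TF-IDF form of idf, 1 + ln(numDocs / (docFreq + 1)); the index size and document frequencies are made-up numbers chosen for illustration, not measurements from my index.

```java
import java.util.Locale;

public class PerFieldIdf {

    // Classic TF-IDF style idf: 1 + ln(numDocs / (docFreq + 1)).
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 10000;      // assumed index size
        int docFreqName1 = 99;    // assumed: docs with the term in name1
        int docFreqName3 = 9;     // assumed: docs with the same term in name3

        // Same term, same index, but a different field gives a different idf.
        System.out.println(String.format(Locale.ROOT, "%.3f", idf(docFreqName1, numDocs))); // prints 5.605
        System.out.println(String.format(Locale.ROOT, "%.3f", idf(docFreqName3, numDocs))); // prints 7.908
    }
}
```

With these numbers, a match in name3 is weighted noticeably more than the same match in name1, purely because few documents have a third name.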

For now, I'm going with approach 2 but overriding IndexReader so that 
IndexReader.docFreq(Term t) always returns the doc frequency from the name1 
field, even if the Term t is actually for name2, name3, etc. But this doesn't 
feel like a clean solution.
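
The idea behind the workaround, reduced to a sketch in plain Java. This is not the Lucene API: the class, its map-backed statistics, and the "name" plus digits naming rule are all hypothetical stand-ins that only illustrate redirecting doc frequency lookups to name1.

```java
import java.util.HashMap;
import java.util.Map;

public class CanonicalDocFreq {

    // Hypothetical stand-in for index statistics: "field:term" -> number of docs.
    private final Map<String, Integer> docFreqs = new HashMap<>();

    public void put(String field, String term, int docFreq) {
        docFreqs.put(field + ":" + term, docFreq);
    }

    // Unmodified per-field lookup; unknown field/term pairs count as 0.
    public int rawDocFreq(String field, String term) {
        return docFreqs.getOrDefault(field + ":" + term, 0);
    }

    // Lookups for name1, name2, name3, ... are all answered from name1.
    public int docFreq(String field, String term) {
        boolean isNameField = field.matches("name\\d+");
        return rawDocFreq(isNameField ? "name1" : field, term);
    }

    public static void main(String[] args) {
        CanonicalDocFreq stats = new CanonicalDocFreq();
        stats.put("name1", "walmart", 40); // assumed numbers
        stats.put("name3", "walmart", 3);

        System.out.println(stats.rawDocFreq("name3", "walmart")); // prints 3
        System.out.println(stats.docFreq("name3", "walmart"));    // prints 40
    }
}
```

One thing the sketch makes visible: a term that occurs only in name2 or name3 and never in name1 is reported with a doc frequency of 0 under this mapping.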

Any suggestions on how to deal with this? Any ideas would be appreciated.

Ryan Aylward
