The Apache Commons Codec library includes a Double Metaphone
implementation. I ran a series of experiments around storing the Double
Metaphone representations of strings in the index itself, and searching
with the Double Metaphone version of the search terms when the field
being searched holds Double Metaphone codes. This works very well. For
my test rig, I added four variants of a field to the document. The four
variants were:
1) name-tokenized-doublemetaphone
2) name-tokenized
3) name-untokenized-doublemetaphone
4) name-untokenized
Here is the code that adds the four variants to the index:
    import org.apache.commons.codec.language.DoubleMetaphone;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    private void addProductNamesToDoc(Document poiDocument, IdentityType id) {
        DoubleMetaphone dm = new DoubleMetaphone();
        dm.setMaxCodeLen(100);
        // For each name in the list of names. A name can be
        // "SCHAAD FAMILY ALMONDS", for example.
        for (Object name : id.getNames().getPOIName()) {
            String text = ((POINameType) name).getText();
            if (log.isDebugEnabled()) log.debug(text);
            if (null != text) {
                // Tokenize manually. (Gosh, I thought the analyser would do this.)
                String[] splits = text.split("\\s");
                // Add tokenized double metaphone and plain tokenized variants of the name.
                for (String component : splits) {
                    poiDocument.add(new Field("name-tokenized-doublemetaphone",
                            dm.doubleMetaphone(component),
                            Field.Store.YES, Field.Index.ANALYZED));
                    poiDocument.add(new Field("name-tokenized", component,
                            Field.Store.YES, Field.Index.ANALYZED));
                }
                // Add untokenized double metaphone and untokenized plain variants.
                poiDocument.add(new Field("name-untokenized-doublemetaphone",
                        dm.doubleMetaphone(text),
                        Field.Store.YES, Field.Index.ANALYZED));
                poiDocument.add(new Field("name-untokenized", text,
                        Field.Store.YES, Field.Index.ANALYZED));
            }
        }
    }
Testing misspelled terms with PhraseQuery shows that only
name-tokenized-doublemetaphone tolerates misspellings. So this seems to
be a nice and efficient way to accept input that is wildly misspelled.
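One possible reading of that result, sketched below in plain Java with no Lucene or Commons Codec on the classpath: if the query is encoded one term at a time, it can only line up with a field that was also encoded one token at a time. The roughKey method is a deliberately crude stand-in that I made up for illustration (it is NOT Double Metaphone), and the class and method names are hypothetical. I am also assuming that encoding a whole multi-word name yields a single code with no spaces, so the untokenized field ends up holding one long term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PhoneticFieldSketch {

    // Toy stand-in for a phonetic encoder (NOT Double Metaphone): keeps the
    // first letter, then drops A, E, I, O, U, Y, H, W, skips non-letters,
    // and collapses a letter that repeats the previously kept one.
    public static String roughKey(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') continue;
            boolean first = sb.length() == 0;
            if (!first && "AEIOUYHW".indexOf(c) >= 0) continue;
            if (!first && sb.charAt(sb.length() - 1) == c) continue;
            sb.append(c);
        }
        return sb.toString();
    }

    // One key per whitespace-separated token, in the spirit of
    // name-tokenized-doublemetaphone.
    public static List<String> tokenizedKeys(String name) {
        List<String> keys = new ArrayList<>();
        for (String token : name.trim().split("\\s+")) {
            keys.add(roughKey(token));
        }
        return keys;
    }

    // A single key for the whole string, in the spirit of
    // name-untokenized-doublemetaphone.
    public static List<String> untokenizedKeys(String name) {
        return Collections.singletonList(roughKey(name));
    }

    // Phrase-style match: the query terms must appear consecutively,
    // in order, among the indexed terms.
    public static boolean phraseMatches(List<String> indexedTerms, List<String> queryTerms) {
        return Collections.indexOfSubList(indexedTerms, queryTerms) >= 0;
    }

    public static void main(String[] args) {
        String indexed = "SCHAAD FAMILY ALMONDS";
        // Misspelled user input, encoded term by term at query time.
        List<String> query = tokenizedKeys("SCHAD FAMELY ALMUNDS");

        System.out.println(phraseMatches(tokenizedKeys(indexed), query));   // true
        System.out.println(phraseMatches(untokenizedKeys(indexed), query)); // false
    }
}
```

The plain name-tokenized field would miss for a simpler reason: the raw term SCHAD is not the same term as SCHAAD, so nothing absorbs the misspelling.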
Can someone explain to me exactly what Field.Store.YES and
Field.Index.ANALYZED do? Should I tune these values?
Geoff Hendrey
Software Architect
deCarta
Four North Second Street, Suite 950
San Jose, CA 95113
office 408.625.3522
www.decarta.com