I've seen discussions about using the double metaphone algorithm with Lucene (basically: like soundex, used
to find words that sound similar, in English at least) but couldn't find an implementation, so I spent
a few minutes and wrote a Query and a TermEnum class for this. I may have missed the prior art, so sorry if I did...
[1] Here are some mail msgs that mention double metaphone wrt Lucene:
http://www.geocrawler.com/archives/3/2626/2000/10/0/4566951/ http://www.geocrawler.com/archives/3/2626/2001/8/50/6382300/ http://www.mail-archive.com/[EMAIL PROTECTED]/msg04648.html
[2] And Phonetix has a double metaphone Analyzer, but not a Query, which I guess is another angle on things:
http://www.tangentum.biz/en/products/phonetix/api/com/tangentum/phonetix/lucene/PhoneticAnalyzer.html
[3] Attached are 2 files (DoubleMetaphoneQuery and DoubleMetaphoneTermEnum) that I think are valid contributions
to the Lucene Sandbox. Hopefully all that has to be done is change the package line if the powers that be accept this.
Note: My impl uses the Jakarta Commons Codec package ( http://jakarta.apache.org/commons/codec/ ) for the double metaphone algorithm implementation.
Also, any query expansion such as this can expand to more clauses than a BooleanQuery allows, thus BooleanQuery.setMaxClauseCount
may need to be used to raise the limit and avoid an exception.
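To make the clause-limit point concrete, here is a rough, Lucene-free sketch of what goes wrong. Everything in it is a hypothetical stand-in: ExpansionLimitSketch and expand are made-up names, the predicate plays the role of the "sounds like" test, and the IllegalStateException stands in for whatever exception Lucene raises when the limit is hit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in, not Lucene code: shows why expanding one query term
// into every matching index term can run into a fixed clause limit.
final class ExpansionLimitSketch {

    // Collects every index term accepted by the matcher, one "clause" per term,
    // and fails as soon as the expansion would exceed maxClauses.
    static List<String> expand(List<String> indexTerms,
                               Predicate<String> matcher,
                               int maxClauses) {
        List<String> clauses = new ArrayList<>();
        for (String term : indexTerms) {
            if (!matcher.test(term)) {
                continue; // term does not match, so it adds no clause
            }
            if (clauses.size() >= maxClauses) {
                throw new IllegalStateException(
                        "expansion exceeds limit of " + maxClauses + " clauses");
            }
            clauses.add(term);
        }
        return clauses;
    }
}
```

With a large index and a loose matcher the number of matching terms is unbounded, which is why raising the limit up front (in Lucene, via BooleanQuery.setMaxClauseCount) may be necessary.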
[4] I've updated my Lucene demo site which has the ~3500 RFCs indexed and searchable by Lucene. I added an "advanced query"
page to try out the DoubleMetaphoneQuery:
It's a few lines down at this URL:
http://www.hostmon.com/rfc/advanced.jsp
[5] Most of the above is redundantly stated here as a kind of perma-link:
http://www.tropo.com/techno/java/lucene/metaphone.html
[6] While it's easy to write additional Query classes, I suspect they are a kind of dead end and won't really be
used unless they are integrated into the QueryParser. Thus one concept is that the Lucene syntax should
have some extension mechanism so you can pass a query like "metaphone::protokal" to it, and "metaphone::"
(note the double colons) would mean to use DoubleMetaphoneQuery for this term. Maybe an extensible query parser
should be the subject of another email?
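A minimal sketch of what such an extension mechanism might look like, kept free of Lucene so it stands alone. The class name, the register/parse methods and the String return values are all hypothetical; a real version would hand back Query objects and would have to live inside the QueryParser grammar.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the "name::text" idea. Strings stand in for the
// Query objects a real parser extension would construct.
final class ExtensibleSyntaxSketch {

    // Maps an extension name such as "metaphone" to a factory for its query.
    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    void register(String name, Function<String, String> factory) {
        handlers.put(name, factory);
    }

    // Tokens of the form "name::text" go to the registered factory;
    // anything else falls through to an ordinary term query.
    String parse(String token) {
        int sep = token.indexOf("::");
        if (sep <= 0) {
            return "TermQuery(" + token + ")";
        }
        String name = token.substring(0, sep);
        String text = token.substring(sep + 2);
        Function<String, String> factory = handlers.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("unknown query extension: " + name);
        }
        return factory.apply(text);
    }
}
```

Registering "metaphone" with a factory that builds a DoubleMetaphoneQuery would then route "metaphone::protokal" to it, while a plain token like "protocol" is handled as before.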
So: let me know if this is useful and plz enter it into the sandbox...
thx, Dave Spencer
package com.tropo.lucene;

/* ====================================================================
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *    if any, must include the following acknowledgment:
 *       "This product includes software developed by the
 *        Apache Software Foundation (http://www.apache.org/)."
 *    Alternately, this acknowledgment may appear in the software itself,
 *    if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation" and
 *    "Apache Lucene" must not be used to endorse or promote products
 *    derived from this software without prior written permission. For
 *    written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called "Apache",
 *    "Apache Lucene", nor may "Apache" appear in their name, without
 *    prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ====================================================================
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation.  For more
 * information on the Apache Software Foundation, please see
 * <http://www.apache.org/>.
 */

import java.io.IOException;

import org.apache.lucene.index.*;
import org.apache.lucene.search.*;

/**
 * A Query that matches documents containing terms whose double metaphone
 * encoding matches that of the specified term, i.e. terms that sound alike.
 */
public final class DoubleMetaphoneQuery extends MultiTermQuery {

    public DoubleMetaphoneQuery(Term term) {
        super(term);
    }

    /** Enumerates the index terms that sound like the query term. */
    protected FilteredTermEnum getEnum(IndexReader reader) throws IOException {
        return new DoubleMetaphoneTermEnum(reader, getTerm());
    }

    public String toString(String field) {
        return super.toString(field); // FIXME: what to do here
    }
}

package com.tropo.lucene;

/* ====================================================================
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *    if any, must include the following acknowledgment:
 *       "This product includes software developed by the
 *        Apache Software Foundation (http://www.apache.org/)."
 *    Alternately, this acknowledgment may appear in the software itself,
 *    if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation" and
 *    "Apache Lucene" must not be used to endorse or promote products
 *    derived from this software without prior written permission. For
 *    written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called "Apache",
 *    "Apache Lucene", nor may "Apache" appear in their name, without
 *    prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ====================================================================
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation.  For more
 * information on the Apache Software Foundation, please see
 * <http://www.apache.org/>.
 */

import java.io.IOException;

import org.apache.commons.codec.language.*;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

/**
 * Subclass of FilteredTermEnum for enumerating all terms that sound similar
 * to the specified filter term, judged by their double metaphone encodings.
 *
 * <p>Term enumerations are always ordered by Term.compareTo().  Each term in
 * the enumeration is greater than all that precede it.
 */
public final class DoubleMetaphoneTermEnum extends FilteredTermEnum {
    boolean endEnum = false;
    Term searchTerm = null;
    String field = "";
    String text = "";

    final DoubleMetaphone m = new DoubleMetaphone();
    final String goal1; // encoding of the search text with alternate == true
    String goal2;       // encoding with alternate == false; null if same as goal1

    public DoubleMetaphoneTermEnum(IndexReader reader, Term term) throws IOException {
        super(reader, term);
        searchTerm = term;
        field = searchTerm.field();
        text = searchTerm.text();

        goal1 = m.doubleMetaphone(text, true);
        goal2 = m.doubleMetaphone(text, false);
        if (goal1.equals(goal2)) {
            goal2 = null; // only one distinct encoding to look for
        }

        // Start at the first term of the field and filter from there.
        setEnum(reader.terms(new Term(searchTerm.field(), "")));
    }

    /**
     The termCompare method in DoubleMetaphoneTermEnum uses ...
     */
    protected final boolean termCompare(Term term) {
        // Term.field() returns an interned string, so == is a valid comparison.
        if (field == term.field()) {
            String s = term.text();
            String try1 = m.doubleMetaphone(s, true);
            String try2 = m.doubleMetaphone(s, false);

            // goal.equals(try) rather than try.equals(goal), so a term whose
            // encoding comes back null is simply treated as a non-match.
            if (goal1.equals(try1)) return true;
            if (goal1.equals(try2)) return true;

            if (goal2 != null) {
                if (goal2.equals(try1)) return true;
                if (goal2.equals(try2)) return true;
            }
            return false;
        }
        // Moved past the last term of the field: stop enumerating.
        endEnum = true;
        return false;
    }

    protected final float difference() {
        return (float) 1.0; // assume all terms that sound alike are equally valuable...
    }

    public final boolean endEnum() {
        return endEnum;
    }

    public void close() throws IOException {
        super.close();
        searchTerm = null;
        field = null;
        text = null;
    }
}