I created a test class for rapid testing that should be runnable out of the box, with LUCENE-SUGGEST-5.0-SNAPSHOT (maven) dependency. (see attachment)
Because I can't subclass from the final FuzzySuggester I subclassed AnalyzingSuggester, delegating all 3 method calls 'convertAutomaton, getFullPrefixPaths and getTokenStreamToAutomaton' + constructor over an internal FuzzySuggester member. Then I overrided AnalyzingSuggester.lookup(..) by copying it from AnalyzingSuggester, sadly invoking 2 methods and reading some field members with reflection api because of their private declaration (alternative would be to copy everything). Everything worked as expected so far. I added our slight modification - moving the getFullPrefixPaths invocation to the first prefix path creation. The main class checks a simple scenario with KeywordAnalyzer, three term dictionary and some query term variations. Here is the output. Sadly some (for me) unexpected results: Dictionary: [screen, screensaver, mouse] query: 'screan' - exact result as expected (correct). But not in any case! This is when one letter is changed, which is not the first or last one. Exact results: screen/1 All results: - double entry of 'screen'? screen/1 screen/1 screensaver/1 query: 'screew' - last letter changed: exact result empty (incorrect). Exact results: All results: screen/1 screensaver/1 query: 'wcreen' - first letter changed: nothing found at all. Exact results: All results: query: 'scree' - last letter removed. Exact results: All results: screen/1 screensaver/1 query: 'scren' - 5th letter removed. Same as with last removed letter. Exact results: All results: screen/1 screensaver/1 query: 'sreen' - 2th letter removed. Why different? Exact results: screen/1 All results: - double entry of 'screen'? screen/1 screen/1 screensaver/1 query: 'screen' - correct query: screen not found at all? Exact results: All results: screensaver/1 Now, my latin is at the end (as we say in Germany ;) ). Don't know how to proceed further, as the deeper code starts to become very complex. Thanks a lot! Christian Reuschling On 15.11.2013 18:49, Michael McCandless wrote: > Hmm, I'm not sure offhand why that change gives you no results. > > The fullPrefixPaths should have been a super-set of the original > prefix paths, since the LevA just adds further paths. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Thu, Nov 14, 2013 at 2:43 PM, Christian Reuschling > <christian.reuschl...@gmail.com> wrote: >> I tried it by changing the first prefixPath initialization to >> >> List<FSTUtil.Path<Pair<Long,BytesRef>>> prefixPaths = >> FSTUtil.intersectPrefixPaths(convertAutomaton(lookupAutomaton), fst); >> prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, fst); >> >> inside AnalyzingSuggester.lookup(..). (simply copied the line from below) >> >> Sadly, FuzzySuggester now gives no hits at all, even with a correct spelled >> query. >> >> Correct spelled query: >> prefixPaths size == 1 >> returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, >> bytesReader) >> (without getFullPrefixPath: non-null) >> >> Query within edit distance - the same: >> prefixPaths size == 1 (without getFullPrefixPath: 0) >> returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, >> bytesReader) >> >> Query outside of edit distance: >> prefixPaths size = 0 >> >> Seems like the fuzziness is there, but getFullPrefixPaths kicks all >> END_BYTEs ? >> >> >> >> On 14.11.2013 17:05, Michael McCandless wrote: >>> On Wed, Nov 13, 2013 at 12:04 PM, Christian Reuschling >>> <christian.reuschl...@gmail.com> wrote: >>>> We started to implement a named entity recognition on the base of >>>> AnalyzingSuggester, which >>>> offers the great support for Synonyms, Stopwords, etc. For this, we >>>> slightly modified >>>> AnalyzingSuggester.lookup() to only return the exactFirst hits >>>> (considering the exactFirst >>>> code block only, skipping the 'sameSurfaceForm' check and break, to get >>>> the synonym hits >>>> too). >>>> >>>> This works pretty good, and our next step would be to bring in some >>>> fuzzyness against >>>> spelling mistakes. For this, the idea was to do exactly the same, but with >>>> FuzzySuggester >>>> instead. >>>> >>>> Now we have the problem that 'EXCACT_FIRST' in FuzzySuggester not only >>>> relies on sharing the >>>> same prefix - also different/misspelled terms inside the edit distance are >>>> considered as 'not >>>> exact', which means we get the same results as with AnalyzingSuggester. >>>> >>>> >>>> query: "screen" misspelled query: "screan" dictionary: "screen", >>>> "screensaver" >>>> >>>> AnalyzingSuggester hits: screen, screensaver AnalyzingSuggester hits on >>>> misspelled query: >>>> <empty> AnalyzingSuggester EXACT_FIRST hits: screen AnalyzingSuggester >>>> EXACT_FIRST hits on >>>> misspelled query: <empty> >>>> >>>> FuzzySuggester hits: screen, screensaver FuzzySuggester hits on misspelled >>>> query: screen, >>>> screensaver FuzzySuggester EXACT_FIRST hits: screen FuzzySuggester >>>> EXACT_FIRST hits on >>>> misspelled query: <empty> => TARGET: screen >>>> >>>> >>>> Is there a possibility to distinguish? I see that the 'exact' criteria >>>> relies on an FST >>>> aspect 'END_BYTE arc leaving'. Maybe these can be set differently when >>>> building the >>>> Levenshtein automata? I have no clue. >>> >>> It seems like the problem is that AnalyzingSuggester checks for exactFirst >>> before calling >>> .getFullPrefixPaths (which, in FuzzySuggester subclass, applies the >>> fuzziness)? >>> >>> Mike McCandless >>> >>> http://blog.mikemccandless.com >>> >>> --------------------------------------------------------------------- To >>> unsubscribe, e-mail: >>> java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: >>> java-user-h...@lucene.apache.org >>> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org >
package org.apache.lucene.search.suggest.analyzing; import java.io.IOException; import java.lang.reflect.Field; import java.lang.reflect.InvocationTargetException; import java.lang.reflect.Method; import java.util.ArrayList; import java.util.Arrays; import java.util.Collections; import java.util.Comparator; import java.util.HashSet; import java.util.Iterator; import java.util.List; import java.util.Set; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.TokenStreamToAutomaton; import org.apache.lucene.analysis.core.KeywordAnalyzer; import org.apache.lucene.search.suggest.InputIterator.InputIteratorWrapper; import org.apache.lucene.search.suggest.analyzing.FSTUtil.Path; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefIterator; import org.apache.lucene.util.CharsRef; import org.apache.lucene.util.IntsRef; import org.apache.lucene.util.automaton.Automaton; import org.apache.lucene.util.fst.FST; import org.apache.lucene.util.fst.Util; import org.apache.lucene.util.fst.FST.BytesReader; import org.apache.lucene.util.fst.PairOutputs.Pair; import org.apache.lucene.util.fst.Util.MinResult; public class FuzzySuggesterExactMatchOnlyTest extends AnalyzingSuggester { // we can not subclass FuzzySuggester because of its final state... FuzzySuggester m_fuzzySuggester; public FuzzySuggesterExactMatchOnlyTest(Analyzer analyzer) { super(analyzer); m_fuzzySuggester = new FuzzySuggester(analyzer); } public static void main(String[] args) throws Exception { System.out.println("For named entity extraction, the 'exact' results are the only ones of interest.\n" + "For fuzzy match, the goal is to get all Levensthein-matching terms as 'exact' results too."); FuzzySuggesterExactMatchOnlyTest testSuggester = new FuzzySuggesterExactMatchOnlyTest(new KeywordAnalyzer()); List<String> lDictionary = Arrays.asList("screen", "screensaver", "mouse"); System.out.println("\nDictionary: " + lDictionary); final Iterator<String> termIterator = lDictionary.iterator(); testSuggester.build(new InputIteratorWrapper(new BytesRefIterator() { @Override public BytesRef next() throws IOException { if(!termIterator.hasNext()) return null; return new BytesRef(termIterator.next()); } @Override public Comparator<BytesRef> getComparator() { return null; } })); String strQuery = "screan"; System.out.println("\nquery: '" + strQuery + "' - exact result as expected (correct). But not in any case! " + "This is when one letter is changed, which is not the first or last one."); List<LookupResult> results = testSuggester.lookup(strQuery, false, 10000); System.out.println("All results: - double entry of 'screen'?"); for (LookupResult result : results) System.out.println(" " + result); strQuery = "screew"; System.out.println("\nquery: '" + strQuery + "' - last letter changed: exact result empty (incorrect)."); results = testSuggester.lookup(strQuery, false, 10000); System.out.println("All results:"); for (LookupResult result : results) System.out.println(" " + result); strQuery = "wcreen"; System.out.println("\nquery: '" + strQuery + "' - first letter changed: nothing found at all."); results = testSuggester.lookup(strQuery, false, 10000); System.out.println("All results:"); for (LookupResult result : results) System.out.println(" " + result); strQuery = "scree"; System.out.println("\nquery: '" + strQuery + "' - last letter removed."); results = testSuggester.lookup(strQuery, false, 10000); System.out.println("All results:"); for (LookupResult result : results) System.out.println(" " + result); strQuery = "scren"; System.out.println("\nquery: '" + strQuery + "' - 5th letter removed. Same as with last removed letter."); results = testSuggester.lookup(strQuery, false, 10000); System.out.println("All results:"); for (LookupResult result : results) System.out.println(" " + result); strQuery = "sreen"; System.out.println("\nquery: '" + strQuery + "' - 2th letter removed. Why different?"); results = testSuggester.lookup(strQuery, false, 10000); System.out.println("All results: - double entry of 'screen'?"); for (LookupResult result : results) System.out.println(" " + result); strQuery = "screen"; System.out.println("\nquery: '" + strQuery + "' - correct query: screen not found at all?"); results = testSuggester.lookup(strQuery, false, 10000); System.out.println("All results:"); for (LookupResult result : results) System.out.println(" " + result); } // modified lookup method (modifications with NER comments. Access to private super fields/members with reflection API) @Override public List<LookupResult> lookup(CharSequence key, boolean onlyMorePopular, int num) { // NER: copy private member references to 'somethingVisible' members for lookup. Verified: these ones won't be modified further in this // method. makePrivateMembersVisible(); assert num > 0; if(onlyMorePopular) { throw new IllegalArgumentException("this suggester only works with onlyMorePopular=false"); } if(m_fstVisible == null) { return Collections.emptyList(); } // System.out.println("lookup key=" + key + " num=" + num); for (int i = 0; i < key.length(); i++) { if(key.charAt(i) == 0x1E) { throw new IllegalArgumentException("lookup key cannot contain HOLE character U+001E; this character is reserved"); } if(key.charAt(i) == 0x1F) { throw new IllegalArgumentException("lookup key cannot contain unit separator character U+001F; this character is reserved"); } } final BytesRef utf8Key = new BytesRef(key); try { // NER: sadly invocation of private superclass method necessary // Automaton lookupAutomaton = toLookupAutomaton(key); Automaton lookupAutomaton = (Automaton) invokePrivateMethod(this, "toLookupAutomaton", new Object[] { key }); final CharsRef spare = new CharsRef(); // System.out.println(" now intersect exactFirst=" + exactFirst); // Intersect automaton w/ suggest wFST and get all // prefix starting nodes & their outputs: // final PathIntersector intersector = getPathIntersector(lookupAutomaton, fst); // System.out.println(" prefixPaths: " + prefixPaths.size()); BytesReader bytesReader = m_fstVisible.getBytesReader(); FST.Arc<Pair<Long, BytesRef>> scratchArc = new FST.Arc<Pair<Long, BytesRef>>(); final List<LookupResult> results = new ArrayList<LookupResult>(); // List<FSTUtil.Path<Pair<Long, BytesRef>>> prefixPaths = FSTUtil.intersectPrefixPaths(convertAutomaton(lookupAutomaton), m_fstVisible); // NER: Modification according email from Michael McCandless // prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, m_fst); List<FSTUtil.Path<Pair<Long, BytesRef>>> prefixPaths = getFullPrefixPaths(null, lookupAutomaton, m_fstVisible); if(m_exactFirstVisible) { int count = 0; for (FSTUtil.Path<Pair<Long, BytesRef>> path : prefixPaths) { if(m_fstVisible.findTargetArc(m_END_BYTEVisible, path.fstNode, scratchArc, bytesReader) != null) { // This node has END_BYTE arc leaving, meaning it's an // "exact" match: count++; } } // Searcher just to find the single exact only // match, if present: Util.TopNSearcher<Pair<Long, BytesRef>> searcher = new Util.TopNSearcher<Pair<Long, BytesRef>>(m_fstVisible, count * m_maxSurfaceFormsPerAnalyzedFormVisible, count * m_maxSurfaceFormsPerAnalyzedFormVisible, m_weightComparatorVisible); // NOTE: we could almost get away with only using // the first start node. The only catch is if // maxSurfaceFormsPerAnalyzedForm had kicked in and // pruned our exact match from one of these nodes // ...: for (FSTUtil.Path<Pair<Long, BytesRef>> path : prefixPaths) { if(m_fstVisible.findTargetArc(m_END_BYTEVisible, path.fstNode, scratchArc, bytesReader) != null) { // This node has END_BYTE arc leaving, meaning it's an // "exact" match: // NER: sadly access of private member necessary Pair<Long, BytesRef> output = (Pair<Long, BytesRef>) getPrivateField(path, "output"); searcher.addStartPaths(scratchArc, m_fstVisible.outputs.add(output, scratchArc.output), false, path.input); } } MinResult<Pair<Long, BytesRef>> completions[] = searcher.search(); // NOTE: this is rather inefficient: we enumerate // every matching "exactly the same analyzed form" // path, and then do linear scan to see if one of // these exactly matches the input. It should be // possible (though hairy) to do something similar // to getByOutput, since the surface form is encoded // into the FST output, so we more efficiently hone // in on the exact surface-form match. Still, I // suspect very little time is spent in this linear // seach: it's bounded by how many prefix start // nodes we have and the // maxSurfaceFormsPerAnalyzedForm: for (MinResult<Pair<Long, BytesRef>> completion : completions) { BytesRef output2 = completion.output.output2; // NER: we want to have the several manifestations/synonyms in the results, so we commented out this final check // boolean bSameSurfaceForm = // (boolean) PrivateAccessor.invokePrivateMethod(this, "sameSurfaceForm", new Object[] { utf8Key, output2 }); // if(bSameSurfaceForm) { // NER: sadly invocation of private superclass method necessary LookupResult lookupResult = (LookupResult) invokePrivateMethod(this, "getLookupResult", new Object[] { completion.output.output1, output2, spare }); results.add(lookupResult); // NER: don't break - we want to have all synonym results // break; } } if(results.size() == num) { // That was quick: return results; } } // NER: these are the exact results. We print them out. These are the only ones of interest System.out.println("Exact results:"); for (LookupResult result : results) System.out.println(" " + result); Util.TopNSearcher<Pair<Long, BytesRef>> searcher; searcher = new Util.TopNSearcher<Pair<Long, BytesRef>>(m_fstVisible, num - results.size(), num * m_maxAnalyzedPathsForOneInputVisible, m_weightComparatorVisible) { private final Set<BytesRef> seen = new HashSet<BytesRef>(); @Override protected boolean acceptResult(IntsRef input, Pair<Long, BytesRef> output) { try { // Dedup: when the input analyzes to a graph we // can get duplicate surface forms: if(seen.contains(output.output2)) { return false; } seen.add(output.output2); if(!m_exactFirstVisible) { return true; } else { // In exactFirst mode, don't accept any paths // matching the surface form since that will // create duplicate results: // NER: sadly invocation of private superclass method necessary boolean bSameSurfaceForm = (boolean) invokePrivateMethod(FuzzySuggesterExactMatchOnlyTest.this, "sameSurfaceForm", new Object[] { utf8Key, output.output2 }); if(bSameSurfaceForm) { // We found exact match, which means we should // have already found it in the first search: assert results.size() == 1; return false; } else { return true; } } } catch (Exception e) { throw new RuntimeException(e); } } }; // NER: since we invoked this above, we don't need this here again // prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, m_fstVisible); for (FSTUtil.Path<Pair<Long, BytesRef>> path : prefixPaths) { // NER: sadly access of private member necessary Pair<Long, BytesRef> output = (Pair<Long, BytesRef>) getPrivateField(path, "output"); searcher.addStartPaths(path.fstNode, output, true, path.input); } MinResult<Pair<Long, BytesRef>> completions[] = searcher.search(); for (MinResult<Pair<Long, BytesRef>> completion : completions) { // NER: sadly invocation of private superclass method necessary LookupResult result = (LookupResult) invokePrivateMethod(this, "getLookupResult", new Object[] { completion.output.output1, completion.output.output2, spare }); // TODO: for fuzzy case would be nice to return // how many edits were required // System.out.println(" result=" + result); results.add(result); if(results.size() == num) { // In the exactFirst=true case the search may // produce one extra path break; } } return results; } catch (Exception bogus) { throw new RuntimeException(bogus); } } protected FST<Pair<Long, BytesRef>> m_fstVisible; protected boolean m_exactFirstVisible; protected int m_maxSurfaceFormsPerAnalyzedFormVisible; protected int m_END_BYTEVisible; protected Comparator<Pair<Long, BytesRef>> m_weightComparatorVisible; protected int m_maxAnalyzedPathsForOneInputVisible; // sadly some reflection API reference copies because of private members use @SuppressWarnings("unchecked") protected void makePrivateMembersVisible() { try { m_fstVisible = (FST<Pair<Long, BytesRef>>) getPrivateField(this, "fst"); m_exactFirstVisible = (boolean) getPrivateField(this, "exactFirst"); m_maxSurfaceFormsPerAnalyzedFormVisible = (int) getPrivateField(this, "maxSurfaceFormsPerAnalyzedForm"); m_END_BYTEVisible = (int) getPrivateField(this, "END_BYTE"); m_weightComparatorVisible = (Comparator<Pair<Long, BytesRef>>) getPrivateField(this, "weightComparator"); m_maxAnalyzedPathsForOneInputVisible = (int) getPrivateField(this, "maxAnalyzedPathsForOneInput"); } catch (Exception e) { throw new RuntimeException(e); } } @SuppressWarnings("rawtypes") public static Object invokePrivateMethod(Object alcatrazObject, String methodName, Object[] params) throws IllegalArgumentException, IllegalAccessException, InvocationTargetException, NoSuchMethodException { // Check we have valid arguments... if(alcatrazObject == null || methodName == null || params == null) throw new IllegalStateException("Object, methodName and params must not be null"); // Go and find the private method... Class currentClass = alcatrazObject.getClass(); while (currentClass != null) { final Method methods[] = currentClass.getDeclaredMethods(); for (int i = 0; i < methods.length; ++i) { if(methodName.equals(methods[i].getName())) { methods[i].setAccessible(true); return methods[i].invoke(alcatrazObject, params); } } currentClass = currentClass.getSuperclass(); } throw new NoSuchMethodException("Method '" + methodName + "' not found"); } @SuppressWarnings("rawtypes") public static Object getPrivateField(Object alcatrazObject, String fieldName) throws IllegalArgumentException, IllegalAccessException, NoSuchFieldException { // Check we have valid arguments... if(alcatrazObject == null || fieldName == null) throw new IllegalStateException("Object or fieldname must not be null"); // Go and find the private field... Class currentClass = alcatrazObject.getClass(); while (currentClass != null) { final Field fields[] = currentClass.getDeclaredFields(); for (int i = 0; i < fields.length; ++i) { if(fieldName.equals(fields[i].getName())) { fields[i].setAccessible(true); return fields[i].get(alcatrazObject); } } currentClass = currentClass.getSuperclass(); } throw new NoSuchFieldException("Field '" + fieldName + "' not found"); } // delegate methods to internal FuzzySuggester @Override protected Automaton convertAutomaton(Automaton a) { return m_fuzzySuggester.convertAutomaton(a); } @Override protected List<Path<Pair<Long, BytesRef>>> getFullPrefixPaths(List<Path<Pair<Long, BytesRef>>> prefixPaths, Automaton lookupAutomaton, FST<Pair<Long, BytesRef>> fst) throws IOException { return m_fuzzySuggester.getFullPrefixPaths(prefixPaths, lookupAutomaton, fst); } @Override TokenStreamToAutomaton getTokenStreamToAutomaton() { return m_fuzzySuggester.getTokenStreamToAutomaton(); } }
--------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org