I created a test class for rapid testing that should be runnable out of the 
box, with
LUCENE-SUGGEST-5.0-SNAPSHOT (maven) dependency. (see attachment)

Because I can't subclass from the final FuzzySuggester I subclassed 
AnalyzingSuggester, delegating
all 3 method calls 'convertAutomaton, getFullPrefixPaths and 
getTokenStreamToAutomaton' +
constructor over an internal FuzzySuggester member.

Then I overrided AnalyzingSuggester.lookup(..) by copying it from 
AnalyzingSuggester, sadly invoking
2 methods and reading some field members with reflection api because of their 
private declaration
(alternative would be to copy everything). Everything worked as expected so far.

I added our slight modification - moving the getFullPrefixPaths invocation to 
the first prefix path
creation.

The main class checks a simple scenario with KeywordAnalyzer, three term 
dictionary and some query
term variations.
Here is the output. Sadly some (for me) unexpected results:


Dictionary: [screen, screensaver, mouse]

query: 'screan' - exact result as expected (correct). But not in any case! This 
is when one letter
is changed, which is not the first or last one.
Exact results:
  screen/1
All results: - double entry of 'screen'?
  screen/1
  screen/1
  screensaver/1

query: 'screew' - last letter changed: exact result empty (incorrect).
Exact results:
All results:
  screen/1
  screensaver/1

query: 'wcreen' - first letter changed: nothing found at all.
Exact results:
All results:

query: 'scree' - last letter removed.
Exact results:
All results:
  screen/1
  screensaver/1

query: 'scren' - 5th letter removed. Same as with last removed letter.
Exact results:
All results:
  screen/1
  screensaver/1

query: 'sreen' - 2th letter removed. Why different?
Exact results:
  screen/1
All results: - double entry of 'screen'?
  screen/1
  screen/1
  screensaver/1

query: 'screen' - correct query: screen not found at all?
Exact results:
All results:
  screensaver/1



Now, my latin is at the end (as we say in Germany ;) ). Don't know how to 
proceed further, as the
deeper code starts to become very complex.

Thanks a lot!

Christian Reuschling



On 15.11.2013 18:49, Michael McCandless wrote:
> Hmm, I'm not sure offhand why that change gives you no results.
> 
> The fullPrefixPaths should have been a super-set of the original
> prefix paths, since the LevA just adds further paths.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Thu, Nov 14, 2013 at 2:43 PM, Christian Reuschling
> <christian.reuschl...@gmail.com> wrote:
>> I tried it by changing the first prefixPath initialization to
>>
>> List<FSTUtil.Path<Pair<Long,BytesRef>>> prefixPaths =
>>     FSTUtil.intersectPrefixPaths(convertAutomaton(lookupAutomaton), fst);
>> prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, fst);
>>
>> inside AnalyzingSuggester.lookup(..). (simply copied the line from below)
>>
>> Sadly, FuzzySuggester now gives no hits at all, even with a correct spelled 
>> query.
>>
>> Correct spelled query:
>> prefixPaths size == 1
>> returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, 
>> bytesReader)
>>   (without getFullPrefixPath: non-null)
>>
>> Query within edit distance - the same:
>> prefixPaths size == 1   (without getFullPrefixPath: 0)
>> returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, 
>> bytesReader)
>>
>> Query outside of edit distance:
>> prefixPaths size = 0
>>
>> Seems like the fuzziness is there, but getFullPrefixPaths kicks all 
>> END_BYTEs ?
>>
>>
>>
>> On 14.11.2013 17:05, Michael McCandless wrote:
>>> On Wed, Nov 13, 2013 at 12:04 PM, Christian Reuschling 
>>> <christian.reuschl...@gmail.com> wrote:
>>>> We started to implement a named entity recognition on the base of 
>>>> AnalyzingSuggester, which
>>>> offers the great support for Synonyms, Stopwords, etc. For this, we 
>>>> slightly modified
>>>> AnalyzingSuggester.lookup() to only return the exactFirst hits 
>>>> (considering the exactFirst
>>>> code block only, skipping the 'sameSurfaceForm' check and break, to get 
>>>> the synonym hits
>>>> too).
>>>>
>>>> This works pretty good, and our next step would be to bring in some 
>>>> fuzzyness against
>>>> spelling mistakes. For this, the idea was to do exactly the same, but with 
>>>> FuzzySuggester
>>>> instead.
>>>>
>>>> Now we have the problem that 'EXCACT_FIRST' in FuzzySuggester not only 
>>>> relies on sharing the
>>>> same prefix - also different/misspelled terms inside the edit distance are 
>>>> considered as 'not
>>>> exact', which means we get the same results as with AnalyzingSuggester.
>>>>
>>>>
>>>> query: "screen" misspelled query: "screan" dictionary: "screen", 
>>>> "screensaver"
>>>>
>>>> AnalyzingSuggester hits: screen, screensaver AnalyzingSuggester hits on 
>>>> misspelled query:
>>>> <empty> AnalyzingSuggester EXACT_FIRST hits: screen AnalyzingSuggester 
>>>> EXACT_FIRST hits on
>>>> misspelled query: <empty>
>>>>
>>>> FuzzySuggester hits: screen, screensaver FuzzySuggester hits on misspelled 
>>>> query: screen,
>>>> screensaver FuzzySuggester EXACT_FIRST hits: screen FuzzySuggester 
>>>> EXACT_FIRST hits on
>>>> misspelled query: <empty> => TARGET: screen
>>>>
>>>>
>>>> Is there a possibility to distinguish? I see that the 'exact' criteria 
>>>> relies on an FST
>>>> aspect 'END_BYTE arc leaving'. Maybe these can be set differently when 
>>>> building the
>>>> Levenshtein automata? I have no clue.
>>>
>>> It seems like the problem is that AnalyzingSuggester checks for exactFirst 
>>> before calling
>>> .getFullPrefixPaths (which, in FuzzySuggester subclass, applies the 
>>> fuzziness)?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> --------------------------------------------------------------------- To 
>>> unsubscribe, e-mail:
>>> java-user-unsubscr...@lucene.apache.org For additional commands, e-mail:
>>> java-user-h...@lucene.apache.org
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
package org.apache.lucene.search.suggest.analyzing;



import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStreamToAutomaton;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.search.suggest.InputIterator.InputIteratorWrapper;
import org.apache.lucene.search.suggest.analyzing.FSTUtil.Path;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefIterator;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.Util;
import org.apache.lucene.util.fst.FST.BytesReader;
import org.apache.lucene.util.fst.PairOutputs.Pair;
import org.apache.lucene.util.fst.Util.MinResult;



public class FuzzySuggesterExactMatchOnlyTest extends AnalyzingSuggester
{

    // we can not subclass FuzzySuggester because of its final state...
    FuzzySuggester m_fuzzySuggester;



    public FuzzySuggesterExactMatchOnlyTest(Analyzer analyzer)
    {
        super(analyzer);

        m_fuzzySuggester = new FuzzySuggester(analyzer);
    }





    public static void main(String[] args) throws Exception
    {

        System.out.println("For named entity extraction, the 'exact' results are the only ones of interest.\n"
                + "For fuzzy match, the goal is to get all Levensthein-matching terms as 'exact' results too.");
        FuzzySuggesterExactMatchOnlyTest testSuggester = new FuzzySuggesterExactMatchOnlyTest(new KeywordAnalyzer());

        List<String> lDictionary = Arrays.asList("screen", "screensaver", "mouse");
        System.out.println("\nDictionary: " + lDictionary);
        final Iterator<String> termIterator = lDictionary.iterator();

        testSuggester.build(new InputIteratorWrapper(new BytesRefIterator()
        {

            @Override
            public BytesRef next() throws IOException
            {
                if(!termIterator.hasNext()) return null;

                return new BytesRef(termIterator.next());
            }



            @Override
            public Comparator<BytesRef> getComparator()
            {
                return null;
            }
        }));



        String strQuery = "screan";
        System.out.println("\nquery: '" + strQuery + "' - exact result as expected (correct). But not in any case! "
                + "This is when one letter is changed, which is not the first or last one.");
        List<LookupResult> results = testSuggester.lookup(strQuery, false, 10000);

        System.out.println("All results: - double entry of 'screen'?");
        for (LookupResult result : results)
            System.out.println("  " + result);
        
        strQuery = "screew";
        System.out.println("\nquery: '" + strQuery + "' - last letter changed: exact result empty (incorrect).");
        results = testSuggester.lookup(strQuery, false, 10000);

        System.out.println("All results:");
        for (LookupResult result : results)
            System.out.println("  " + result);
        
        
        strQuery = "wcreen";
        System.out.println("\nquery: '" + strQuery + "' - first letter changed: nothing found at all.");
        results = testSuggester.lookup(strQuery, false, 10000);

        System.out.println("All results:");
        for (LookupResult result : results)
            System.out.println("  " + result);
        
        
        strQuery = "scree";
        System.out.println("\nquery: '" + strQuery + "' - last letter removed.");
        results = testSuggester.lookup(strQuery, false, 10000);

        System.out.println("All results:");
        for (LookupResult result : results)
            System.out.println("  " + result);
        
        strQuery = "scren";
        System.out.println("\nquery: '" + strQuery + "' - 5th letter removed. Same as with last removed letter.");
        results = testSuggester.lookup(strQuery, false, 10000);

        System.out.println("All results:");
        for (LookupResult result : results)
            System.out.println("  " + result);
        
        
        strQuery = "sreen";
        System.out.println("\nquery: '" + strQuery + "' - 2th letter removed. Why different?");
        results = testSuggester.lookup(strQuery, false, 10000);

        System.out.println("All results: - double entry of 'screen'?");
        for (LookupResult result : results)
            System.out.println("  " + result);


        strQuery = "screen";
        System.out.println("\nquery: '" + strQuery + "' - correct query: screen not found at all?");
        results = testSuggester.lookup(strQuery, false, 10000);

        System.out.println("All results:");
        for (LookupResult result : results)
            System.out.println("  " + result);

    }



    // modified lookup method (modifications with NER comments. Access to private super fields/members with reflection API)

    @Override
    public List<LookupResult> lookup(CharSequence key, boolean onlyMorePopular, int num)
    {
        // NER: copy private member references to 'somethingVisible' members for lookup. Verified: these ones won't be modified further in this
        // method.
        makePrivateMembersVisible();

        assert num > 0;

        if(onlyMorePopular)
        {
            throw new IllegalArgumentException("this suggester only works with onlyMorePopular=false");
        }
        if(m_fstVisible == null)
        {
            return Collections.emptyList();
        }

        // System.out.println("lookup key=" + key + " num=" + num);
        for (int i = 0; i < key.length(); i++)
        {
            if(key.charAt(i) == 0x1E)
            {
                throw new IllegalArgumentException("lookup key cannot contain HOLE character U+001E; this character is reserved");
            }
            if(key.charAt(i) == 0x1F)
            {
                throw new IllegalArgumentException("lookup key cannot contain unit separator character U+001F; this character is reserved");
            }
        }
        final BytesRef utf8Key = new BytesRef(key);
        try
        {

            // NER: sadly invocation of private superclass method necessary
            // Automaton lookupAutomaton = toLookupAutomaton(key);
            Automaton lookupAutomaton = (Automaton) invokePrivateMethod(this, "toLookupAutomaton", new Object[] { key });

            final CharsRef spare = new CharsRef();

            // System.out.println("  now intersect exactFirst=" + exactFirst);

            // Intersect automaton w/ suggest wFST and get all
            // prefix starting nodes & their outputs:
            // final PathIntersector intersector = getPathIntersector(lookupAutomaton, fst);

            // System.out.println("  prefixPaths: " + prefixPaths.size());

            BytesReader bytesReader = m_fstVisible.getBytesReader();

            FST.Arc<Pair<Long, BytesRef>> scratchArc = new FST.Arc<Pair<Long, BytesRef>>();

            final List<LookupResult> results = new ArrayList<LookupResult>();


            // List<FSTUtil.Path<Pair<Long, BytesRef>>> prefixPaths = FSTUtil.intersectPrefixPaths(convertAutomaton(lookupAutomaton), m_fstVisible);
            // NER: Modification according email from Michael McCandless
            // prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, m_fst);
            List<FSTUtil.Path<Pair<Long, BytesRef>>> prefixPaths = getFullPrefixPaths(null, lookupAutomaton, m_fstVisible);

            if(m_exactFirstVisible)
            {

                int count = 0;
                for (FSTUtil.Path<Pair<Long, BytesRef>> path : prefixPaths)
                {
                    if(m_fstVisible.findTargetArc(m_END_BYTEVisible, path.fstNode, scratchArc, bytesReader) != null)
                    {
                        // This node has END_BYTE arc leaving, meaning it's an
                        // "exact" match:
                        count++;
                    }
                }

                // Searcher just to find the single exact only
                // match, if present:
                Util.TopNSearcher<Pair<Long, BytesRef>> searcher =
                        new Util.TopNSearcher<Pair<Long, BytesRef>>(m_fstVisible, count * m_maxSurfaceFormsPerAnalyzedFormVisible, count
                                * m_maxSurfaceFormsPerAnalyzedFormVisible, m_weightComparatorVisible);

                // NOTE: we could almost get away with only using
                // the first start node. The only catch is if
                // maxSurfaceFormsPerAnalyzedForm had kicked in and
                // pruned our exact match from one of these nodes
                // ...:
                for (FSTUtil.Path<Pair<Long, BytesRef>> path : prefixPaths)
                {
                    if(m_fstVisible.findTargetArc(m_END_BYTEVisible, path.fstNode, scratchArc, bytesReader) != null)
                    {
                        // This node has END_BYTE arc leaving, meaning it's an
                        // "exact" match:
                        // NER: sadly access of private member necessary
                        Pair<Long, BytesRef> output = (Pair<Long, BytesRef>) getPrivateField(path, "output");
                        searcher.addStartPaths(scratchArc, m_fstVisible.outputs.add(output, scratchArc.output), false, path.input);
                    }
                }

                MinResult<Pair<Long, BytesRef>> completions[] = searcher.search();

                // NOTE: this is rather inefficient: we enumerate
                // every matching "exactly the same analyzed form"
                // path, and then do linear scan to see if one of
                // these exactly matches the input. It should be
                // possible (though hairy) to do something similar
                // to getByOutput, since the surface form is encoded
                // into the FST output, so we more efficiently hone
                // in on the exact surface-form match. Still, I
                // suspect very little time is spent in this linear
                // seach: it's bounded by how many prefix start
                // nodes we have and the
                // maxSurfaceFormsPerAnalyzedForm:
                for (MinResult<Pair<Long, BytesRef>> completion : completions)
                {
                    BytesRef output2 = completion.output.output2;

                    // NER: we want to have the several manifestations/synonyms in the results, so we commented out this final check
                    // boolean bSameSurfaceForm =
                    // (boolean) PrivateAccessor.invokePrivateMethod(this, "sameSurfaceForm", new Object[] { utf8Key, output2 });
                    // if(bSameSurfaceForm)
                    {
                        // NER: sadly invocation of private superclass method necessary
                        LookupResult lookupResult =
                                (LookupResult) invokePrivateMethod(this, "getLookupResult",
                                        new Object[] { completion.output.output1, output2, spare });
                        results.add(lookupResult);
                        // NER: don't break - we want to have all synonym results
                        // break;
                    }
                }

                if(results.size() == num)
                {
                    // That was quick:
                    return results;
                }
            }


            // NER: these are the exact results. We print them out. These are the only ones of interest
            System.out.println("Exact results:");
            for (LookupResult result : results)
                System.out.println("  " + result);




            Util.TopNSearcher<Pair<Long, BytesRef>> searcher;
            searcher =
                    new Util.TopNSearcher<Pair<Long, BytesRef>>(m_fstVisible, num - results.size(), num * m_maxAnalyzedPathsForOneInputVisible,
                            m_weightComparatorVisible)
                    {
                        private final Set<BytesRef> seen = new HashSet<BytesRef>();



                        @Override
                        protected boolean acceptResult(IntsRef input, Pair<Long, BytesRef> output)
                        {

                            try
                            {
                                // Dedup: when the input analyzes to a graph we
                                // can get duplicate surface forms:
                                if(seen.contains(output.output2))
                                {
                                    return false;
                                }
                                seen.add(output.output2);

                                if(!m_exactFirstVisible)
                                {
                                    return true;
                                }
                                else
                                {
                                    // In exactFirst mode, don't accept any paths
                                    // matching the surface form since that will
                                    // create duplicate results:
                                    // NER: sadly invocation of private superclass method necessary
                                    boolean bSameSurfaceForm =
                                            (boolean) invokePrivateMethod(FuzzySuggesterExactMatchOnlyTest.this, "sameSurfaceForm", new Object[] {
                                                    utf8Key, output.output2 });
                                    if(bSameSurfaceForm)
                                    {
                                        // We found exact match, which means we should
                                        // have already found it in the first search:
                                        assert results.size() == 1;
                                        return false;
                                    }
                                    else
                                    {
                                        return true;
                                    }
                                }
                            }
                            catch (Exception e)
                            {
                                throw new RuntimeException(e);

                            }
                        }
                    };

            // NER: since we invoked this above, we don't need this here again
            // prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, m_fstVisible);

            for (FSTUtil.Path<Pair<Long, BytesRef>> path : prefixPaths)
            {
                // NER: sadly access of private member necessary
                Pair<Long, BytesRef> output = (Pair<Long, BytesRef>) getPrivateField(path, "output");
                searcher.addStartPaths(path.fstNode, output, true, path.input);
            }

            MinResult<Pair<Long, BytesRef>> completions[] = searcher.search();

            for (MinResult<Pair<Long, BytesRef>> completion : completions)
            {
                // NER: sadly invocation of private superclass method necessary
                LookupResult result =
                        (LookupResult) invokePrivateMethod(this, "getLookupResult", new Object[] { completion.output.output1,
                                completion.output.output2, spare });

                // TODO: for fuzzy case would be nice to return
                // how many edits were required

                // System.out.println("    result=" + result);
                results.add(result);

                if(results.size() == num)
                {
                    // In the exactFirst=true case the search may
                    // produce one extra path
                    break;
                }
            }

            return results;
        }
        catch (Exception bogus)
        {
            throw new RuntimeException(bogus);
        }

    }




    protected FST<Pair<Long, BytesRef>> m_fstVisible;

    protected boolean m_exactFirstVisible;

    protected int m_maxSurfaceFormsPerAnalyzedFormVisible;

    protected int m_END_BYTEVisible;

    protected Comparator<Pair<Long, BytesRef>> m_weightComparatorVisible;

    protected int m_maxAnalyzedPathsForOneInputVisible;





    // sadly some reflection API reference copies because of private members use
    @SuppressWarnings("unchecked")
    protected void makePrivateMembersVisible()
    {
        try
        {
            m_fstVisible = (FST<Pair<Long, BytesRef>>) getPrivateField(this, "fst");
            m_exactFirstVisible = (boolean) getPrivateField(this, "exactFirst");
            m_maxSurfaceFormsPerAnalyzedFormVisible = (int) getPrivateField(this, "maxSurfaceFormsPerAnalyzedForm");
            m_END_BYTEVisible = (int) getPrivateField(this, "END_BYTE");
            m_weightComparatorVisible = (Comparator<Pair<Long, BytesRef>>) getPrivateField(this, "weightComparator");
            m_maxAnalyzedPathsForOneInputVisible = (int) getPrivateField(this, "maxAnalyzedPathsForOneInput");

        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }

    }



    @SuppressWarnings("rawtypes")
    public static Object invokePrivateMethod(Object alcatrazObject, String methodName, Object[] params) throws IllegalArgumentException,
            IllegalAccessException, InvocationTargetException, NoSuchMethodException
    {
        // Check we have valid arguments...
        if(alcatrazObject == null || methodName == null || params == null)
            throw new IllegalStateException("Object, methodName and params must not be null");


        // Go and find the private method...


        Class currentClass = alcatrazObject.getClass();

        while (currentClass != null)
        {

            final Method methods[] = currentClass.getDeclaredMethods();

            for (int i = 0; i < methods.length; ++i)
            {
                if(methodName.equals(methods[i].getName()))
                {
                    methods[i].setAccessible(true);
                    return methods[i].invoke(alcatrazObject, params);
                }
            }

            currentClass = currentClass.getSuperclass();
        }

        throw new NoSuchMethodException("Method '" + methodName + "' not found");
    }



    @SuppressWarnings("rawtypes")
    public static Object getPrivateField(Object alcatrazObject, String fieldName) throws IllegalArgumentException, IllegalAccessException,
            NoSuchFieldException
    {
        // Check we have valid arguments...
        if(alcatrazObject == null || fieldName == null) throw new IllegalStateException("Object or fieldname must not be null");


        // Go and find the private field...


        Class currentClass = alcatrazObject.getClass();

        while (currentClass != null)
        {
            final Field fields[] = currentClass.getDeclaredFields();

            for (int i = 0; i < fields.length; ++i)
            {
                if(fieldName.equals(fields[i].getName()))
                {
                    fields[i].setAccessible(true);

                    return fields[i].get(alcatrazObject);
                }
            }

            currentClass = currentClass.getSuperclass();
        }

        throw new NoSuchFieldException("Field '" + fieldName + "' not found");
    }







    // delegate methods to internal FuzzySuggester

    @Override
    protected Automaton convertAutomaton(Automaton a)
    {
        return m_fuzzySuggester.convertAutomaton(a);
    }



    @Override
    protected List<Path<Pair<Long, BytesRef>>> getFullPrefixPaths(List<Path<Pair<Long, BytesRef>>> prefixPaths, Automaton lookupAutomaton,
            FST<Pair<Long, BytesRef>> fst) throws IOException
    {
        return m_fuzzySuggester.getFullPrefixPaths(prefixPaths, lookupAutomaton, fst);
    }



    @Override
    TokenStreamToAutomaton getTokenStreamToAutomaton()
    {
        return m_fuzzySuggester.getTokenStreamToAutomaton();
    }

}

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to