Hmm, I'm not sure offhand why that change gives you no results. The fullPrefixPaths should have been a superset of the original prefix paths, since the LevA just adds further paths.
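To make that concrete, here is a tiny standalone check against the core
automaton classes (BasicAutomata/BasicOperations in 4.x lucene-core). It uses
the raw query string instead of the analyzed form and ignores FuzzySuggester's
nonFuzzyPrefix/minFuzzyLength handling, so treat it as a sketch of the idea,
not what the suggester literally builds:

  import org.apache.lucene.util.automaton.Automaton;
  import org.apache.lucene.util.automaton.BasicAutomata;
  import org.apache.lucene.util.automaton.BasicOperations;
  import org.apache.lucene.util.automaton.LevenshteinAutomata;

  public class LevASupersetCheck {
    public static void main(String[] args) {
      // Exact automaton for the (simplified) query vs. the Levenshtein
      // automaton with maxEdits=1 that the fuzzy path intersects with the FST.
      Automaton exact = BasicAutomata.makeString("screan");
      Automaton levA = new LevenshteinAutomata("screan", false).toAutomaton(1);

      // The exact language is contained in the fuzzy one, so intersecting
      // the LevA with the suggest FST can only add prefix paths, never drop them.
      System.out.println(BasicOperations.subsetOf(exact, levA));  // expect: true
      System.out.println(BasicOperations.subsetOf(
          BasicAutomata.makeString("screen"), levA));             // expect: true (one edit away)
    }
  }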
Mike McCandless

http://blog.mikemccandless.com


On Thu, Nov 14, 2013 at 2:43 PM, Christian Reuschling
<christian.reuschl...@gmail.com> wrote:
> I tried it by changing the first prefixPaths initialization to
>
>     List<FSTUtil.Path<Pair<Long,BytesRef>>> prefixPaths =
>         FSTUtil.intersectPrefixPaths(convertAutomaton(lookupAutomaton), fst);
>     prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, fst);
>
> inside AnalyzingSuggester.lookup(..) (I simply copied the line from further below).
>
> Sadly, FuzzySuggester now gives no hits at all, even for a correctly spelled
> query.
>
> Correctly spelled query:
>   prefixPaths size == 1
>   fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, bytesReader) returns null
>   (without getFullPrefixPaths: non-null)
>
> Query within the edit distance - the same:
>   prefixPaths size == 1 (without getFullPrefixPaths: 0)
>   fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, bytesReader) returns null
>
> Query outside the edit distance:
>   prefixPaths size == 0
>
> It seems like the fuzziness is there, but getFullPrefixPaths drops all the
> END_BYTE arcs?
>
>
> On 14.11.2013 17:05, Michael McCandless wrote:
>> On Wed, Nov 13, 2013 at 12:04 PM, Christian Reuschling
>> <christian.reuschl...@gmail.com> wrote:
>>> We started to implement named entity recognition on the basis of
>>> AnalyzingSuggester, which offers great support for synonyms, stopwords,
>>> etc. For this, we slightly modified AnalyzingSuggester.lookup() to return
>>> only the exactFirst hits (considering only the exactFirst code block and
>>> skipping the 'sameSurfaceForm' check and break, so that we get the synonym
>>> hits too).
>>>
>>> This works pretty well, and our next step would be to bring in some
>>> fuzziness against spelling mistakes. For this, the idea was to do exactly
>>> the same, but with FuzzySuggester instead.
>>>
>>> Now we have the problem that 'EXACT_FIRST' in FuzzySuggester not only
>>> relies on sharing the same prefix - different/misspelled terms within the
>>> edit distance are also considered 'not exact', which means we get the same
>>> results as with AnalyzingSuggester.
>>>
>>> query: "screen"   misspelled query: "screan"   dictionary: "screen", "screensaver"
>>>
>>> AnalyzingSuggester hits: screen, screensaver
>>> AnalyzingSuggester hits on misspelled query: <empty>
>>> AnalyzingSuggester EXACT_FIRST hits: screen
>>> AnalyzingSuggester EXACT_FIRST hits on misspelled query: <empty>
>>>
>>> FuzzySuggester hits: screen, screensaver
>>> FuzzySuggester hits on misspelled query: screen, screensaver
>>> FuzzySuggester EXACT_FIRST hits: screen
>>> FuzzySuggester EXACT_FIRST hits on misspelled query: <empty>  => TARGET: screen
>>>
>>> Is there a possibility to distinguish these cases? I see that the 'exact'
>>> criterion relies on an FST aspect ('END_BYTE arc leaving'). Maybe these can
>>> be set differently when building the Levenshtein automata? I have no clue.
>>
>> It seems like the problem is that AnalyzingSuggester checks for exactFirst
>> before calling getFullPrefixPaths (which, in the FuzzySuggester subclass,
>> applies the fuzziness)?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
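For reference, here is a small standalone repro of the stock behavior in the
matrix Christian posted above (the unmodified suggesters, i.e. without the
exactFirst-only change). It is only a sketch against the 4.5-era suggest API:
WhitespaceAnalyzer, Version.LUCENE_45, the default FuzzySuggester settings
(maxEdits=1, nonFuzzyPrefix=1, minFuzzyLength=3) and the
TermFreq/TermFreqArrayIterator classes (renamed to Input/InputArrayIterator in
later releases) are assumptions about the setup, not details from the thread:

  import java.util.List;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
  import org.apache.lucene.search.suggest.Lookup.LookupResult;
  import org.apache.lucene.search.suggest.TermFreq;
  import org.apache.lucene.search.suggest.TermFreqArrayIterator;
  import org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester;
  import org.apache.lucene.search.suggest.analyzing.FuzzySuggester;
  import org.apache.lucene.util.Version;

  public class SuggesterRepro {
    public static void main(String[] args) throws Exception {
      Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_45);

      TermFreq[] dict = new TermFreq[] {
          new TermFreq("screen", 10),
          new TermFreq("screensaver", 5)
      };

      // Both suggesters default to EXACT_FIRST | PRESERVE_SEP.
      AnalyzingSuggester analyzing = new AnalyzingSuggester(analyzer);
      analyzing.build(new TermFreqArrayIterator(dict));

      FuzzySuggester fuzzy = new FuzzySuggester(analyzer);
      fuzzy.build(new TermFreqArrayIterator(dict));

      print("AnalyzingSuggester 'screen'", analyzing.lookup("screen", false, 5)); // screen, screensaver
      print("AnalyzingSuggester 'screan'", analyzing.lookup("screan", false, 5)); // <empty>
      print("FuzzySuggester     'screen'", fuzzy.lookup("screen", false, 5));     // screen, screensaver
      print("FuzzySuggester     'screan'", fuzzy.lookup("screan", false, 5));     // screen, screensaver
    }

    // Print the surface forms returned for one lookup call.
    static void print(String label, List<LookupResult> hits) {
      System.out.print(label + ":");
      for (LookupResult hit : hits) {
        System.out.print(" " + hit.key);
      }
      System.out.println();
    }
  }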