Hmm, I'm not sure offhand why that change gives you no results. The fullPrefixPaths should have been a superset of the original prefix paths, since the LevA just adds further paths.
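To make that concrete, here is a tiny standalone check against the core
automaton classes (BasicAutomata/BasicOperations in 4.x lucene-core). It uses
the raw query string instead of the analyzed form and ignores FuzzySuggester's
nonFuzzyPrefix/minFuzzyLength handling, so treat it as a sketch of the idea,
not what the suggester literally builds:

  import org.apache.lucene.util.automaton.Automaton;
  import org.apache.lucene.util.automaton.BasicAutomata;
  import org.apache.lucene.util.automaton.BasicOperations;
  import org.apache.lucene.util.automaton.LevenshteinAutomata;

  public class LevASupersetCheck {
    public static void main(String[] args) {
      // Exact automaton for the (simplified) query vs. the Levenshtein
      // automaton with maxEdits=1 that the fuzzy path intersects with the FST.
      Automaton exact = BasicAutomata.makeString("screan");
      Automaton levA = new LevenshteinAutomata("screan", false).toAutomaton(1);

      // The exact language is contained in the fuzzy one, so intersecting
      // the LevA with the suggest FST can only add prefix paths, never drop them.
      System.out.println(BasicOperations.subsetOf(exact, levA));  // expect: true
      System.out.println(BasicOperations.subsetOf(
          BasicAutomata.makeString("screen"), levA));             // expect: true (one edit away)
    }
  }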
Mike McCandless

http://blog.mikemccandless.com


On Thu, Nov 14, 2013 at 2:43 PM, Christian Reuschling
<christian.reuschl...@gmail.com> wrote:
> I tried it by changing the first prefixPaths initialization to
>
>     List<FSTUtil.Path<Pair<Long,BytesRef>>> prefixPaths =
>         FSTUtil.intersectPrefixPaths(convertAutomaton(lookupAutomaton), fst);
>     prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, fst);
>
> inside AnalyzingSuggester.lookup(..) (I simply copied the line from further below).
>
> Sadly, FuzzySuggester now gives no hits at all, even for a correctly spelled
> query.
>
> Correctly spelled query:
>   prefixPaths size == 1
>   fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, bytesReader) returns null
>   (without getFullPrefixPaths: non-null)
>
> Query within the edit distance - the same:
>   prefixPaths size == 1 (without getFullPrefixPaths: 0)
>   fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, bytesReader) returns null
>
> Query outside the edit distance:
>   prefixPaths size == 0
>
> It seems like the fuzziness is there, but getFullPrefixPaths drops all the
> END_BYTE arcs?
>
>
> On 14.11.2013 17:05, Michael McCandless wrote:
>> On Wed, Nov 13, 2013 at 12:04 PM, Christian Reuschling
>> <christian.reuschl...@gmail.com> wrote:
>>> We started to implement named entity recognition on the basis of
>>> AnalyzingSuggester, which offers great support for synonyms, stopwords,
>>> etc. For this, we slightly modified AnalyzingSuggester.lookup() to return
>>> only the exactFirst hits (considering only the exactFirst code block and
>>> skipping the 'sameSurfaceForm' check and break, so that we get the synonym
>>> hits too).
>>>
>>> This works pretty well, and our next step would be to bring in some
>>> fuzziness against spelling mistakes. For this, the idea was to do exactly
>>> the same, but with FuzzySuggester instead.
>>>
>>> Now we have the problem that 'EXACT_FIRST' in FuzzySuggester not only
>>> relies on sharing the same prefix - different/misspelled terms within the
>>> edit distance are also considered 'not exact', which means we get the same
>>> results as with AnalyzingSuggester.
>>>
>>> query: "screen"   misspelled query: "screan"   dictionary: "screen", "screensaver"
>>>
>>> AnalyzingSuggester hits: screen, screensaver
>>> AnalyzingSuggester hits on misspelled query: <empty>
>>> AnalyzingSuggester EXACT_FIRST hits: screen
>>> AnalyzingSuggester EXACT_FIRST hits on misspelled query: <empty>
>>>
>>> FuzzySuggester hits: screen, screensaver
>>> FuzzySuggester hits on misspelled query: screen, screensaver
>>> FuzzySuggester EXACT_FIRST hits: screen
>>> FuzzySuggester EXACT_FIRST hits on misspelled query: <empty>  => TARGET: screen
>>>
>>> Is there a possibility to distinguish these cases? I see that the 'exact'
>>> criterion relies on an FST aspect ('END_BYTE arc leaving'). Maybe these can
>>> be set differently when building the Levenshtein automata? I have no clue.
>>
>> It seems like the problem is that AnalyzingSuggester checks for exactFirst
>> before calling getFullPrefixPaths (which, in the FuzzySuggester subclass,
>> applies the fuzziness)?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
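For reference, here is a small standalone repro of the stock behavior in the
matrix Christian posted above (the unmodified suggesters, i.e. without the
exactFirst-only change). It is only a sketch against the 4.5-era suggest API:
WhitespaceAnalyzer, Version.LUCENE_45, the default FuzzySuggester settings
(maxEdits=1, nonFuzzyPrefix=1, minFuzzyLength=3) and the
TermFreq/TermFreqArrayIterator classes (renamed to Input/InputArrayIterator in
later releases) are assumptions about the setup, not details from the thread:

  import java.util.List;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
  import org.apache.lucene.search.suggest.Lookup.LookupResult;
  import org.apache.lucene.search.suggest.TermFreq;
  import org.apache.lucene.search.suggest.TermFreqArrayIterator;
  import org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester;
  import org.apache.lucene.search.suggest.analyzing.FuzzySuggester;
  import org.apache.lucene.util.Version;

  public class SuggesterRepro {
    public static void main(String[] args) throws Exception {
      Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_45);

      TermFreq[] dict = new TermFreq[] {
          new TermFreq("screen", 10),
          new TermFreq("screensaver", 5)
      };

      // Both suggesters default to EXACT_FIRST | PRESERVE_SEP.
      AnalyzingSuggester analyzing = new AnalyzingSuggester(analyzer);
      analyzing.build(new TermFreqArrayIterator(dict));

      FuzzySuggester fuzzy = new FuzzySuggester(analyzer);
      fuzzy.build(new TermFreqArrayIterator(dict));

      print("AnalyzingSuggester 'screen'", analyzing.lookup("screen", false, 5)); // screen, screensaver
      print("AnalyzingSuggester 'screan'", analyzing.lookup("screan", false, 5)); // <empty>
      print("FuzzySuggester     'screen'", fuzzy.lookup("screen", false, 5));     // screen, screensaver
      print("FuzzySuggester     'screan'", fuzzy.lookup("screan", false, 5));     // screen, screensaver
    }

    // Print the surface forms returned for one lookup call.
    static void print(String label, List<LookupResult> hits) {
      System.out.print(label + ":");
      for (LookupResult hit : hits) {
        System.out.print(" " + hit.key);
      }
      System.out.println();
    }
  }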