[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

Jim Ferenczi (JIRA) Mon, 04 Jun 2018 12:48:10 -0700


    [ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500766#comment-16500766 ]


Jim Ferenczi commented on LUCENE-8344:
--------------------------------------

{quote}
org.apache.lucene.search.suggest.document.TestPrefixCompletionQuery#testAnalyzerWithSepAndNoPreservePos
 see "test trailing stopword with a new document"
{quote}

If you index with preservePositionIncrements=false, you cannot match a query 
that preserves position increments and contains a stop word. This is 
expected: "baz the" indexed with preservePositionIncrements=false cannot match 
the query "baz the" if the query preserves position increments. However, the 
query "baz" should match whether or not it preserves position increments. This 
is why I said that the completion field (and all the related queries) should be 
fine with this change: it works without reindexing.
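
To make the argument above concrete, here is a rough, self-contained sketch. It 
is not the TokenStreamToAutomaton code: it flattens the analyzed form into a 
plain string and models prefix completion as String.startsWith. The class name, 
the encode method and the trailingBug flag are hypothetical; only the POS_SEP 
and HOLE marker values are meant to mirror the TS2A constants, and the trailing 
posInc of 1 assumes a stop filter dropped "the".

```java
public class TrailingPosIncSketch {
    // Marker characters; assumed to mirror TokenStreamToAutomaton.POS_SEP / HOLE.
    static final char POS_SEP = '\u001f';
    static final char HOLE = '\u001e';

    /**
     * Hypothetical flat encoding of an analyzed token stream.
     * terms/posIncs describe the surviving tokens, endPosInc is the position
     * increment reported after the last token (e.g. 1 for a removed trailing
     * stop word). trailingBug=true simulates the behavior described in this
     * issue: trailing holes are kept even when preservePosInc is false.
     */
    static String encode(String[] terms, int[] posIncs, int endPosInc,
                         boolean preservePosInc, boolean trailingBug) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < terms.length; i++) {
            int posInc = posIncs[i];
            if (!preservePosInc && posInc > 1) {
                posInc = 1; // collapse holes between tokens
            }
            if (i > 0) {
                sb.append(POS_SEP);
            }
            for (int h = 1; h < posInc; h++) {
                sb.append(HOLE).append(POS_SEP); // one hole per skipped position
            }
            sb.append(terms[i]);
        }
        if (preservePosInc || trailingBug) {
            for (int h = 0; h < endPosInc; h++) {
                sb.append(POS_SEP).append(HOLE); // trailing hole(s)
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] baz = {"baz"};
        int[] one = {1};
        // "baz the" indexed with preservePositionIncrements=false, before and after the fix.
        String indexedOld = encode(baz, one, 1, false, true);
        String indexedNew = encode(baz, one, 1, false, false);
        // Query "baz": no trailing stop word, so the preserve flag makes no difference.
        String queryBaz = encode(baz, one, 0, true, false);
        // Query "baz the" with position increments preserved.
        String queryBazThe = encode(baz, one, 1, true, false);

        System.out.println(indexedOld.startsWith(queryBaz));    // old index, query "baz"
        System.out.println(indexedNew.startsWith(queryBaz));    // new index, query "baz"
        System.out.println(indexedNew.startsWith(queryBazThe)); // new index, query "baz the"
    }
}
```

Under these assumptions the query "baz" is a prefix of both the old and the new 
indexed form, which is the "works without reindexing" case, while "baz the" with 
preserved position increments is not a prefix of the fixed indexed form.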

{quote}
org.apache.lucene.search.suggest.analyzing.AnalyzingSuggesterTest#testStandard 
see the "round trip" test
With BUG==true: fails (bad for back-compat)
With BUG==false: passes (therefore a reindex fixes)
{quote}

This one is trickier because it tries to find an exact match first, so the 
indexed version and the query version must be the same; otherwise the 
assertion at line 789 of AnalyzingSuggester fails. We can probably fix the 
discrepancy by adding a BWC layer that removes the trailing POS_SEP from the 
indexed version when sameSurfaceForm is called and preservePosInc is false. 
WDYT? 
This would remove the need to rebuild the FST on a version that contains the 
fix.
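
A minimal sketch of what such a BWC layer could look like, assuming the only 
difference between an old indexed form and a new query form is one or more 
trailing separator bytes. The class and both methods are hypothetical and do 
not mirror the actual AnalyzingSuggester signatures; SEP_LABEL is assumed to be 
0x1f.

```java
import java.util.Arrays;

public class TrailingSepBwcSketch {
    // Assumed separator byte; hypothetical stand-in for the suggester's SEP label.
    static final byte SEP_LABEL = 0x1f;

    /** Returns a copy of the analyzed bytes with any trailing separators removed. */
    static byte[] stripTrailingSep(byte[] analyzed) {
        int end = analyzed.length;
        while (end > 0 && analyzed[end - 1] == SEP_LABEL) {
            end--;
        }
        return Arrays.copyOf(analyzed, end);
    }

    /**
     * Hypothetical exact-match check: when position increments are not
     * preserved, ignore trailing separators left in an FST built before the fix.
     */
    static boolean sameAnalyzedForm(byte[] indexed, byte[] query, boolean preservePosInc) {
        if (!preservePosInc) {
            indexed = stripTrailingSep(indexed);
        }
        return Arrays.equals(indexed, query);
    }

    public static void main(String[] args) {
        byte[] indexedOld = {'b', 'a', 'z', SEP_LABEL}; // built before the fix
        byte[] query = {'b', 'a', 'z'};                 // analyzed after the fix
        System.out.println(sameAnalyzedForm(indexedOld, query, false));
        System.out.println(sameAnalyzedForm(indexedOld, query, true));
    }
}
```

Only the indexed side is normalized here, on the assumption that a query 
analyzed with the fix never carries the trailing separator.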



> TokenStreamToAutomaton doesn't ignore trailing posInc when 
> preservePositionIncrements=false
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8344
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8344
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/suggest
>            Reporter: David Smiley
>            Priority: Major
>         Attachments: LUCENE-8344.patch, LUCENE-8344.patch
>
>
> TokenStreamToAutomaton in Lucene core is used by the AnalyzingSuggester 
> (incl. the FuzzySuggester subclass), the NRT Document Suggester, and soon the 
> SolrTextTagger.  It has a setting {{preservePositionIncrements}} defaulting 
> to true.  If it's set to false (e.g. to ignore stopwords) and if there is a 
> _trailing_ position increment greater than 1, TS2A will _still_ add position 
> increments (holes) into the automaton even though it was configured not to.
> I'm filing this issue separate from LUCENE-8332 where I first found it.  The 
> fix is very simple but I'm concerned about back-compat ramifications so I'm 
> filing it separately.  I'll attach a patch to show the problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

