[ 
https://issues.apache.org/jira/browse/SOLR-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222461#comment-13222461
 ] 

Robert Muir commented on SOLR-2660:
-----------------------------------

I think this could be a good option (in combination with shingles, as
mentioned) to accelerate the phrase queries that Solr query parsers
generate in order to boost closer matches.

Again, the idea is to omit positions entirely and instead use
ShingleFilter (unigrams and bigrams), approximating phrase queries with
n-gram conjunctions. For the sloppy case, I think we should use an n-gram
disjunction, perhaps interpreting the slop factor as minNrShouldMatch?
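
To make the idea concrete, here is a minimal Python sketch of the
approximation, not code from the patch. All names in it (BooleanSketch,
approximate_phrase, matches) are hypothetical, the "_" separator is
borrowed from the foo_bar example in the issue description, and the rule
"required = number of bigrams minus slop" is only one possible reading of
the open question above about mapping slop to minNrShouldMatch.

```python
from typing import List, NamedTuple


class BooleanSketch(NamedTuple):
    """Hypothetical stand-in for a boolean query over shingle terms."""
    terms: List[str]        # shingle terms to look up
    min_should_match: int   # how many of them a document must contain


def bigrams(tokens: List[str], sep: str = "_") -> List[str]:
    """Word bigrams, joined the way the issue's foo_bar example suggests."""
    return [a + sep + b for a, b in zip(tokens, tokens[1:])]


def approximate_phrase(phrase: str, slop: int = 0) -> BooleanSketch:
    """Rewrite a phrase as a bigram conjunction (slop == 0) or as a
    disjunction with a minimum-should-match threshold (slop > 0)."""
    tokens = phrase.lower().split()
    if len(tokens) < 2:
        # Nothing to shingle: fall back to the plain term(s).
        return BooleanSketch(tokens, len(tokens))
    grams = bigrams(tokens)
    # ASSUMPTION: each unit of slop lets one bigram go missing.
    required = max(1, len(grams) - slop)
    return BooleanSketch(grams, required)


def matches(query: BooleanSketch, doc_text: str) -> bool:
    """Check a document using only its unigram/bigram terms, no positions."""
    doc_tokens = doc_text.lower().split()
    indexed = set(doc_tokens) | set(bigrams(doc_tokens))
    hits = sum(1 for term in query.terms if term in indexed)
    return hits >= query.min_should_match
```

With these definitions, approximate_phrase("foo bar baz") requires both
foo_bar and bar_baz, while approximate_phrase("foo bar baz", slop=1)
accepts a document that contains either one.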

This basically means you are replacing Levenshtein distance with an
n-gram approximation in both cases.

In general it's a classic indexing/search tradeoff: in my tests on
Wikipedia, indexing takes roughly twice as long with the shingles, but in
return a lot of these use cases don't need to consult the positions file
at all.

As a parameter to the fieldtype it's easily pluggable without messing
with any queryparsers, and ordinary queries (term, boolean, etc.) are
totally 'pass-thru'. *However*, the thing I don't like about this patch
is that this is really a different 'query intent'. In other words, I
think it's a perfect approach when you just want to boost scores of close
matches (e.g. when generated by the dismax queryparser), but when your
'intent' is to actually limit matches to a phrase (e.g. when keyed in by
a user directly), this approximation isn't as good a fit.
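
The difference in intent is easiest to see with a false positive. The
Python sketch below is illustrative only and is not taken from the patch;
the function names and the sample document are made up. It compares a
positional phrase check with the bigram conjunction on a document that
contains both bigrams but never the full phrase.

```python
def shingle_terms(text, sep="_"):
    """Unigrams plus word bigrams, as a set (no positions kept)."""
    tokens = text.lower().split()
    return set(tokens) | {a + sep + b for a, b in zip(tokens, tokens[1:])}


def bigram_conjunction_matches(phrase, doc, sep="_"):
    """Approximate phrase match: every bigram of the phrase must be indexed."""
    tokens = phrase.lower().split()
    wanted = {a + sep + b for a, b in zip(tokens, tokens[1:])} or set(tokens)
    return wanted <= shingle_terms(doc, sep)


def exact_phrase_matches(phrase, doc):
    """True phrase match: the tokens must appear contiguously, in order."""
    p, d = phrase.lower().split(), doc.lower().split()
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))


# Made-up document: "foo bar" and "bar baz" both occur, "foo bar baz" does not.
doc = "foo bar qux bar baz"
approx = bigram_conjunction_matches("foo bar baz", doc)
exact = exact_phrase_matches("foo bar baz", doc)
```

Here approx is True while exact is False. That is harmless when the
phrase query only boosts scores, but it becomes a wrong hit when the user
meant to restrict results to the phrase.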

Either way, I'm open to other opinions before doing anything (if we
decide to do it, I think the next step is to update the patch with the
SloppyPhraseQuery approximation).

                
> omitPositions improvements
> --------------------------
>
>                 Key: SOLR-2660
>                 URL: https://issues.apache.org/jira/browse/SOLR-2660
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 3.3, 4.0
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: SOLR-2660.patch
>
>
> followup to LUCENE-2048:
> Adds factory methods getPhraseQuery/getMultiPhraseQuery to QP, this way you 
> can subclass it and customize behavior, particularly
> * by default, Solr throws exception here if the fieldtype omits positions: 
> rather than 3.x's silent failure of no results, and even for trunk its nicer 
> to fail during query parsing rather than waiting for lucene's failure during 
> execution.
> * adds phraseAsBoolean, which allows you to downgrade these 
> phrase/multiphrase queries to boolean queries: this is a nice option in 
> conjunction with our word n-gram filters (shingle/commongrams/etc.) for a 
> fast "approximation", if your application can tolerate some false 
> positives, e.g. 
> "foo bar" -> termQuery(foo_bar), "foo bar baz" -> BQ(foo_bar AND bar_baz)
