Re: bi-grams for common terms - any analyzers do that?

Robert Muir Sat, 25 Sep 2010 07:59:13 -0700

On Sat, Sep 25, 2010 at 10:33 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:


> Wow, I never heard of autoGeneratePhraseQueries before. Is there any
> documentation of what it does?
>
> My initial reaction is being confused because this sounds kind of like the
> opposite of the original issue. The original issue is that the query parsers
> are splitting on whitespace _before_ they give tokens to the field
> analyzers.  The query parsers actually do this only with queries that are
> NOT explicit phrase queries.  I wouldn't call this behavior "automatically
> generating phrase queries" exactly, and wouldn't expect that turning off
> "automatic generating of phrase queries" would prevent the pre-tokenization
> by the query parser.  But... it does somehow?
>

this is in reference to Tom's comment on his "l'art" problem
(http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance).

so, there are two problems:
1. that the queryparser "pre-tokenizes" on whitespace at all.
2. that the queryparser forms a phrase query, if the analyzer returns more
than one position back from a "queryparser token" (whitespace).
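To make the two steps concrete, here is a toy model in Python. It is only a sketch of the behavior described above, not Lucene or Solr code: toy_analyzer and toy_parse are hypothetical names, and the analyzer is assumed to lowercase and split on anything that is not an ASCII letter or digit.

```python
import re


def toy_analyzer(text):
    """Hypothetical analyzer: lowercase, then split on any character
    that is not an ASCII letter or digit."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def toy_parse(query, analyzer=toy_analyzer):
    """Mimic the two steps described above for an unquoted query."""
    clauses = []
    # Problem 1: the parser splits on whitespace before the analyzer runs.
    for chunk in query.split():
        tokens = analyzer(chunk)
        if len(tokens) > 1:
            # Problem 2: several positions from one chunk become a phrase.
            clauses.append(("PHRASE", tokens))
        elif tokens:
            clauses.append(("TERM", tokens[0]))
    return clauses


print(toy_parse("l'art moderne"))
# [('PHRASE', ['l', 'art']), ('TERM', 'moderne')]
```

In this model the user never typed quotes, yet "l'art" still ends up as a phrase of "l" followed by "art".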

turning off autoGeneratePhraseQueries only solves problem #2, which matters
because automatically generated phrase queries are not appropriate for many
languages. Before this option existed (e.g. Solr 1.4.x), you had to use
PositionFilter to work around the problem. But PositionFilter simply
"flattens/stacks" the positions (makes it seem as if they are all synonyms).
With PositionFilter you can't have phrase queries at all... and you don't
get a BooleanQuery coordination factor.
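As a rough illustration of what "flattening" means, the sketch below models tokens as (term, position increment) pairs. It is plain Python, not the actual filter: flatten_positions and absolute_positions are hypothetical helpers, and the assumption is that every token after the first gets an increment of 0.

```python
def flatten_positions(tokens, position_increment=0):
    """Keep the first token's increment and force every later token
    to the given increment (0 stacks them on the first position)."""
    return [
        (term, inc if i == 0 else position_increment)
        for i, (term, inc) in enumerate(tokens)
    ]


def absolute_positions(tokens):
    """Turn (term, increment) pairs into (term, position) pairs."""
    result = []
    position = 0
    for term, inc in tokens:
        position += inc
        result.append((term, position))
    return result


tokens = [("l", 1), ("art", 1)]
print(absolute_positions(tokens))                     # [('l', 1), ('art', 2)]
print(absolute_positions(flatten_positions(tokens)))  # [('l', 1), ('art', 1)]
```

Once both terms sit at the same position they look like synonyms of each other, which is why no phrase can be formed from them any more.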

with autoGeneratePhraseQueries=false, you won't get a phrase query unless it
was in double quotes... it's just that simple.
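A minimal Python sketch of that rule follows. It only models the decision, it is not the query parser itself: build_clause and analyze are hypothetical names, and "BOOLEAN" stands for a plain boolean combination of the individual terms.

```python
import re


def analyze(chunk):
    """Hypothetical analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", chunk.lower()) if t]


def build_clause(chunk, auto_generate_phrase_queries, quoted=False):
    """Decide what kind of clause one whitespace-delimited chunk becomes."""
    tokens = analyze(chunk)
    if not tokens:
        return None
    if len(tokens) == 1:
        return ("TERM", tokens[0])
    # More than one position came back from the analyzer.
    if quoted or auto_generate_phrase_queries:
        return ("PHRASE", tokens)
    return ("BOOLEAN", tokens)


print(build_clause("l'art", auto_generate_phrase_queries=True))
# ('PHRASE', ['l', 'art'])
print(build_clause("l'art", auto_generate_phrase_queries=False))
# ('BOOLEAN', ['l', 'art'])
print(build_clause("l'art", auto_generate_phrase_queries=False, quoted=True))
# ('PHRASE', ['l', 'art'])
```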

fixing problem #1 altogether is the way to go, because then the
tokenization would be left to the analyzer completely, and you would have a
lot more flexibility: https://issues.apache.org/jira/browse/LUCENE-2605

-- 
Robert Muir
rcm...@gmail.com
