RE: bi-grams for common terms - any analyzers do that?

Jonathan Rochkind Sat, 25 Sep 2010 17:25:15 -0700

Huh, okay, I didn't know that #2 happened at all. Can you explain or point me 
to documentation to explain when it happens?  I'm afraid I'm having trouble 
understanding <<  if the analyzer returns more than one position back from a 
"queryparser token" (whitespace). >>

Not entirely sure what that means.  Can you give an example?

As much as the query parser pre-tokenization is a problem in many cases (for me 
too), I'm not sure dismax could work without some pre-tokenization. Doesn't it 
need that so it can combine the scores of the individual words by "maximum 
disjunction"? It's got to know what the individual terms are if it's going to 
dismax-combine them, no?

I'm not sure if "the queryparser forms a phrase query without explicit phrase 
quotes" is a problem for me. I had no idea it happened until now, never 
noticed it, and still don't really understand in what circumstances it happens.

Jonathan
________________________________________
From: Robert Muir [rcm...@gmail.com]
Sent: Saturday, September 25, 2010 10:58 AM
To: solr-user@lucene.apache.org
Subject: Re: bi-grams for common terms - any analyzers do that?

On Sat, Sep 25, 2010 at 10:33 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> Wow, I never heard of autoGeneratePhraseQueries before. Is there any
> documentation of what it does?
>
> My initial reaction is being confused because this sounds kind of like the
> opposite of the original issue. The original issue is that the query parsers
> are splitting on whitespace _before_ they give tokens to the field
> analyzers.  The query parsers actually do this only with queries that are
> NOT explicit phrase queries.  I wouldn't call this behavior "automatically
> generating phrase queries" exactly, and wouldn't expect that turning off
> "automatic generating of phrase queries" would prevent the pre-tokenization
> by the query parser.  But... it does somehow?
>

this is in reference to Tom's comment on his "l'art" problem (
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance
 ).

so, there are two problems:
1. that the queryparser "pre-tokenizes" on whitespace at all.
2. that the queryparser forms a phrase query, if the analyzer returns more
than one position back from a "queryparser token" (whitespace).

turning off autoGeneratePhraseQueries only solves problem #2 (automatic phrase
generation is not appropriate for many languages). Before this option (e.g. Solr
1.4.x), you had to use the PositionFilter to work around this problem. But
PositionFilter simply "flattens/stacks" the positions (makes it seem as if
they are all synonyms). With PositionFilter you couldn't have phrase queries
at all... and you don't get a BooleanQuery coordination factor.

with autoGeneratePhraseQueries=false, you won't get a phrase query unless it
was in double quotes... it's just that simple.
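
to make the two problems concrete, here is a rough toy model in Python. It is
not Solr's or Lucene's actual code: the names analyze and parse are made up for
illustration, the "analyzer" is assumed to simply lowercase and split on
non-alphanumeric characters, and explicit double-quoted phrases are not modeled
at all. It only sketches the order of operations described above.

```python
import re


def analyze(chunk):
    """Hypothetical analyzer (an assumption, not a real Solr analyzer):
    lowercase, then split on anything that is not a letter or digit,
    so "l'art" comes back as two positions, "l" and "art"."""
    return [t for t in re.split(r"[^0-9a-z]+", chunk.lower()) if t]


def parse(query, auto_generate_phrase_queries=True):
    """Toy model of the query parser behaviour discussed in this thread."""
    clauses = []
    # Problem #1: the parser splits on whitespace before the analyzer runs.
    for chunk in query.split():
        terms = analyze(chunk)
        if len(terms) == 1:
            clauses.append(("term", terms[0]))
        elif len(terms) > 1:
            if auto_generate_phrase_queries:
                # Problem #2: several positions from one whitespace chunk
                # become a phrase query, even without double quotes.
                clauses.append(("phrase", tuple(terms)))
            else:
                # Option off: plain boolean clauses over the terms instead.
                clauses.append(("bool", tuple(terms)))
    return clauses


print(parse("l'art histoire"))
print(parse("l'art histoire", auto_generate_phrase_queries=False))
```

in this model the first call treats "l'art" as a phrase of "l" followed by
"art", and the second call treats the same chunk as separate boolean terms. The
whitespace split in the loop happens either way, which is why the option does
nothing for problem #1.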

fixing problem #1 altogether is the way to go. Because then the
tokenization would be left to the analyzer completely, and you would have a
lot more flexibility: https://issues.apache.org/jira/browse/LUCENE-2605

--
Robert Muir
rcm...@gmail.com
