Robert - does the effect on scoring also show up for English and other
European languages? Or is it mostly for ngram-based languages, and especially
CJK?

I want to stress that not all ngram-based languages are affected by this
behavior, especially those for which we use ngrams only because we lack a
good tokenizer.

That's why I'm not sure the default should be changed and I'm all for a
getter/setter. If however it turns out the default MUST be changed, then I
support the Version + getter/setter approach.
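
A minimal sketch of what "Version + getter/setter" could look like, in plain
Java with no Lucene dependency. Everything here is a toy model and an
assumption made for illustration: the class name, the (major, minor) version
check, the accessor names, and the dash-splitting stand-in for an analyzer are
all hypothetical, and explicit double-quote syntax is not modeled at all.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these names are Lucene's actual API.
class PhraseHeuristicSketch {

    // When true, a whitespace-delimited chunk that analyzes to more than
    // one term is turned into a phrase (the term-count heuristic).
    private boolean phraseOnMultipleTerms;

    // Hypothetical version switch: anything before 3.1 keeps the old default.
    PhraseHeuristicSketch(int major, int minor) {
        this.phraseOnMultipleTerms = major < 3 || (major == 3 && minor < 1);
    }

    // Hypothetical getter/setter that overrides the version-based default.
    boolean getPhraseOnMultipleTerms() {
        return phraseOnMultipleTerms;
    }

    void setPhraseOnMultipleTerms(boolean value) {
        this.phraseOnMultipleTerms = value;
    }

    // Stand-in analyzer: lowercases and splits a chunk on dashes.
    static List<String> analyze(String chunk) {
        List<String> terms = new ArrayList<>();
        for (String t : chunk.toLowerCase().split("-")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Splits the query on whitespace first, then analyzes each chunk.
    // Phrases are rendered in double quotes in the returned string.
    String parse(String query) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            List<String> terms = analyze(chunk);
            if (phraseOnMultipleTerms && terms.size() > 1) {
                clauses.add("\"" + String.join(" ", terms) + "\"");
            } else {
                clauses.addAll(terms);
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // Old default: the dash split silently becomes a phrase.
        System.out.println(new PhraseHeuristicSketch(3, 0).parse("wi-fi router"));
        // New default: separate terms unless the setter turns it back on.
        System.out.println(new PhraseHeuristicSketch(3, 1).parse("wi-fi router"));
    }
}
```

With this shape, users who rely on the old behavior can restore it with one
setter call after upgrading, and users on an older version can opt out of it
the same way.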

Shai

On Sun, May 23, 2010 at 6:00 PM, Uwe Schindler (JIRA) <j...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870410#action_12870410]
>
> Uwe Schindler commented on LUCENE-2458:
> ---------------------------------------
>
> Hi Robert,
>
> I also agree with Mark (as you know). We can have both:
> - Version for a good default (3.1 will get the new non-phrase-query
> behavior)
> - A separate getter/setter for this option
> (set/getCreatePhraseQueryOnConcenattedTerms or whatever)
>
> This would give you the best of both worlds.
>
> > queryparser shouldn't generate phrasequeries based on term count
> > ----------------------------------------------------------------
> >
> >                 Key: LUCENE-2458
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-2458
> >             Project: Lucene - Java
> >          Issue Type: Bug
> >          Components: QueryParser
> >            Reporter: Robert Muir
> >            Assignee: Robert Muir
> >            Priority: Blocker
> >             Fix For: 3.1, 4.0
> >
> >         Attachments: LUCENE-2458.patch, LUCENE-2458.patch
> >
> >
> > The current method in the queryparser to generate phrasequeries is wrong:
> > The Query Syntax documentation
> > (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> > {noformat}
> > A Phrase is a group of words surrounded by double quotes such as "hello dolly".
> > {noformat}
> > But as we know, this isn't actually true.
> > Instead the terms are first divided on whitespace, then the analyzer term
> > count is used as some sort of "heuristic" to determine if it's a phrase
> > query or not.
> > This assumption is a disaster for languages that don't use whitespace
> > separation: CJK, compounding European languages like German, Finnish, etc.
> > It also makes it difficult for people to use n-gram analysis techniques. In
> > these cases you get bad relevance (MAP improves nearly *10x* if you use a
> > PositionFilter at query-time to "turn this off" for Chinese).
> > Even for English, this undocumented behavior is bad. Perhaps in some
> > cases it's being abused as some heuristic to "second guess" the tokenizer
> > and piece back things it shouldn't have split, but for large collections,
> > doing things like generating phrasequeries because StandardTokenizer split
> > a compound on a dash can cause serious performance problems. Instead people
> > should analyze their text with the appropriate methods, and QueryParser
> > should only generate phrase queries when the syntax asks for one.
> > The PositionFilter in contrib can be seen as a workaround, but it's pretty
> > obscure and people are not familiar with it. The result is we have bad
> > out-of-box behavior for many languages, and bad performance for others on
> > some inputs.
> > I propose instead that we change the grammar to actually look for double
> > quotes to determine when to generate a phrase query, consistent with the
> > documentation.
>
