[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

Mark Miller (JIRA) Wed, 26 May 2010 14:40:03 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871961#action_12871961
 ]


Mark Miller commented on LUCENE-2458:
-------------------------------------

{quote}
How about making the setting ("if analyzer returns more than 1 token for a
single chunk of whitespace-separated text, make a PhraseQuery")
configurable (instead of hardwired according to Version)? And defaulting it
to off for Version >= 31 (so CJK, etc., work out of the box)?
{quote}

I think it's pretty clear this would make most people happy.

Personally, I'm somewhat on board with Robert that this may really hamstring us 
when it comes to further fixes that are needed/wanted in the future.

To note though - I think in general, most who have commented on this issue are 
into making CJK work out of the box. But I really think we need to nail down 
more consensus on this first.

At a minimum, I think making the behavior configurable, while defaulting to the 
CJK-friendly behavior, has pretty much everyone on board.

But I'd really like to discuss whether doing that will only lead to losing that 
option as we do things like stop qp from splitting on whitespace in the 
future...

Something I was thinking, and it might be more of a maintenance headache than 
it's worth: we could demote this queryparser from the core query parser, 
rename it something like ClassicQueryParser (or whatever), and make a new 
QueryParser that is better for more languages across the board (originally 
basing it on the classic parser, e.g. this patch to start). People that like 
the older, more English-biased QueryParser can still use it, and new users 
will likely pick up the default QueryParser that works better with more 
languages out of the box.

Just an idea.

In any event - I think this patch is a step forward too - but it looks to me 
like there are still open concerns and objections.

> queryparser makes all CJK queries phrase queries regardless of analyzer
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-2458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2458
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2458.patch, LUCENE-2458.patch, LUCENE-2458.patch
>
>
> The queryparser automatically makes *ALL* CJK, Thai, Lao, Myanmar, Tibetan, 
> ... queries into phrase queries, even though you didn't ask for one, and 
> there isn't a way to turn this off.
> This completely breaks Lucene for these languages, as it treats all queries 
> like 'grep'.
> Example: if you query for f:abcd with StandardAnalyzer, where a,b,c,d are 
> Chinese characters, you get a phrasequery of "a b c d". If you use the CJK 
> analyzer, it's no better: it's a phrasequery of "ab bc cd". And if you use 
> the smartchinese analyzer, you get a phrasequery like "ab cd". But the user 
> didn't ask for one, and they cannot turn it off.
> The reason is that the code to form phrase queries is not internationally 
> appropriate and assumes whitespace tokenization. If more than one token comes 
> out of whitespace-delimited text, it's automatically a phrase query no matter 
> what.
> The proposed patch fixes the core queryparser (with all backwards compat 
> kept) to only form phrase queries when the double quote operator is used. 
> Implementing subclasses can always extend the QP and auto-generate whatever 
> kind of queries they want that might completely break search for languages 
> they don't care about, but core general-purpose QPs should be language 
> independent.
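
The auto-phrase behavior described in the issue, and the configurable switch
proposed in the quoted comment, can be sketched in plain Java. This is a toy
model under stated assumptions, not Lucene code: the class AutoPhraseSketch,
the methods analyze and buildQuery, and the autoGeneratePhrase flag are all
hypothetical names, and the one-token-per-character "analyzer" only loosely
imitates what the issue says StandardAnalyzer does with CJK text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a query parser; none of these names are Lucene APIs.
class AutoPhraseSketch {

    // Toy analyzer (assumption): emits one token per character of the chunk.
    static List<String> analyze(String chunk) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < chunk.length(); i++) {
            tokens.add(String.valueOf(chunk.charAt(i)));
        }
        return tokens;
    }

    // Builds a printable query for one whitespace-delimited chunk.
    // quoted: the user wrote the chunk inside double quotes.
    // autoGeneratePhrase: the proposed switch; true mimics the old behavior.
    static String buildQuery(String field, String chunk,
                             boolean quoted, boolean autoGeneratePhrase) {
        List<String> tokens = analyze(chunk);
        if (tokens.isEmpty()) {
            return "";
        }
        if (tokens.size() == 1) {
            return field + ":" + tokens.get(0);
        }
        if (quoted || autoGeneratePhrase) {
            // More than one token: emit a phrase query.
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }
        // Otherwise emit independent term clauses.
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(field).append(':').append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Old behavior: an unquoted chunk still becomes a phrase query.
        System.out.println(buildQuery("f", "abcd", false, true));   // f:"a b c d"
        // Proposed default: phrase only when the user asks for one.
        System.out.println(buildQuery("f", "abcd", false, false));  // f:a f:b f:c f:d
        System.out.println(buildQuery("f", "abcd", true, false));   // f:"a b c d"
    }
}
```

With the flag off, a phrase query is produced only when the double quote
operator is used, which is the behavior the patch description asks for; with
the flag on, the sketch reproduces the hardwired behavior the issue objects to.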

