[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891945#action_12891945
 ] 

Uwe Schindler commented on LUCENE-2458:
---

QP now has a public final void setAutoGeneratePhraseQueries(boolean value). The 
default value comes from the Version parameter, but you can easily change it, 
so people like me (who often work with this kind of product number but never 
with CJK text) can easily use this behaviour. Lucene's problem is that it does 
not take position into account for scoring, so documents where the tokens 
appear next to each other do not score higher (in contrast to Google, which 
supports those combined tokens).
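
A minimal sketch of how such a switch could behave, for illustration only. 
Only the setter name setAutoGeneratePhraseQueries comes from the comment 
above; ToyVersion, ToyQueryParser, toQuery and the rule that derives the 
default from the version are assumptions, not Lucene's actual code.

```java
import java.util.List;

// Hypothetical stand-in for a Lucene Version constant; not the real enum.
enum ToyVersion { V24, V31 }

// Toy parser: only the setter name comes from the thread, the rest is assumed.
class ToyQueryParser {
    private boolean autoGeneratePhraseQueries;

    ToyQueryParser(ToyVersion matchVersion) {
        // Assumed rule: the old version keeps the old auto-phrase behaviour.
        this.autoGeneratePhraseQueries = (matchVersion == ToyVersion.V24);
    }

    public final void setAutoGeneratePhraseQueries(boolean value) {
        this.autoGeneratePhraseQueries = value;
    }

    // tokens: what the analyzer produced for one whitespace-delimited chunk.
    // quoted: whether the user wrote the chunk inside double quotes.
    String toQuery(String field, List<String> tokens, boolean quoted) {
        if (tokens.isEmpty()) {
            return "";
        }
        if (tokens.size() == 1) {
            return field + ":" + tokens.get(0);
        }
        if (quoted || autoGeneratePhraseQueries) {
            // Tokens must appear next to each other, in order.
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }
        // Otherwise each token becomes an independent term clause.
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(field).append(':').append(t);
        }
        return sb.toString();
    }
}
```

With a hypothetical product number split into the tokens pb and 1200, a 
parser built with ToyVersion.V31 yields f:pb f:1200, and after calling 
setAutoGeneratePhraseQueries(true) it yields f:"pb 1200".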

 queryparser makes all CJK queries phrase queries regardless of analyzer
 ---

 Key: LUCENE-2458
 URL: https://issues.apache.org/jira/browse/LUCENE-2458
 Project: Lucene - Java
 Issue Type: Bug
 Components: QueryParser
 Reporter: Robert Muir
 Assignee: Robert Muir
 Priority: Blocker
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2458.patch, LUCENE-2458.patch, LUCENE-2458.patch, 
 LUCENE-2458.patch


 The queryparser automatically makes *ALL* CJK, Thai, Lao, Myanmar, Tibetan, 
 ... queries into phrase queries, even though you didn't ask for one, and 
 there isn't a way to turn this off.
 This completely breaks Lucene for these languages, as it treats all queries 
 like 'grep'.
 Example: if you query for f:abcd with StandardAnalyzer, where a, b, c, d are 
 Chinese characters, you get a PhraseQuery of "a b c d". If you use the CJK 
 analyzer, it's no better: it's a PhraseQuery of "ab bc cd". And if you use the 
 SmartChinese analyzer, you get a PhraseQuery like "ab cd". But the user 
 didn't ask for one, and they cannot turn it off.
 The reason is that the code to form phrase queries is not internationally 
 appropriate and assumes whitespace tokenization. If more than one token comes 
 out of whitespace-delimited text, it's automatically a phrase query no matter 
 what.
 The proposed patch fixes the core queryparser (with all backwards compat 
 kept) to only form phrase queries when the double quote operator is used. 
 Implementing subclasses can always extend the QP and auto-generate whatever 
 kind of queries they want that might completely break search for languages 
 they don't care about, but core general-purpose QPs should be language 
 independent.
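
 The f:abcd example above can be reproduced with a small self-contained 
 sketch. It does not use Lucene: unigrams and bigrams are hypothetical 
 helpers that only imitate the token shapes the description reports for 
 StandardAnalyzer and the CJK analyzer, and the Latin letters a, b, c, d 
 stand in for Chinese characters as they do in the description.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizationSketch {
    // One token per character (per UTF-16 char, which is enough for the
    // placeholder letters used here).
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(text.substring(i, i + 1));
        }
        return out;
    }

    // Overlapping two-character tokens.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    // What the old behaviour builds whenever one chunk yields several tokens.
    static String phrase(String field, List<String> tokens) {
        return field + ":\"" + String.join(" ", tokens) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(phrase("f", unigrams("abcd"))); // f:"a b c d"
        System.out.println(phrase("f", bigrams("abcd")));  // f:"ab bc cd"
    }
}
```

 In both cases the user typed a single unquoted chunk, yet the resulting 
 query only matches documents in which the tokens occur next to each other 
 and in that order.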

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891969#action_12891969
 ] 

Robert Muir commented on LUCENE-2458:
-

bq. Lucene's problem is that it does not take position into account for 
scoring, so documents where the tokens appear next to each other do not score 
higher (in contrast to Google, which supports those combined tokens).

Actually no: the problem is that this auto-generation never really worked 
anyway. It was broken from the beginning, e.g. 
http://www.lucidimagination.com/search/document/bacf34995067e3cb/worddelimiterfilter_and_phrase_queries





[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-24 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891985#action_12891985
 ] 

Yonik Seeley commented on LUCENE-2458:
--

I've reverted just the change of default behavior to Solr's QP.
There are too many negative side-effects to change this given the way Solr is 
currently used (and documented to behave).
We need to work on (at a minimum) a per-field config for Solr, but it seems 
like per-token is still the right way long term.




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891990#action_12891990
 ] 

Robert Muir commented on LUCENE-2458:
-

Please revert.




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891993#action_12891993
 ] 

Robert Muir commented on LUCENE-2458:
-

The patch doesn't change Solr's default; instead it causes SolrQueryParser to 
respect the version parameter in the solrconfig in *both* ctors.
Before, one ctor used the version specified and the other hardcoded LUCENE_24.

As I said before, this shouldn't and cannot be per-token, and such 
English-centric hacks do not belong in the analysis API.

Separately, I think what Koji is doing on SOLR-2015 is the way to go, not 
hardcoding LUCENE_24 as a version despite what is in the config.

I don't think this English-centric hack should be the default for Solr. The 
patch *does* completely respect old schemas and is completely backwards 
compatible: if you have no version in your schema, it will be LUCENE_24 and 
you get the old behavior; if you made your own queryparser and subclassed the 
old API, you get the old behavior; and it respects the version set in the 
solrconfig rather than overriding it to 2.4 in just one ctor.
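
The constructor behaviour described above could be sketched roughly as 
follows. Every name here (ToyLuceneVersion, ToyConfig, ToyRequest, 
ToySolrQueryParser) is hypothetical and the real SolrQueryParser signatures 
are not reproduced; the sketch only shows both constructors resolving the 
version through one path, with LUCENE_24 as the fallback when no version is 
configured.

```java
// Hypothetical stand-in for Lucene's Version constants.
enum ToyLuceneVersion { LUCENE_24, LUCENE_31 }

// Hypothetical config holder; null means no version was declared.
class ToyConfig {
    final ToyLuceneVersion luceneMatchVersion;

    ToyConfig(ToyLuceneVersion luceneMatchVersion) {
        this.luceneMatchVersion = luceneMatchVersion;
    }
}

// Hypothetical request wrapper carrying the same config.
class ToyRequest {
    final ToyConfig config;

    ToyRequest(ToyConfig config) {
        this.config = config;
    }
}

class ToySolrQueryParser {
    final String defaultField;
    final ToyLuceneVersion matchVersion;

    ToySolrQueryParser(ToyConfig config, String defaultField) {
        this.defaultField = defaultField;
        this.matchVersion = resolve(config);
    }

    // Delegates instead of hardcoding a version, so both ctors agree.
    ToySolrQueryParser(ToyRequest request, String defaultField) {
        this(request.config, defaultField);
    }

    // No configured version falls back to the old behaviour.
    static ToyLuceneVersion resolve(ToyConfig config) {
        return config.luceneMatchVersion != null
                ? config.luceneMatchVersion
                : ToyLuceneVersion.LUCENE_24;
    }

    // Assumed rule: only the old version auto-generates phrase queries.
    boolean autoGeneratePhraseQueries() {
        return matchVersion == ToyLuceneVersion.LUCENE_24;
    }
}
```

Under these assumptions a config without a version keeps the old 
auto-generated phrase queries, while a config that declares the newer 
version gets the new default through either constructor.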




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-24 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891998#action_12891998
 ] 

Yonik Seeley commented on LUCENE-2458:
--

bq.  Before, one ctor used the version specified, the other hardcoded LUCENE_24.

Ah and that constructor is the one that's used everywhere in Solr (leading 
me to believe that leaving Solr's default alone was deliberate).

bq. As I said before, this shouldn't and cannot be per-token, and such 
English-centric hacks do not belong in the analysis API.

The ability of a filter to say "this token is actually indexed as two adjacent 
tokens" is fundamental and not related to any specific language.
It can be *used* for language-specific hacks perhaps... but it is not a hack 
itself.

I never mentioned issues of back compat, but of changes to Solr's default 
behavior, which I continue to think is the best.
I think the best way forward is to add a CJK field to solr that defaults to the 
opposite behavior (i.e. treats split tokens as completely separate).




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892001#action_12892001
 ] 

Robert Muir commented on LUCENE-2458:
-

bq. I think the best way forward is to add a CJK field to solr that defaults to 
the opposite behavior (i.e. treats split tokens as completely separate).

I think this is completely wrong (besides, far more than CJK is affected).

You should consider a European field instead.

Furthermore, you should check whether the field is set to index with omitTF, 
and not autogenerate in that case either.
In trunk I think PhraseQuery will actually throw an exception in this case 
instead of silently failing, 
so the autogenerated queries can be very dangerous even for English.
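
A guard along these lines could look like the sketch below. The class and 
parameter names are hypothetical, and how a parser would actually learn that 
a field omits positions is left out.

```java
class PhraseGuardSketch {
    // Auto-generate a phrase query only when the feature is enabled, the
    // field keeps position data, and the chunk produced more than one token.
    static boolean mayAutoGeneratePhrase(boolean autoGenerateEnabled,
                                         boolean fieldOmitsPositions,
                                         int tokenCount) {
        return autoGenerateEnabled && !fieldOmitsPositions && tokenCount > 1;
    }
}
```

A phrase query needs positions to test adjacency, so a field indexed 
without them should never receive an auto-generated phrase, whatever the 
language.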





[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892002#action_12892002
 ] 

Robert Muir commented on LUCENE-2458:
-

Please stop committing all these wrong changes.

Now I have to go revert 2 more commits.




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-24 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892003#action_12892003
 ] 

Yonik Seeley commented on LUCENE-2458:
--

bq. Furthermore, you should check if its set to index with omitTF and not 
autogenerate in that case either.

Solr doesn't currently allow omitting TF for text fields.
It's good to keep in mind if we ever enable that though.




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-24 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892005#action_12892005
 ] 

Yonik Seeley commented on LUCENE-2458:
--

Robert, it was your commit that changed the default behavior of Solr, and I 
disagree with that change.
Technically, I could VETO - but I don't believe I have ever done a code-change 
veto, and I don't want to start now.
Instead, I'll try and be constructive by going to work on SOLR-2015 so we can 
at least configure it per-field.




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892006#action_12892006
 ] 

Uwe Schindler commented on LUCENE-2458:
---

It's a new major version! Even 2 steps major 1.5 - 3.1 and three steps to 4.0




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891916#action_12891916
 ] 

Robert Muir commented on LUCENE-2458:
-

bq. Perhaps we should switch the SolrQueryParser back to using 
version==LUCENE_24 (or LUCENE_29 would work too)?

I don't think we should do this. The whole *point* of this issue was that this 
auto-generation is a bad default, e.g. *every* Thai query is a phrase query.
I agree with Koji's idea of adding a config hook for autoGeneratePhraseQueries 
for those that want it, though I don't think it should be on by default 
either.





[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-07-23 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891917#action_12891917
 ] 

Koji Sekiguchi commented on LUCENE-2458:


bq. I agree with Koji's idea of adding a config hook for 
autoGeneratePhraseQueries for those who want it, but I don't think it 
should be on by default either.

Thanks. I'll open a ticket for it.





[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-06-29 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883531#action_12883531
 ] 

Robert Muir commented on LUCENE-2458:
-

ok, will commit this in a few days.





[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-05-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12873030#action_12873030
 ] 

Yonik Seeley commented on LUCENE-2458:
--

{quote}
True, but I thought there was something about dealing with this via subclassing 
you didn't like?
With the current patch (with no option at all) you could do this per-field 
behavior with subclassing already:
{quote}

True... I'm fine with subclassing. I guess the only difference is whether the 
default is configurable or set only via Version.




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-05-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12873034#action_12873034
 ] 

Yonik Seeley commented on LUCENE-2458:
--

bq. In this case, too, you could override the default with subclassing.

True - but I think some were saying the default should be configurable w/o 
subclassing.

bq. but we can add an explicit boolean option for those that don't subclass?

Right.  I think everyone is essentially saying the same thing at this point (at 
the high level)?
Make it configurable (per-parser), and allow the user to handle per-field 
variations via subclassing.





[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-05-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12873035#action_12873035
 ] 

Robert Muir commented on LUCENE-2458:
-

OK, I revised the idea here:

In 3.1 we supply a patch that looks like this one, except that there is a 
simple boolean toggle too. 
If you want per-field behavior or more explicit customization, you can subclass.
The simple toggle is for non-subclassers.

In 4.0 we do the same thing, for now.

We open a separate issue where we replace the QP with something better (one 
that does not split on whitespace at all and allows multi-word synonyms, 
n-gram tokenization, Vietnamese, etc. to work).

As part of that issue, take the existing one (with per-field toggle) and call it 
ClassicQueryParser or whatever.
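
A minimal sketch of the two customization levels discussed here, assuming nothing about the real patch: a parser-wide boolean toggle for non-subclassers, plus an overridable hook for per-field behavior. The setter name `setAutoGeneratePhraseQueries` mirrors the one mentioned in this thread; the classes `ToggleParserSketch` and `PerFieldParserSketch`, the hook `autoPhraseFor`, and the string-based `fieldQuery` are hypothetical stand-ins, not Lucene's QueryParser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a query parser; not Lucene's API.
class ToggleParserSketch {
    // Off by default in this sketch, so multi-token chunks are not forced into phrases.
    private boolean autoGeneratePhraseQueries = false;

    // Simple parser-wide toggle for users who do not subclass.
    final void setAutoGeneratePhraseQueries(boolean value) {
        this.autoGeneratePhraseQueries = value;
    }

    // Hook (hypothetical): subclasses may decide per field.
    protected boolean autoPhraseFor(String field) {
        return autoGeneratePhraseQueries;
    }

    // Renders the query for the analyzed tokens of one whitespace-delimited chunk.
    String fieldQuery(String field, List<String> tokens) {
        if (tokens.size() == 1) {
            return field + ":" + tokens.get(0);
        }
        if (autoPhraseFor(field)) {
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }
        List<String> clauses = new ArrayList<>();
        for (String t : tokens) {
            clauses.add(field + ":" + t);
        }
        return String.join(" ", clauses);
    }
}

// Per-field variation via subclassing: only the listed fields auto-generate phrases.
class PerFieldParserSketch extends ToggleParserSketch {
    private final Set<String> phraseFields;

    PerFieldParserSketch(Set<String> phraseFields) {
        this.phraseFields = phraseFields;
    }

    @Override
    protected boolean autoPhraseFor(String field) {
        return phraseFields.contains(field);
    }
}
```

In this sketch a caller who only needs one global setting uses the setter, while someone who wants phrases for, say, a product-number field but not for body text overrides the hook instead.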





[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-05-26 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871574#action_12871574
 ] 

Mark Miller commented on LUCENE-2458:
-

For all the debate around this change, that was a pretty fast commit IMO ...




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-05-26 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871581#action_12871581
 ] 

Mark Miller commented on LUCENE-2458:
-

I know there was more discussion on this in IRC, but I don't see consensus in 
the issue. I also don't see the issues brought up having been addressed or 
worked out.

I've got to -1 this commit. I even think I may be convinced that making this an 
option will make future improvements we may want too difficult - but nothing 
has been hammered out in this JIRA issue. It looks like those that have brought 
up various points have just been ignored.

-1.




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-05-26 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871582#action_12871582
 ] 

Uwe Schindler commented on LUCENE-2458:
---

Revert! Revert! Revert!

By the way, matchVersion should be final. I would also like to have a separate 
setter for the auto-phrase functionality. That should easily be possible!




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-05-26 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871625#action_12871625
 ] 

Koji Sekiguchi commented on LUCENE-2458:


+1 to revert. Though I am a latecomer (as always :( ) and I have only just read 
the updated Description, the example behavior of QueryParser for CJK (abcd -> ab 
bc cd) looks correct to me, and I'm using QP with CJK as is.




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-05-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871642#action_12871642
 ] 

Robert Muir commented on LUCENE-2458:
-

bq. I've got to -1 this commit.

As mentioned on apache's website:
{code}
To prevent vetos from being used capriciously, they must be accompanied by a 
technical justification showing why the change is bad (opens a security 
exposure, negatively affects performance, etc.). A veto without a justification 
is invalid and has no weight.
{code}

No one has been able to provide any technical justifications, only subjective 
opinions.

When standard test collections were used, it was shown that this behavior 
significantly hurts CJK and delivers only 10% of what standard IR techniques 
deliver (not generating phrases but using boolean word/bigram queries). See 
Ivan's results above. This isn't surprising since CJK IR has been pretty well 
studied; there is nothing new here.

At the same time, when English test collections were used, there was no 
difference; if anything, the change tended to slightly improve relevance for 
English, too.

Why do we even bother trying to start an openrelevance project if people do not 
want to go with the scientific method but prefer subjective opinion?






[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-05-26 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871737#action_12871737
 ] 

Mark Miller commented on LUCENE-2458:
-

I think we should continue working out what's best here.

I don't think it's wise to try to bully through contentious issues. There 
should be consensus before something happens here. Barring that, some kind of 
vote makes sense IMO.







[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-05-26 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871738#action_12871738
 ] 

Yonik Seeley commented on LUCENE-2458:
--

Let's remember that the bug is "queryparser makes all CJK queries phrase 
queries regardless of analyzer".
The ability of an analysis chain to create phrase queries in conjunction with 
the query parser is a feature.
The obvious way to reconcile these opposing statements is to make it 
configurable.
Per-parser is a bare minimum... per-field would be better... and per-token 
would be best.

I won't repeat my previous comments in this thread - however those arguments 
are still valid.




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-05-26 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871961#action_12871961
 ] 

Mark Miller commented on LUCENE-2458:
-

{quote}
How about making the setting (if the analyzer returns more than 1 token for a
single chunk of whitespace-separated text, make a PhraseQuery)
configurable (instead of hardwired according to Version)? And defaulting it
to off for Version >= 31 (so CJK, etc., work out of the box)?
{quote}

I think it's pretty clear this would make most people happy.
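
The quoted proposal (a default derived from Version that can still be overridden) can be sketched roughly as follows. This is only an illustration under assumptions: `VersionDefaultSketch` and its nested `Version` enum are hypothetical stand-ins, and the cutoff at 3.1 follows the "Fix For: 3.1" field and the quoted suggestion rather than any committed code.

```java
// Hypothetical sketch; the class and the enum are stand-ins, not Lucene types.
final class VersionDefaultSketch {

    // Declared in release order so that compareTo reflects "older than".
    enum Version { LUCENE_30, LUCENE_31 }

    private boolean autoGeneratePhraseQueries;

    VersionDefaultSketch(Version matchVersion) {
        // Back-compat default: versions before 3.1 keep the old auto-phrase behavior,
        // while 3.1 and later default to off so CJK works out of the box.
        this.autoGeneratePhraseQueries = matchVersion.compareTo(Version.LUCENE_31) < 0;
    }

    // Explicit override, independent of the version-derived default.
    void setAutoGeneratePhraseQueries(boolean value) {
        this.autoGeneratePhraseQueries = value;
    }

    boolean getAutoGeneratePhraseQueries() {
        return autoGeneratePhraseQueries;
    }
}
```

The point of the design is that the version only picks the starting value; a caller who disagrees with it changes the flag without subclassing.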

Personally, I'm somewhat on board with Robert that this may really hamstring us 
when it comes to further fixes that are needed/wanted in the future.

To note though - I think in general, most who have commented on this issue are 
into making CJK work out of the box. But I really think we need to nail down 
more consensus on this first.

At a minimum, I think making the behavior configurable, while defaulting to CJK 
'betterness', has pretty much everyone on board.

But I'd really like to discuss whether doing that will only lead to losing that 
option as we do things like stop qp from splitting on whitespace in the 
future...

Something I was thinking, and it might be more of a maintenance headache than 
it's worth, but we could demote this queryparser from the core query parser, 
rename it something like ClassicQueryParser (or whatever), and make a new 
QueryParser that is better for more languages across the board (originally 
basing it on the classic parser, e.g. this patch to start). People that like 
the older, more English-biased QueryParser can still use it, and new users 
will likely pick up the default QueryParser that works better with more 
languages out of the box.

Just an idea.

In any event - I think this patch is a step forward too - but it looks to me 
like there are still open concerns and objections.
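
For readers following the thread, the configurable setting quoted at the top of this comment could be sketched roughly as below. This is only an illustration under stated assumptions: PhraseToggleSketch and toQuery are hypothetical names, the class is not Lucene's QueryParser, and the query is rendered as a plain string rather than real Query objects. Only the setter name setAutoGeneratePhraseQueries is taken from the discussion on this issue.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical stand-in for a query parser, NOT Lucene's implementation.
final class PhraseToggleSketch {

    // When true, several tokens from one whitespace-separated chunk become
    // a phrase even without double quotes (the old, hardwired behavior).
    private boolean autoGeneratePhraseQueries;

    PhraseToggleSketch(boolean autoGeneratePhraseQueries) {
        this.autoGeneratePhraseQueries = autoGeneratePhraseQueries;
    }

    void setAutoGeneratePhraseQueries(boolean value) {
        this.autoGeneratePhraseQueries = value;
    }

    // tokens: what the analyzer returned for one whitespace-separated chunk.
    // quoted: whether the user wrote the chunk inside double quotes.
    String toQuery(String field, List<String> tokens, boolean quoted) {
        if (tokens.isEmpty()) {
            return "";
        }
        if (tokens.size() == 1) {
            return field + ":" + tokens.get(0);
        }
        if (quoted || autoGeneratePhraseQueries) {
            // Phrase query: tokens must be adjacent and in order.
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }
        // Otherwise emit one term clause per token.
        StringJoiner clauses = new StringJoiner(" ");
        for (String token : tokens) {
            clauses.add(field + ":" + token);
        }
        return clauses.toString();
    }
}
```

With the flag off, the tokens ab, bc, cd for field f come out as separate term clauses unless the user quoted the text; with the flag on, they always come out as one phrase. Which default a given Version should select is exactly the open question in this thread, so the sketch leaves it to the caller.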




[jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer

2010-05-25 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871501#action_12871501
 ] 

Robert Muir commented on LUCENE-2458:
-

This patch fixes the bug in all queryparsers. I plan to commit soon.

If desired, someone can make their own euro-centric queryparser in the contrib 
section and I have no objection, as long as it's clearly documented that it's 
unsuitable for many languages (just like the JDK does).
