Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On Sun, May 23, 2010 at 2:39 PM, Uwe Schindler wrote:
> In 3.1 we don't remove QP from core, so let's fix it there. In 4.0 we can
> possibly have a totally new QP with no backwards compatibility at all, so
> no problem. As Earwin noted in another reply: his comment is the way QP
> should work, but this is hard to do with analyzers in front (I would have
> some ideas). But then we must do it without crappy javacc or jflex and with
> Analyzer only + some self-coded stuff only.

I don't want to "give up" on fixing this second whitespace problem until 4.0, mostly because it prevents Lucene from having proper n-gram search where n-grams are word-spanning (this is the classic case that is well understood and gives good results for lots of European languages).

So this is why I was against this option: I want to fix this later too, and not make it impossible.

--
Robert Muir
rcm...@gmail.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
> On Sun, May 23, 2010 at 1:00 PM, Uwe Schindler wrote:
> > I just want to make the feature accessible and documented without
> > Version.
>
> I think it is just a bug (a shoddy implementation that does not use the
> syntax, whether it was quoted or not, since this has been thrown away). In
> this implementation no one thought about languages that don't use
> whitespace and that it would make all queries into phrasequeries.

I am happy with your API changes that add the additional param "quoted", but based on that we should also have the Boolean switch to preserve the old behavior for text that is not CJK or another non-whitespace language. We could also add a method to TokenStream/Analyzer called "isTokenizingOnWhitespace" (don't take this seriously!).

> I really do not think this sort of code belongs inside core lucene, if you
> want to make uninternationalized code in your own code base that is not
> correct that is fine.

The issue does not affect me at all. I (and also Shai) use our own query parsers that do exactly that (and my code is using mixed language, preferably English-only with some foreign fragments) and no CJK and other stuff. So here you are correct. I only see thousands of issues and bug reports, because in the Western European world users build phrases the way I said (and that is what was behind the code you don't like).

> Furthermore by preserving this kind of bug it makes the queryparser more
> complicated, and especially in the future. If at some point in the future
> you want to really have the QP not split on whitespace (as you yourself
> said on the issue you want) to enable support for multi-word synonyms and
> "real" n-grams at querytime, I hope you understand this buggy code
> conflicts and complicates this later goal.

In 3.1 we don't remove QP from core, so let's fix it there. In 4.0 we can possibly have a totally new QP with no backwards compatibility at all, so no problem.

As Earwin noted in another reply: his comment is the way QP should work, but this is hard to do with analyzers in front (I would have some ideas). But then we must do it without crappy javacc or jflex, with Analyzer only plus some self-coded stuff.

Let's drink some beer and think about it (too bad that I am out of Czech now). Possibly at Berlin! :-)

Uwe
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
+1, this is what the patch does. I agree I did a crappy job explaining the issue.

On Sun, May 23, 2010 at 2:25 PM, Shai Erera wrote:
> So ... after a long IRC chat on this, I think this has just been worded
> incorrectly (the issue). As I understand, there are two issues here:
> 1) QP loses the phrase info for fields -- the queries f:"abcd" and f:abcd
>    are parsed the same, or handled the same. There is no way for the one
>    extending QP to tell if quotes were used.
> 2) QP has a default impl for f:abcd which is not international-friendly.
>
> I agree (1) should be fixed, and I apologize if I missed that previously.
> Version is the right way to go with this.
>
> About (2), I think that if f:abcd is submitted, then a PQ should not be
> created. The user hasn't asked for it. But if f:"abcd" was submitted, then
> it is ok to create a PQ by default. And we're only talking about defaults
> here. Anyone should be able to extend QP and override the relevant
> getFieldQuery variant and do whatever he wants.
>
> If the question is what should be the default behavior for (2), then I
> think pending Version, it should create a PQ for f:"abcd" only. And we
> leave it to the extender to determine what should be his right behavior.
>
> Shai

--
Robert Muir
rcm...@gmail.com
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
So ... after a long IRC chat on this, I think this has just been worded incorrectly (the issue). As I understand, there are two issues here:
1) QP loses the phrase info for fields -- the queries f:"abcd" and f:abcd are parsed the same, or handled the same. There is no way for the one extending QP to tell if quotes were used.
2) QP has a default impl for f:abcd which is not international-friendly.

I agree (1) should be fixed, and I apologize if I missed that previously. Version is the right way to go with this.

About (2), I think that if f:abcd is submitted, then a PQ should not be created. The user hasn't asked for it. But if f:"abcd" was submitted, then it is ok to create a PQ by default. And we're only talking about defaults here. Anyone should be able to extend QP and override the relevant getFieldQuery variant and do whatever he wants.

If the question is what should be the default behavior for (2), then I think pending Version, it should create a PQ for f:"abcd" only. And we leave it to the extender to determine what should be his right behavior.

Shai

On Sun, May 23, 2010 at 9:09 PM, Robert Muir wrote:
> On Sun, May 23, 2010 at 1:00 PM, Uwe Schindler wrote:
> > I just want to make the feature accessible and documented without
> > Version.
>
> I think it is just a bug (a shoddy implementation that does not use
> the syntax, whether it was quoted or not, since this has been thrown
> away). In this implementation no one thought about languages that
> don't use whitespace and that it would make all queries into
> phrasequeries.
>
> I really do not think this sort of code belongs inside core lucene, if
> you want to make uninternationalized code in your own code base that
> is not correct that is fine.
>
> Furthermore by preserving this kind of bug it makes the queryparser
> more complicated, and especially in the future. If at some point in
> the future you want to really have the QP not split on whitespace (as
> you yourself said on the issue you want) to enable support for
> multi-word synonyms and "real" n-grams at querytime, I hope you
> understand this buggy code conflicts and complicates this later goal.
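
A minimal sketch of points (1) and (2) from Shai's message, written as a toy model in Python rather than against Lucene's real API. Every name here (`ToyQueryParser`, `get_field_query`, the `quoted` argument, the stand-in analyzer) is an assumption made for illustration. It shows a parser that remembers whether the user typed quotes and hands that flag to an overridable hook, so the default only builds a phrase for f:"abcd" while a subclass can opt back into the term-count behavior.

```python
import re

# Toy model only: these names are hypothetical, not Lucene's API.
CLAUSE = re.compile(r'(\w+):(?:"([^"]*)"|(\S+))')


class ToyQueryParser:
    def parse(self, query):
        """Split 'field:text' clauses, remembering whether text was quoted."""
        clauses = []
        for match in CLAUSE.finditer(query):
            field, quoted_text, bare_text = match.groups()
            quoted = quoted_text is not None
            text = quoted_text if quoted else bare_text
            clauses.append(self.get_field_query(field, text, quoted))
        return clauses

    def analyze(self, text):
        # Stand-in analyzer: split on anything that is not a letter or digit.
        return [t.lower() for t in re.findall(r'[^\W_]+', text)]

    def get_field_query(self, field, text, quoted):
        # Default: a phrase only when the user actually asked for one.
        terms = self.analyze(text)
        if quoted and len(terms) > 1:
            return ("phrase", field, terms)
        return ("terms", field, terms)


class LegacyParser(ToyQueryParser):
    """Subclass that opts back into the term-count behavior."""

    def get_field_query(self, field, text, quoted):
        terms = self.analyze(text)
        kind = "phrase" if len(terms) > 1 else "terms"
        return (kind, field, terms)
```

With this shape the default is a policy decision in one method, and anyone who prefers the old behavior overrides that method instead of patching the parser.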
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On Sun, May 23, 2010 at 1:00 PM, Uwe Schindler wrote:
> I just want to make the feature accessible and documented without Version.

I think it is just a bug (a shoddy implementation that does not use the syntax, whether it was quoted or not, since this has been thrown away). In this implementation no one thought about languages that don't use whitespace and that it would make all queries into phrasequeries.

I really do not think this sort of code belongs inside core lucene, if you want to make uninternationalized code in your own code base that is not correct that is fine.

Furthermore by preserving this kind of bug it makes the queryparser more complicated, and especially in the future. If at some point in the future you want to really have the QP not split on whitespace (as you yourself said on the issue you want) to enable support for multi-word synonyms and "real" n-grams at querytime, I hope you understand this buggy code conflicts and complicates this later goal.

--
Robert Muir
rcm...@gmail.com
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
> The QP should work like that:
> (1) It parses the query, creating fragments
> (2) It does some out-of-the-box handling of those fragments
>
> People should be able to override that handling of fragments. But people
> should not touch (1).

In fact QP should work like that:
(1) Tokenizer parses the query as if it was a string of text. Care must be taken to preserve query language operators, as this stage essentially replaces the current QP's lexer stage.
(2) QP's syntax parser kicks in, identifies operators (those that Tokenizer didn't treat as a part of word tokens) and does overridable out-of-the-box handling for them and the tokens around them.

The point is, it's hard to do correctly. That's why Lucene resorts to an upside-down approach.
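
The two stages Earwin describes could look roughly like the sketch below. It is a toy in Python, not Lucene code: the operator set, the regular expression and the function names are all assumptions, and a real tokenizer would be an Analyzer with far more careful rules about which characters are operators. The point it illustrates is only the ordering: the tokenizer sees the whole query string first and leaves operators intact, and the syntax stage then interprets those operators.

```python
import re

# Hypothetical sketch: the operator set and the tiny grammar are assumptions.
OPERATORS = {'"', '+', '(', ')'}
LEXEME = re.compile(r'[^\W_]+|["+()]')


def tokenize(query):
    """Stage 1: one pass over the whole query; operators survive as tokens."""
    return [("OP" if tok in OPERATORS else "WORD", tok)
            for tok in LEXEME.findall(query)]


def parse(tokens):
    """Stage 2: words between double quotes become a phrase, others are terms."""
    out, phrase = [], None
    for kind, tok in tokens:
        if kind == "OP" and tok == '"':
            if phrase is None:
                phrase = []                      # opening quote
            else:
                out.append(("phrase", phrase))   # closing quote
                phrase = None
        elif kind == "WORD":
            if phrase is None:
                out.append(("term", tok))
            else:
                phrase.append(tok)
        # other operators are ignored in this sketch
    if phrase is not None:
        # Unbalanced quote: fall back to plain terms.
        out.extend(("term", tok) for tok in phrase)
    return out
```

Even in this toy the difficulty shows up quickly: a hyphen, for example, would have to be classified either as part of a word or as an operator before the syntax stage ever runs.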
RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Yes I understand that. But because of this it is still not a bug, it is a "feature" (and also implemented like that) to build phrase queries without quotes, e.g. by simply appending words with ASCII hyphens (for most European analyzers). And exactly to preserve this behavior, let's simply switch it on/off using a getter/setter. That's all I want, really.

I know you are right, and I still want to drink beer with you in Berlin and not be killed :-)

I just want to make the feature accessible and documented without Version. The idea behind Version would be contradicted. Also the "feature" would go in 4.0. That's all, and I hope that you understand my argument.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Sunday, May 23, 2010 6:43 PM
> To: dev@lucene.apache.org
> Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't
> generate phrasequeries based on term count
>
> These comments lead me to believe you don't understand the issue.
>
> Do you understand that *ALL* CJK queries are made into phrase queries,
> regardless of tokenizer?!!?!?!
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Sorry, hit the send button too early.

The QP should work like that:
(1) It parses the query, creating fragments
(2) It does some out-of-the-box handling of those fragments

People should be able to override that handling of fragments. But people should not touch (1).

And so if we keep thinking of QP that way, then we'll have THE QP for (1), because there can be only one, and a "we think this is best for most people" handling of fragments (2).

So for (1), QP would create the fragment field + value, and it'll be your choice how to interpret that. QP will provide a default handling for it (today's behavior is fine w/ me).

We already do this today in other, equally important places, such as IndexWriter. It has all sorts of configs and has "what we think is good for most" defaults. People can override them, and that's it.

And I think I do understand the issue, Robert, even though I speak fewer languages than you do. It is all about tokenization. My QP does the parsing I've mentioned - it breaks the query into fragments and then handles them. There is a default behavior which people can override. And that behavior is all about tokenization.

Shai

On Sun, May 23, 2010 at 7:47 PM, Shai Erera wrote:
> Robert - I hope hitting the keyboard hard makes you happy :)
>
> I do get the issue. And I still think that CJK queries are just a small
> percentage of all queries that are used in the world today. Or at least by
> Lucene. And I'm not sure why we want to change the default for ALL OTHER
> LANGUAGES, just so that CJK QUERIES will hAVe A diFFEreNT BeHAvioR
> !!?!?!?!?!?!??!?!?!
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
It's not just CJK queries, it's in general any language not separated on whitespace. There are a lot of other languages that don't use whitespace the same way English does.

On Sun, May 23, 2010 at 12:47 PM, Shai Erera wrote:
> Robert - I hope hitting the keyboard hard makes you happy :)
>
> I do get the issue. And I still think that CJK queries are just a small
> percentage of all queries that are used in the world today. Or at least by
> Lucene. And I'm not sure why we want to change the default for ALL OTHER
> LANGUAGES, just so that CJK QUERIES will hAVe A diFFEreNT BeHAvioR
> !!?!?!?!?!?!??!?!?!
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Robert - I hope hitting the keyboard hard makes you happy :)

I do get the issue. And I still think that CJK queries are just a small percentage of all queries that are used in the world today. Or at least by Lucene. And I'm not sure why we want to change the default for ALL OTHER LANGUAGES, just so that CJK QUERIES will hAVe A diFFEreNT BeHAvioR !!?!?!?!?!?!??!?!?!

On Sun, May 23, 2010 at 7:42 PM, Robert Muir wrote:
> These comments lead me to believe you don't understand the issue.
>
> Do you understand that *ALL* CJK queries are made into phrase queries,
> regardless of tokenizer?!!?!?!
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On Sun, May 23, 2010 at 12:34 PM, Shai Erera wrote:
> I want to stress that not all ngram-based languages are affected by this
> behavior, especially those for which we do ngram just because of a lack of
> good tokenizer.

They are also affected! Do you understand how the queryparser treats whitespace?

You cannot currently use "normal" word-spanning n-grams with Lucene because of this:
1) you can only use word-internal n-grams, because each whitespace-separated word gets its own tokenstream
2) all queries here are also made into phrasequeries automatically, which is stupid as n-grams already contain the 'positional information'

--
Robert Muir
rcm...@gmail.com
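
To make the distinction in point 1) concrete, here is a small Python sketch of character n-grams computed two ways. It is only an illustration under simple assumptions: the function names are made up, and the underscore used to mark a word boundary is an arbitrary choice, not something Lucene does. Splitting on whitespace first can only ever yield word-internal grams, while n-gramming the whole string also yields grams that cross the boundary.

```python
def char_ngrams(text, n):
    """All overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_internal_ngrams(query, n):
    """What is possible when each whitespace-separated word is analyzed alone."""
    grams = []
    for word in query.split():
        grams.extend(char_ngrams(word, n))
    return grams


def word_spanning_ngrams(query, n):
    """n-grams over the whole query; '_' marks a word boundary (an assumption)."""
    return char_ngrams(query.replace(" ", "_"), n)
```

For the query `the cat` with n = 3, the first function produces only `the` and `cat`, while the second also produces `he_`, `e_c` and `_ca`, which is the information a word-spanning n-gram index would match on.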
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
These comments lead me to believe you don't understand the issue.

Do you understand that *ALL* CJK queries are made into phrase queries, regardless of tokenizer?!!?!?!

On Sun, May 23, 2010 at 12:38 PM, Uwe Schindler wrote:
> Same here, as already noted in the issue.
>
> Uwe
>
> From: Shai Erera [mailto:ser...@gmail.com]
> Sent: Sunday, May 23, 2010 6:34 PM
> To: dev@lucene.apache.org
> Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate
> phrasequeries based on term count
>
> Robert - is the effect on scoring also on English and other European
> languages? Or is it mostly for ngram-based languages, and especially CJK?
>
> I want to stress that not all ngram-based languages are affected by this
> behavior, especially those for which we do ngram just because of a lack of
> good tokenizer.
>
> That's why I'm not sure the default should be changed and I'm all for a
> getter/setter. If however it turns out the default MUST be changed, then I
> support the Version + getter/setter approach.
>
> Shai
>
> On Sun, May 23, 2010 at 6:00 PM, Uwe Schindler (JIRA) wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870410#action_12870410 ]
>
> Uwe Schindler commented on LUCENE-2458:
>
> Hi Robert,
>
> I also agree with Mark (as you know). We can have both:
> - Version for a good default (3.1 will get the new non-phrase-query
>   behavior)
> - A separate getsetter for this option
>   (set/getCreatePhraseQueryOnConcenattedTerms or whatever)
>
> This would give you the best from both worlds.
>
>> queryparser shouldn't generate phrasequeries based on term count
>>
>> Key: LUCENE-2458
>> URL: https://issues.apache.org/jira/browse/LUCENE-2458
>> Project: Lucene - Java
>> Issue Type: Bug
>> Components: QueryParser
>> Reporter: Robert Muir
>> Assignee: Robert Muir
>> Priority: Blocker
>> Fix For: 3.1, 4.0
>> Attachments: LUCENE-2458.patch, LUCENE-2458.patch
>>
>> The current method in the queryparser to generate phrasequeries is wrong:
>> The Query Syntax documentation
>> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
>> {noformat}
>> A Phrase is a group of words surrounded by double quotes such as "hello
>> dolly".
>> {noformat}
>> But as we know, this isn't actually true.
>> Instead the terms are first divided on whitespace, then the analyzer term
>> count is used as some sort of "heuristic" to determine if its a phrase
>> query or not.
>> This assumption is a disaster for languages that don't use whitespace
>> separation: CJK, compounding European languages like German, Finnish, etc.
>> It also makes it difficult for people to use n-gram analysis techniques.
>> In these cases you get bad relevance (MAP improves nearly *10x* if you use
>> a PositionFilter at query-time to "turn this off" for chinese).
>> For even english, this undocumented behavior is bad. Perhaps in some cases
>> its being abused as some heuristic to "second guess" the tokenizer and
>> piece back things it shouldn't have split, but for large collections,
>> doing things like generating phrasequeries because StandardTokenizer split
>> a compound on a dash can cause serious performance problems. Instead
>> people should analyze their text with the appropriate methods, and
>> QueryParser should only generate phrase queries when the syntax asks for
>> one.
>> The PositionFilter in contrib can be seen as a workaround, but its pretty
>> obscure and people are not familiar with it. The result is we have bad
>> out-of-box behavior for many languages, and bad performance for others on
>> some inputs.
>> I propose instead that we change the grammar to actually look for double
>> quotes to determine when to generate a phrase query, consistent with the
>> documentation.

--
Robert Muir
rcm...@gmail.com
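
As a rough illustration of the behavior the issue description complains about, the Python sketch below compares the term-count heuristic with the proposed quote-based rule. It is a toy under stated assumptions: the two analyzers are crude stand-ins (one token per character for CJK-style text, splitting on non-alphanumerics for Western text), and the function names and tuple results are invented for this example rather than taken from Lucene.

```python
import re


def unigram_analyzer(text):
    """Stand-in for an analyzer that emits one token per character."""
    return [ch for ch in text if not ch.isspace()]


def word_analyzer(text):
    """Stand-in for an analyzer that splits on non-alphanumeric characters."""
    return [t.lower() for t in re.findall(r"[^\W_]+", text)]


def term_count_query(chunk, analyze):
    """Heuristic as described: more than one token from a chunk means a phrase."""
    terms = analyze(chunk)
    return ("phrase" if len(terms) > 1 else "terms", terms)


def syntax_query(chunk, analyze, quoted):
    """Proposed rule: a phrase only when the user typed double quotes."""
    terms = analyze(chunk)
    return ("phrase" if quoted and len(terms) > 1 else "terms", terms)
```

Under the heuristic, any unquoted run of characters that the analyzer splits, whether a two-character Chinese word or a hyphenated compound, silently becomes a phrase; under the quote-based rule the same input stays a set of terms unless quotes were used.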
RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Same here, as already noted in the issue.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

From: Shai Erera [mailto:ser...@gmail.com]
Sent: Sunday, May 23, 2010 6:34 PM
To: dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Robert - is the effect on scoring also on English and other European languages? Or is it mostly for ngram-based languages, and especially CJK?

I want to stress that not all ngram-based languages are affected by this behavior, especially those for which we do ngram just because of a lack of a good tokenizer. That's why I'm not sure the default should be changed and I'm all for a getter/setter. If however it turns out the default MUST be changed, then I support the Version + getter/setter approach.

Shai

On Sun, May 23, 2010 at 6:00 PM, Uwe Schindler (JIRA) wrote:

[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870410#action_12870410 ]

Uwe Schindler commented on LUCENE-2458:
---

Hi Robert,

I also agree with Mark (as you know). We can have both:
- Version for a good default (3.1 will get the new non-phrase-query behavior)
- A separate getter/setter for this option (set/getCreatePhraseQueryOnConcenattedTerms or whatever)

This would give you the best from both worlds.
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Robert - is the effect on scoring also on English and other European languages? Or is it mostly for ngram-based languages, and especially CJK?

I want to stress that not all ngram-based languages are affected by this behavior, especially those for which we do ngram just because of a lack of a good tokenizer. That's why I'm not sure the default should be changed and I'm all for a getter/setter. If however it turns out the default MUST be changed, then I support the Version + getter/setter approach.

Shai

On Sun, May 23, 2010 at 6:00 PM, Uwe Schindler (JIRA) wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870410#action_12870410 ]
>
> Uwe Schindler commented on LUCENE-2458:
> ---
>
> Hi Robert,
>
> I also agree with Mark (as you know). We can have both:
> - Version for a good default (3.1 will get the new non-phrase-query behavior)
> - A separate getter/setter for this option (set/getCreatePhraseQueryOnConcenattedTerms or whatever)
>
> This would give you the best from both worlds.
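As a rough illustration of the "Version + getter/setter" compromise discussed above, the following Python sketch shows a version-dependent default that an explicit setter can override. The class and method names are hypothetical and are not taken from Lucene or from the attached patches; the version is modelled as a plain tuple.

```python
class ToyQueryParser:
    """Hypothetical parser showing 'Version default + explicit setter'.

    Not a Lucene class; all names here are assumptions for illustration.
    """

    def __init__(self, match_version):
        self.match_version = match_version
        # Older versions keep the term-count behaviour by default;
        # from 3.1 on, phrase queries come from quote syntax only.
        self._auto_generate_phrase_queries = match_version < (3, 1)

    def get_auto_generate_phrase_queries(self):
        return self._auto_generate_phrase_queries

    def set_auto_generate_phrase_queries(self, enabled):
        # An explicit call always wins over the version-based default.
        self._auto_generate_phrase_queries = bool(enabled)


old_default = ToyQueryParser((3, 0))
new_default = ToyQueryParser((3, 1))
new_default_overridden = ToyQueryParser((3, 1))
new_default_overridden.set_auto_generate_phrase_queries(True)
```

The point of the design is that users who pass an old version see no change, users on the new version get the corrected default, and anyone can still opt in or out explicitly.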
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Yes, in this issue there are a lot of unsubstantiated claims that this bug helps with broken English tokenization. I simply ask you to measure your claim... the scoring is already broken in this case!

On May 23, 2010 9:42 AM, "Mark Miller" wrote:

> Obnoxiousness has certainly been in the air regarding this issue, I'll give you that.
>
> On Sunday, May 23, 2010, Robert Muir wrote:
>> I can't tell if you are being obno...
>
> --
> - Mark
> http://www.lucidimagination.com
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Obnoxiousness has certainly been in the air regarding this issue, I'll give you that.

On Sunday, May 23, 2010, Robert Muir wrote:
> I can't tell if you are being obnoxious or seriously believe what you say.
> You understand that cjkanalyzer is broke with this? You understand that
> ngrams themselves capture information about position and it even works nicely
> with scoring, and helps.
>
> This hack doesn't help english. If you think otherwise, be a man and show
> real results
>
> On May 23, 2010 6:39 AM, "Shai Erera (JIRA)" wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issue...

--
- Mark
http://www.lucidimagination.com
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
I can't tell if you are being obnoxious or seriously believe what you say. You understand that CJKAnalyzer is broken with this? You understand that n-grams themselves capture information about position, and that this even works nicely with scoring and helps?

This hack doesn't help English. If you think otherwise, be a man and show real results.

On May 23, 2010 6:39 AM, "Shai Erera (JIRA)" wrote:

[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issue...
RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Again, this is not a hack, and that was exactly my point. As I said:

> resolving this is very simple, by just applying a correct logic
> (ignore double-quotes followed by a char) which isn't enforced today
> and once it will be, it won't cause any cases of unexpected behavior.

Ignoring a mid-word double-quote, rather than tokenizing on it when it is not followed by whitespace, is just as valid for English queries as it is for Hebrew.

Itamar.

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Thursday, May 13, 2010 3:24 AM
To: dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Internationalization doesn't work by just piling hacks for language X, language Y, and language Z on top of each other.

Just like I want the English hack removed, I strongly recommend against adding any Hebrew hack.

--
Robert Muir
rcm...@gmail.com
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Internationalization doesn't work by just piling hacks for language X, language Y, and language Z on top of each other.

Just like I want the English hack removed, I strongly recommend against adding any Hebrew hack.

On Wed, May 12, 2010 at 6:55 PM, Itamar Syn-Hershko wrote:
> I think we understand each other perfectly well. I still think resolving
> this is very simple, by just applying a correct logic (ignore double-quotes
> followed by a char) which isn't enforced today and once it will be, it won't
> cause any cases of unexpected behavior. This isn't an analysis related task,
> and I'm not sure what makes you insist so much. I will be opening a
> dedicated JIRA ticket for this discussion if this won't become part of the
> current one.
>
> Itamar.

--
Robert Muir
rcm...@gmail.com
RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
I think we understand each other perfectly well. I still think resolving this is very simple, by just applying a correct logic (ignore double-quotes followed by a char) which isn't enforced today and once it will be, it won't cause any cases of unexpected behavior. This isn't an analysis related task, and I'm not sure what makes you insist so much. I will be opening a dedicated JIRA ticket for this discussion if this won't become part of the current one.

Itamar.

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Thursday, May 13, 2010 1:42 AM
To: dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

On Wed, May 12, 2010 at 6:30 PM, Itamar Syn-Hershko wrote:
> Never did I request the QP to do Analysis. I simply mentioned this bug
> - what this definitely is -

It's definitely not a bug for Hebrew: there is a Unicode character for gershayim (U+05F4), so technically this should be used according to Unicode.

It's arguably your responsibility to convert your data to Unicode before passing it through Lucene, and that includes disambiguating when a double quote should be gershayim.

--
Robert Muir
rcm...@gmail.com
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On Wed, May 12, 2010 at 6:30 PM, Itamar Syn-Hershko wrote:
> Never did I request the QP to do Analysis. I simply mentioned this bug -
> what this definitely is -

It's definitely not a bug for Hebrew: there is a Unicode character for gershayim (U+05F4), so technically this should be used according to Unicode.

It's arguably your responsibility to convert your data to Unicode before passing it through Lucene, and that includes disambiguating when a double quote should be gershayim.

--
Robert Muir
rcm...@gmail.com
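The application-side conversion Robert describes could look roughly like the Python sketch below. It assumes, as a simplification that is not stated in the thread, that an ASCII double quote standing directly between two Hebrew letters (U+05D0 to U+05EA) is meant as gershayim; real input may need a more careful rule.

```python
import re

HEBREW_LETTERS = "\u05D0-\u05EA"   # alef through tav, including final forms
GERSHAYIM = "\u05F4"               # HEBREW PUNCTUATION GERSHAYIM

# An ASCII double quote with a Hebrew letter on each side.
_MIDWORD_QUOTE = re.compile(f'(?<=[{HEBREW_LETTERS}])"(?=[{HEBREW_LETTERS}])')


def normalize_gershayim(text):
    """Replace mid-word ASCII quotes in Hebrew words with U+05F4.

    Quotes at word boundaries, and quotes in non-Hebrew text, are left
    alone so they can still delimit phrases.
    """
    return _MIDWORD_QUOTE.sub(GERSHAYIM, text)


# The acronym from the thread (MNK"L), spelled with Unicode escapes.
acronym_ascii = "\u05DE\u05E0\u05DB\"\u05DC"
acronym_normalized = normalize_gershayim(acronym_ascii)
```

After this step the query string no longer contains an unbalanced ASCII quote, so the acronym does not interfere with phrase syntax.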
RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Never did I request the QP to do Analysis. I simply mentioned this bug - what this definitely is - so you could tackle it while you're at it. This is definitely relevant to a discussion about re-making how the QP determines what is a legit PhraseQuery and what is not.

The fix is quite easy I believe - just make sure you don't identify a double-quote as a trigger for starting or ending a phrase unless it is followed by a white-space (or another non-char). An English query like 'Foo"bar"' (with no enclosing quotes...) is invalid anyway (although it is not handled as such at the moment).

I cannot handle this on the application side, simply because there the double-quote char is NOT a special character. As I mentioned, for Hebrew it is part of the word, pretty much like Niqqud is. If the user has entered a textual query with an acronym, there's no point in me parsing it once just to escape what I suspect are acronyms and then sending it to the core QP, or just creating the queries myself.

All this is valid in light of my second paragraph in this message - the fix is easy and also correct for the basic, non-Hebrew, implementation.

Itamar.

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Wednesday, May 12, 2010 4:25 PM
To: dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

On Wed, May 12, 2010 at 6:05 AM, Itamar Syn-Hershko wrote:
> The QueryParser also fails to correctly parse Hebrew acronyms;
> although not being an integral part of the current discussion, I
> thought this would be the best place to bring that up.

Just as I don't think Analysis should do QueryParsing, I don't think QueryParsing should do Analysis either.

Similar problems to this exist in other languages (I have to escape : for some, because lucene wants to interpret it as a field name). But this can be easily remedied on the application side: it's documented and understood that the double-quote is a special character, and there is an escape mechanism so you can escape the ones you think are acronyms.

This issue is about a buggy implementation: it's not documented and only internal to how the queryparser determines what is a phrase query or not (and, contrary to what you would believe from the documentation, the choice of whether or not to make a PhraseQuery is not based on syntax one bit!)

--
Robert Muir
rcm...@gmail.com
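To make the rule Itamar proposes concrete, here is a minimal Python sketch of a splitter in which a double quote only closes a phrase when it is followed by whitespace or the end of input. This is not how QueryParser works; the function name is made up, and the decision that a quote only opens a phrase at the start of a chunk is an added assumption that his message does not spell out.

```python
def split_with_boundary_quotes(query):
    """Split a query into ('phrase', text) and ('word', text) items.

    A double quote opens a phrase only at the start of a chunk and closes
    one only when followed by whitespace or end of input. A quote in the
    middle of a word stays part of that word.
    """
    items = []
    i, n = 0, len(query)
    while i < n:
        if query[i].isspace():
            i += 1
            continue
        if query[i] == '"':
            # Look for a closing quote that sits at a word boundary.
            j = i + 1
            while j < n and not (
                query[j] == '"' and (j + 1 == n or query[j + 1].isspace())
            ):
                j += 1
            if j < n:
                items.append(("phrase", query[i + 1:j]))
                i = j + 1
                continue
            # No valid closing quote: fall through and treat as a word.
        j = i
        while j < n and not query[j].isspace():
            j += 1
        items.append(("word", query[i:j]))
        i = j
    return items
```

With this rule, `MNK"L` is a single word and `"hello dolly"` is still a phrase. Whether such a rule belongs in the parser at all is exactly what the rest of the thread disputes.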
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On 5/12/10 11:24 AM, Robert Muir wrote:
> On Wed, May 12, 2010 at 11:16 AM, Mark Miller wrote:
>> Thats a major exaggeration - quoting text plays a large role in whether
>> or not you will get a phrase query.
>
> No, it has nothing to do with it in the implementation. It only "escapes
> the whitespace", but is discarded. This is clear from looking at the
> grammar.
>
> The logic then to determine if you get a phrase query is the huge mess of
> code in getFieldQuery, but its not based on the double quotes at all.
>
> For example a list of chinese or thai words gets a phrase query, only
> because they don't use whitespace between words. But a similar list of
> english words gets a boolean query.

Quotes play a part, or quoting something would simply not create a phrase query - quoting something ensures that it hits the analyzer as one chunk, rather than getting meta parsed by the grammar and fed to the analyzer a token at a time. This ensures that multiple tokens hit the funky logic to create a phrase query. The grammar specifically looks for quoted chunks.

--
- Mark
http://www.lucidimagination.com
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On Wed, May 12, 2010 at 11:16 AM, Mark Miller wrote:
>
> Thats a major exaggeration - quoting text plays a large role in whether or
> not you will get a phrase query.
>

No, it has nothing to do with it in the implementation. It only "escapes the whitespace", but is discarded. This is clear from looking at the grammar.

The logic that then determines whether you get a phrase query is the huge mess of code in getFieldQuery, but it's not based on the double quotes at all.

For example, a list of Chinese or Thai words gets a phrase query, only because they don't use whitespace between words. But a similar list of English words gets a boolean query.

--
Robert Muir
rcm...@gmail.com
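The behaviour described above can be imitated in a few lines of Python. This is only a toy model under stated assumptions: `toy_analyze` is a made-up stand-in for a real analyzer (one token per CJK character, other text split on non-word characters), and the function below is not the code in getFieldQuery.

```python
import re

# One token per CJK ideograph, or a run of non-CJK word characters.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+")


def toy_analyze(chunk):
    """Hypothetical analyzer: returns lowercased tokens for one chunk."""
    return [token.lower() for token in _TOKEN.findall(chunk)]


def term_count_heuristic(query):
    """Split on whitespace, then let the analyzer's term count decide.

    A chunk that analyzes to more than one term becomes a phrase clause,
    regardless of whether the user typed any quotes.
    """
    clauses = []
    for chunk in query.split():
        terms = toy_analyze(chunk)
        if len(terms) > 1:
            clauses.append(("phrase", terms))
        elif terms:
            clauses.append(("term", terms[0]))
    return clauses


english = term_count_heuristic("hello dolly")
chinese = term_count_heuristic("\u4e2d\u6587\u641c\u7d22")
dashed = term_count_heuristic("wi-fi router")
```

In this model the two English words come out as two separate term clauses, the unspaced Chinese string becomes a single four-term phrase, and the dashed compound becomes a phrase purely because the analyzer split it.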
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On 5/12/10 9:25 AM, Robert Muir wrote:
> (and, contrary to what you would believe from the documentation, the
> choice of whether or not to make a PhraseQuery is not based on syntax one
> bit!)

Thats a major exaggeration - quoting text plays a large role in whether or not you will get a phrase query.

--
- Mark
http://www.lucidimagination.com
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On Wed, May 12, 2010 at 6:05 AM, Itamar Syn-Hershko wrote:
> The QueryParser also fails to correctly parse Hebrew acronyms; although not
> being an integral part of the current discussion, I thought this would be
> the best place to bring that up.
>

Just as I don't think Analysis should do QueryParsing, I don't think QueryParsing should do Analysis either.

Similar problems to this exist in other languages (I have to escape : for some, because lucene wants to interpret it as a field name). But this can be easily remedied on the application side: it's documented and understood that the double-quote is a special character, and there is an escape mechanism so you can escape the ones you think are acronyms.

This issue is about a buggy implementation: it's not documented and only internal to how the queryparser determines what is a phrase query or not (and, contrary to what you would believe from the documentation, the choice of whether or not to make a PhraseQuery is not based on syntax one bit!)

--
Robert Muir
rcm...@gmail.com
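A minimal sketch of that application-side escaping, in Python, might look as follows. The rule used to guess which quotes belong to acronyms (a quote with a word character on both sides) is an assumption made for this example; the backslash escape itself is the mechanism the query syntax documentation describes for special characters.

```python
import re

# A double quote with a word character immediately before and after it.
_MIDWORD_QUOTE = re.compile(r'(?<=\w)"(?=\w)')


def escape_midword_quotes(user_input):
    """Backslash-escape quotes that appear inside a word.

    Quotes at the start or end of a chunk are left untouched so that
    ordinary phrase syntax keeps working.
    """
    return _MIDWORD_QUOTE.sub(r'\\"', user_input)


escaped_acronym = escape_midword_quotes('MNK"L')
untouched_phrase = escape_midword_quotes('"hello dolly"')
```

The escaped string would then be handed to the query parser in place of the raw user input.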
RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
The QueryParser also fails to correctly parse Hebrew acronyms; although not being an integral part of the current discussion, I thought this would be the best place to bring that up.

Hebrew acronyms are made up of letters with a single double-quote char within, for example: MNK"L (Hebrew for CEO). That double-quote char usually comes in the second-to-last position of the word, but in some cases it can come earlier (MNK"LIT).

Since the QP expects two sets of double-quotes enclosing a phrase, an exception will be thrown if such a word is passed to it, or an incorrect phrase query will be produced if two acronyms are used together in a query string. Not sure which is worse.

Perhaps while you're at it you could make sure to only create a phrase query if a quote is followed by a space - hence is definitely at the end of a word - and not just assume it to be equivalent to a white space?

Although there's no good open Hebrew analyzer for Lucene yet, hence no motivation for this to be fixed, I'm working on one as we speak and hopefully will have something to show in the next few weeks/days. It would be nice to have at least this issue closed within the Lucene core code.

Thanks,

Itamar Syn-Hershko
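The two failure modes described in this message can be reproduced with a deliberately naive Python model that treats every double quote as a phrase delimiter. This is an illustrative stand-in, not the actual QueryParser grammar, and the function name is invented for the example.

```python
def naive_quote_parse(query):
    """Pair up every double quote as a phrase delimiter.

    Raises ValueError when the number of quotes is odd, and otherwise
    returns ('phrase', text) and ('word', text) items.
    """
    parts = query.split('"')
    if len(parts) % 2 == 0:
        # An even number of parts means an odd number of quotes.
        raise ValueError("unbalanced double quote in query: %r" % query)
    items = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            items.append(("phrase", part))
        else:
            items.extend(("word", word) for word in part.split())
    return items


try:
    naive_quote_parse('MNK"L')
    single_acronym_error = None
except ValueError as error:
    single_acronym_error = str(error)

two_acronyms = naive_quote_parse('MNK"L MNK"LIT')
```

A single acronym leaves an unbalanced quote and is rejected, while two acronyms in one query balance each other and yield a bogus phrase made from the tail of the first word and the head of the second.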