Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
On Sun, May 23, 2010 at 2:39 PM, Uwe Schindler  wrote:

> In 3.1 we don't remove QP from core, so let's fix it there. In 4.0 we can
> possibly have a totally new QP with no backwards compatibility at all, so no
> problem. As Earwin noted in another reply: his comment is the way QP should
> work, but this is hard to do with analyzers in front (I would have some
> ideas). But then we must do it without crappy javacc or jflex, with Analyzer
> plus some self-coded stuff only.


I don't want to "give up" on fixing this second whitespace problem
until 4.0. Mostly because it prevents Lucene from having proper n-gram
search where n-grams are word-spanning (this is the classic case that
is well understood and gives good results for lots of Euro languages).

So this is why I was against this option: I want to fix this
later too, and not make it impossible.

-- 
Robert Muir
rcm...@gmail.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Uwe Schindler
> On Sun, May 23, 2010 at 1:00 PM, Uwe Schindler  wrote:
> >  I just want to make the feature accessible and documented without
> Version.
> 
> I think it is just a bug (a shoddy implementation that does not use the 
> syntax,
> whether it was quoted or not, since this has been thrown away). In this
> implementation no one thought about languages that don't use whitespace
> and that it would make all queries into phrasequeries.

I am happy with your API changes that add the "quoted" parameter, but on top
of that we should also have a boolean switch to preserve the old behavior for
users who do not deal with CJK or other non-whitespace languages. We could
also add a method to TokenStream/Analyzer called "isTokenizingOnWhitespace"
(don't take this seriously!).

> I really do not think this sort of code belongs inside core lucene, if you 
> want
> to make uninternationalized code in your own code base that is not correct
> that is fine.

The issue does not affect me at all. I (and also Shai) use our own query
parsers that do exactly that (my content is mixed-language, preferably
English-only with some foreign fragments, and no CJK and the like). So here
you are correct; I just foresee thousands of issues and bug reports, because
in the western European world users build phrases the way I described (and
that is what was behind the code you don't like).

> Furthermore by preserving this kind of bug it makes the queryparser more
> complicated, and especially in the future. If at some point in the future you
> want to really have the QP not split on whitespace (as you yourself said on
> the issue you want) to enable support for multi-word synonyms and "real" n-
> grams at querytime, I hope you understand this buggy code conflicts and
> complicates this later goal.

In 3.1 we don't remove QP from core, so let's fix it there. In 4.0 we can
possibly have a totally new QP with no backwards compatibility at all, so no
problem. As Earwin noted in another reply: his comment is the way QP should
work, but this is hard to do with analyzers in front (I would have some
ideas). But then we must do it without crappy javacc or jflex, with Analyzer
plus some self-coded stuff only.

Let's drink some beer and think about it (too bad that I am out of Czech now). 
Possibly at Berlin! :-)

Uwe






Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
+1, this is what the patch does. I agree I did a crappy job explaining
the issue.

On Sun, May 23, 2010 at 2:25 PM, Shai Erera  wrote:
> So ... after a long IRC chat on this, I think this has just been worded
> incorrectly (the issue). As I understand, there are two issues here:
> 1) QP loses phrase information for fields -- the queries f:"abcd" and f:abcd
> are parsed the same, or handled the same. Anyone extending QP has no way to
> tell whether quotes were used.
> 2) QP has a default impl for f:abcd which is not international-friendly.
>
> I agree (1) should be fixed, and I apologize if I missed that previously.
> Version is the right way to go with this.
>
> About (2), I think that if f:abcd is submitted, then a PQ should not be
> created. The user hasn't asked for it. But if f:"abcd" was submitted, then
> it is ok to create a PQ by default. And we're only talking about defaults
> here. Anyone should be able to extend QP and override the relevant
> getFieldQuery variant and do whatever he wants.
>
> If the question is what the default behavior should be for (2), then I think
> that, pending Version, it should create a PQ for f:"abcd" only. And we leave
> it to whoever extends QP to decide what the right behavior is for them.
>
> Shai



-- 
Robert Muir
rcm...@gmail.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Shai Erera
So ... after a long IRC chat on this, I think this has just been worded
incorrectly (the issue). As I understand, there are two issues here:
1) QP loses phrase information for fields -- the queries f:"abcd" and f:abcd
are parsed the same, or handled the same. Anyone extending QP has no way to
tell whether quotes were used.
2) QP has a default impl for f:abcd which is not international-friendly.

I agree (1) should be fixed, and I apologize if I missed that previously.
Version is the right way to go with this.

About (2), I think that if f:abcd is submitted, then a PQ should not be
created. The user hasn't asked for it. But if f:"abcd" was submitted, then
it is ok to create a PQ by default. And we're only talking about defaults
here. Anyone should be able to extend QP and override the relevant
getFieldQuery variant and do whatever he wants.

If the question is what the default behavior should be for (2), then I think
that, pending Version, it should create a PQ for f:"abcd" only. And we leave
it to whoever extends QP to decide what the right behavior is for them.
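
A minimal sketch of point (1) and the proposed default for (2), written in Python rather than Lucene's Java. Everything in it is hypothetical: ToyQueryParser, LegacyParser, get_field_query and toy_analyze are names invented for illustration, and the regex "analyzer" only imitates one that splits on hyphens. The only thing it tries to show is a parser that remembers whether quotes were used and hands that flag to an overridable hook.

```python
import re


def toy_analyze(text):
    # Hypothetical stand-in for an Analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[0-9a-z]+", text.lower())


class ToyQueryParser:
    # Matches field:"quoted text" or field:bare_text.
    CLAUSE = re.compile(r'(\w+):(?:"([^"]*)"|(\S+))')

    def parse(self, query):
        clauses = []
        for match in self.CLAUSE.finditer(query):
            field, quoted_text, bare_text = match.groups()
            quoted = quoted_text is not None  # the information that must not be lost
            text = quoted_text if quoted else bare_text
            clauses.append(self.get_field_query(field, text, quoted))
        return clauses

    def get_field_query(self, field, text, quoted):
        # Proposed default: a phrase only when the user actually typed quotes.
        tokens = toy_analyze(text)
        if quoted and len(tokens) > 1:
            return ("PHRASE", field, tokens)
        return ("TERMS", field, tokens)


class LegacyParser(ToyQueryParser):
    def get_field_query(self, field, text, quoted):
        # Override that restores the term-count heuristic and ignores the flag.
        tokens = toy_analyze(text)
        kind = "PHRASE" if len(tokens) > 1 else "TERMS"
        return (kind, field, tokens)
```

With this toy, f:wi-fi and f:"wi-fi" produce different results under the default hook, while a subclass such as LegacyParser can still opt back into the old behavior without touching the parsing stage.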

Shai



Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
On Sun, May 23, 2010 at 1:00 PM, Uwe Schindler  wrote:
>  I just want to make the feature accessible and documented without Version.

I think it is just a bug (a shoddy implementation that does not use
the syntax, whether it was quoted or not, since this has been thrown
away). In this implementation no one thought about languages that
don't use whitespace and that it would make all queries into
phrasequeries.

I really do not think this sort of code belongs inside core Lucene. If
you want to keep uninternationalized, incorrect code in your own code
base, that is fine.

Furthermore by preserving this kind of bug it makes the queryparser
more complicated, and especially in the future. If at some point in
the future you want to really have the QP not split on whitespace (as
you yourself said on the issue you want) to enable support for
multi-word synonyms and "real" n-grams at querytime, I hope you
understand this buggy code conflicts and complicates this later goal.

-- 
Robert Muir
rcm...@gmail.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Earwin Burrfoot
> The QP should work like that:
> (1) It parses the query, creating fragments
> (2) It does some out-of-the-box handling of those fragments
>
> People should be able to override that handling of fragments. But people
> should not touch (1).

In fact QP should work like that:
(1) Tokenizer parses the query as if it was a string of text.
Care must be taken to preserve query language operators, as this stage
essentially replaces current QP's lexer stage.
(2) QP's syntax parser kicks in, identifies operators (those that
Tokenizer didn't treat as a part of word tokens) and does overridable
out-of-the-box handling for them and tokens around them.

The point is - it's hard to do correctly. That's why Lucene resorts to
the upside-down approach.
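
The two stages above can be sketched roughly as follows. This is a Python toy, not Lucene code: tokenize_query, parse_tokens, the operator set and the pluggable analyze callback are all assumptions made for illustration. Stage 1 sees the whole query string and only protects quotes and operator words; every text run between them goes to the analyzer intact, whitespace included.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}
# The capturing group makes re.split keep the delimiters in its output.
DELIMITERS = re.compile(r'("|\bAND\b|\bOR\b|\bNOT\b)')


def tokenize_query(query, analyze=str.split):
    """Stage 1: tokenize the whole query, preserving quotes and operators."""
    tokens = []
    for piece in DELIMITERS.split(query):
        if piece == '"':
            tokens.append(("QUOTE", piece))
        elif piece in OPERATORS:
            tokens.append(("OP", piece))
        else:
            # The analyzer receives the full text run, not one word at a time.
            tokens.extend(("WORD", word) for word in analyze(piece))
    return tokens


def parse_tokens(tokens):
    """Stage 2: recognise operators and quotes, group the tokens around them."""
    clauses, i = [], 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == "QUOTE":
            # Collect everything up to the closing quote (or end of input).
            j, words = i + 1, []
            while j < len(tokens) and tokens[j][0] != "QUOTE":
                words.append(tokens[j][1])
                j += 1
            clauses.append(("PHRASE", words))
            i = j + 1
        elif kind == "OP":
            clauses.append(("OP", value))
            i += 1
        else:
            clauses.append(("TERM", value))
            i += 1
    return clauses
```

The default analyze here happens to split on whitespace, but because it is handed whole text runs, a word-spanning analyzer could be dropped in without changing stage 2. The hard part mentioned above (keeping operators safe from an arbitrary tokenizer) is sidestepped in this sketch by the fixed DELIMITERS pattern.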




RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Uwe Schindler
Yes, I understand that. But even so it is still not a bug; it is a
"feature" (and was implemented as one) to build phrase queries without
quotes, e.g. by simply joining words with ASCII hyphens (for most European
analyzers). And exactly to preserve this behavior, let's simply switch it
on/off using a getter/setter. That's all I want, really. I know you are right,
and I still want to drink beer with you in Berlin and not be killed :-) I just
want to make the feature accessible and documented without Version. The idea
behind Version would be contradicted. Also the "feature" would go in 4.0.
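
For readers following along, a toy model of the behavior being argued about, in Python. It is only a sketch under assumptions: parse_chunks, toy_analyze and the auto_phrase flag are hypothetical names (the flag stands in for the proposed getter/setter), and the regex analyzer merely imitates one that splits on hyphens.

```python
import re


def toy_analyze(chunk):
    # Hypothetical stand-in for an Analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[0-9a-z]+", chunk.lower())


def parse_chunks(query, auto_phrase=True):
    """Split on whitespace first, then analyze each chunk on its own.

    With auto_phrase on, any chunk that the analyzer turns into more than
    one token becomes a phrase, whether or not the user typed quotes.
    """
    clauses = []
    for chunk in query.split():
        tokens = toy_analyze(chunk)
        if auto_phrase and len(tokens) > 1:
            clauses.append(("PHRASE", tokens))
        else:
            clauses.extend(("TERM", token) for token in tokens)
    return clauses
```

With the flag on, "wi-fi router" yields a phrase for wi-fi plus a plain term; with it off, three plain terms. An analyzer that emits one token per character for unspaced text would hit the same branch on every chunk, which is the CJK complaint.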

That’s all and I hope that you understand my argument.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Sunday, May 23, 2010 6:43 PM
> To: dev@lucene.apache.org
> Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't
> generate phrasequeries based on term count
> 
> These comments lead me to believe you don't understand the issue.
> 
> Do you understand that *ALL* CJK queries are made into phrase queries,
> regardless of tokenizer?!!?!?!

Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Shai Erera
Sorry hit the send button too early

The QP should work like that:
(1) It parses the query, creating fragments
(2) It does some out-of-the-box handling of those fragments

People should be able to override that handling of fragments. But people
should not touch (1).

And so if we keep thinking of QP that way, then we'll have THE QP for (1),
because there can be only one, and a "we think this is best for most people"
handling of fragments (2). So for (1), QP would create the fragment field +
value, and it'll be your choice how to interpret that. QP will provide a
default handling for it (today's behavior is fine w/ me).

We already do this today in other, equally important places - IndexWriter.
It has all sorts of configs and has "what we think is good for most"
defaults. People can override them, and that's it.

And I think I do understand the issue, Robert, even though I speak fewer
languages than you do. It is all about tokenization. My QP does the parsing
I've mentioned - it breaks the query into fragments and then handles them.
There is a default behavior which people can override. And that behavior is
all about tokenization.

Shai

On Sun, May 23, 2010 at 7:47 PM, Shai Erera  wrote:

> Robert - I hope hitting the keyboard hard makes you happy :)
>
> I do get the issue. And I still think that CJK queries are just a small
> percentage of all queries that are used in the world today. Or at least by
> Lucene. And I'm not sure why we want to change the default for ALL OTHER
> LANGUAGES, just so that CJK QUERIES will hAVe A diFFEreNT BeHAvioR
> !!?!?!?!?!?!??!?!?!

Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
It's not just CJK queries; it's in general any language not separated on
whitespace.

There are a lot of other languages that don't use whitespace the same
way English does.

On Sun, May 23, 2010 at 12:47 PM, Shai Erera  wrote:
> Robert - I hope hitting the keyboard hard makes you happy :)
>
> I do get the issue. And I still think that CJK queries are just a small
> percentage of all queries that are used in the world today. Or at least by
> Lucene. And I'm not sure why we want to change the default for ALL OTHER
> LANGUAGES, just so that CJK QUERIES will hAVe A diFFEreNT BeHAvioR
> !!?!?!?!?!?!??!?!?!

Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Shai Erera
Robert - I hope hitting the keyboard hard makes you happy :)

I do get the issue. And I still think that CJK queries are just a small
percentage of all queries that are used in the world today. Or at least by
Lucene. And I'm not sure why we want to change the default for ALL OTHER
LANGUAGES, just so that CJK QUERIES will hAVe A diFFEreNT BeHAvioR
!!?!?!?!?!?!??!?!?!



On Sun, May 23, 2010 at 7:42 PM, Robert Muir  wrote:

> These comments lead me to believe you don't understand the issue.
>
> Do you understand that *ALL* CJK queries are made into phrase queries,
> regardless of tokenizer?!!?!?!

Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
On Sun, May 23, 2010 at 12:34 PM, Shai Erera  wrote:

> I want to stress that not all ngram-based languages are affected by this
> behavior, especially those for which we do ngram just because of a lack of
> good tokenizer.
>

They are also affected! Do you understand how the queryparser treats
whitespace? You cannot currently use "normal" word-spanning n-grams
with Lucene because of this:

1) you can only use word-internal n-grams because each
whitespace-separated word gets its own tokenstream
2) all queries here are also made into phrasequeries automatically,
which is stupid as n-grams already contain the 'positional
information'
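
A small Python sketch of point 1, to show the difference between word-internal and word-spanning n-grams. The function names are made up for illustration and this is not how any Lucene tokenizer is implemented; it only models "split on whitespace first" versus "hand the whole string to the n-gram step".

```python
def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_internal_ngrams(query, n=2):
    # Models the current situation: each whitespace-separated word gets
    # its own token stream, so no n-gram can cross a word boundary.
    grams = []
    for word in query.split():
        grams.extend(char_ngrams(word, n))
    return grams


def word_spanning_ngrams(query, n=2):
    # Models the classic approach: the whole query reaches the n-gram
    # step, so grams that straddle the space are produced as well.
    return char_ngrams(query, n)
```

For the input "ab cd" the first function yields only "ab" and "cd", while the second also yields the two grams that include the space. Those boundary grams are what carry the adjacency of the two words, which is why wrapping the result in a phrase query on top adds nothing.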

-- 
Robert Muir
rcm...@gmail.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
These comments lead me to believe you don't understand the issue.

Do you understand that *ALL* CJK queries are made into phrase queries,
regardless of tokenizer?!!?!?!

On Sun, May 23, 2010 at 12:38 PM, Uwe Schindler  wrote:
> Same here, as already noted in the issue.
>
>
>
> Uwe
>
>
>
> -
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: u...@thetaphi.de
>
>
>
> From: Shai Erera [mailto:ser...@gmail.com]
> Sent: Sunday, May 23, 2010 6:34 PM
>
> To: dev@lucene.apache.org
> Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate
> phrasequeries based on term count
>
>
>
> Robert - is the effect on scoring also on English and other European
> languages? Or is it mostly for ngram-based languages, and especially CJK?
>
> I want to stress that not all ngram-based languages are affected by this
> behavior, especially those for which we do ngram just because of a lack of
> good tokenizer.
>
> That's why I'm not sure the default should be changed and I'm all for a
> getter/setter. If however it turns out the default MUST be changed, then I
> support the Version + getter/setter approach.
>
> Shai
>
> On Sun, May 23, 2010 at 6:00 PM, Uwe Schindler (JIRA) 
> wrote:
>
>    [
> https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870410#action_12870410
> ]
>
> Uwe Schindler commented on LUCENE-2458:
> ---
>
> Hi Robert,
>
> I also agree with Mark (as you know). We can have both:
> - Version for a good default (3.1 will get the new non-phrase-query
> behavior)
> - A separate getsetter for this option
> (set/getCreatePhraseQueryOnConcenattedTerms or whatever)
>
> This would give you the best from both worlds.
>
>> queryparser shouldn't generate phrasequeries based on term count
>> 
>>
>>                 Key: LUCENE-2458
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-2458
>>             Project: Lucene - Java
>>          Issue Type: Bug
>>          Components: QueryParser
>>            Reporter: Robert Muir
>>            Assignee: Robert Muir
>>            Priority: Blocker
>>             Fix For: 3.1, 4.0
>>
>>         Attachments: LUCENE-2458.patch, LUCENE-2458.patch
>>
>>
>> The current method in the queryparser to generate phrasequeries is wrong:
>> The Query Syntax documentation
>> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
>> {noformat}
>> A Phrase is a group of words surrounded by double quotes such as "hello
>> dolly".
>> {noformat}
>> But as we know, this isn't actually true.
>> Instead the terms are first divided on whitespace, then the analyzer term
>> count is used as some sort of "heuristic" to determine if it's a phrase query
>> or not.
>> This assumption is a disaster for languages that don't use whitespace
>> separation: CJK, compounding European languages like German, Finnish, etc.
>> It also makes it difficult for people to use n-gram analysis techniques. In
>> these cases you get bad relevance (MAP improves nearly *10x* if you use a
>> PositionFilter at query-time to "turn this off" for Chinese).
>> Even for English, this undocumented behavior is bad. Perhaps in some cases
>> it's being abused as some heuristic to "second guess" the tokenizer and piece
>> back things it shouldn't have split, but for large collections, doing things
>> like generating phrasequeries because StandardTokenizer split a compound on
>> a dash can cause serious performance problems. Instead people should analyze
>> their text with the appropriate methods, and QueryParser should only
>> generate phrase queries when the syntax asks for one.
>> The PositionFilter in contrib can be seen as a workaround, but it's pretty
>> obscure and people are not familiar with it. The result is we have bad
>> out-of-box behavior for many languages, and bad performance for others on
>> some inputs.
>> I propose instead that we change the grammar to actually look for double
>> quotes to determine when to generate a phrase query, consistent with the
>> documentation.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
>
>



-- 
Robert Muir
rcm...@gmail.com




RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Uwe Schindler
Same here, as already noted in the issue.

 

Uwe

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

 <http://www.thetaphi.de/> http://www.thetaphi.de

eMail: u...@thetaphi.de

 

From: Shai Erera [mailto:ser...@gmail.com] 
Sent: Sunday, May 23, 2010 6:34 PM
To: dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate
phrasequeries based on term count

 

Robert - is the effect on scoring also on English and other European
languages? Or is it mostly for ngram-based languages, and especially CJK?

I want to stress that not all ngram-based languages are affected by this
behavior, especially those for which we do ngram just because of a lack of
good tokenizer.

That's why I'm not sure the default should be changed and I'm all for a
getter/setter. If however it turns out the default MUST be changed, then I
support the Version + getter/setter approach.

Shai

On Sun, May 23, 2010 at 6:00 PM, Uwe Schindler (JIRA) 
wrote:


   [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870410#action_12870410 ]

Uwe Schindler commented on LUCENE-2458:
---

Hi Robert,

I also agree with Mark (as you know). We can have both:
- Version for a good default (3.1 will get the new non-phrase-query
behavior)
- A separate getsetter for this option
(set/getCreatePhraseQueryOnConcenattedTerms or whatever)

This would give you the best from both worlds.

> queryparser shouldn't generate phrasequeries based on term count
> 
>
> Key: LUCENE-2458
> URL: https://issues.apache.org/jira/browse/LUCENE-2458
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Blocker
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2458.patch, LUCENE-2458.patch
>
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation
(http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello
dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead the terms are first divided on whitespace, then the analyzer term
count is used as some sort of "heuristic" to determine if its a phrase query
or not.
> This assumption is a disaster for languages that don't use whitespace
separation: CJK, compounding European languages like German, Finnish, etc.
It also
> makes it difficult for people to use n-gram analysis techniques. In these
cases you get bad relevance (MAP improves nearly *10x* if you use a
PositionFilter at query-time to "turn this off" for chinese).
> For even english, this undocumented behavior is bad. Perhaps in some cases
its being abused as some heuristic to "second guess" the tokenizer and piece
back things it shouldn't have split, but for large collections, doing things
like generating phrasequeries because StandardTokenizer split a compound on
a dash can cause serious performance problems. Instead people should analyze
their text with the appropriate methods, and QueryParser should only
generate phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but its pretty
obscure and people are not familiar with it. The result is we have bad
out-of-box behavior for many languages, and bad performance for others on
some inputs.
> I propose instead that we change the grammar to actually look for double
quotes to determine when to generate a phrase query, consistent with the
documentation.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



 



Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Shai Erera
Robert - is the effect on scoring also on English and other European
languages? Or is it mostly for ngram-based languages, and especially CJK?

I want to stress that not all ngram-based languages are affected by this
behavior, especially those for which we do ngram just because of a lack of
a good tokenizer.

That's why I'm not sure the default should be changed and I'm all for a
getter/setter. If however it turns out the default MUST be changed, then I
support the Version + getter/setter approach.
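
One way to picture the Version + getter/setter combination is the sketch below. It is plain Python rather than Lucene code: `ParserSettings`, `match_version` and `auto_generate_phrase_queries` are hypothetical names, and the (3, 1) cutoff only mirrors the proposal in this thread that 3.1 gets the new behavior by default.

```python
class ParserSettings:
    """Toy model: a version picks the default, an explicit setter overrides it."""

    NEW_DEFAULT_SINCE = (3, 1)  # assumption taken from the proposal in this thread

    def __init__(self, match_version):
        self.match_version = match_version
        self._override = None  # None means "use the version-based default"

    @property
    def auto_generate_phrase_queries(self):
        if self._override is not None:
            return self._override
        # Versions before the cutoff keep the old term-count heuristic on.
        return self.match_version < self.NEW_DEFAULT_SINCE

    @auto_generate_phrase_queries.setter
    def auto_generate_phrase_queries(self, enabled):
        self._override = bool(enabled)
```

With this shape, someone who upgrades but still wants the old behavior sets the flag once, and everyone else gets the default that matches the version they asked for.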

Shai

On Sun, May 23, 2010 at 6:00 PM, Uwe Schindler (JIRA) wrote:

>
>[
> https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870410#action_12870410]
>
> Uwe Schindler commented on LUCENE-2458:
> ---
>
> Hi Robert,
>
> I also agree with Mark (as you know). We can have both:
> - Version for a good default (3.1 will get the new non-phrase-query
> behavior)
> - A separate getsetter for this option
> (set/getCreatePhraseQueryOnConcenattedTerms or whatever)
>
> This would give you the best from both worlds.
>
> > queryparser shouldn't generate phrasequeries based on term count
> > 
> >
> > Key: LUCENE-2458
> > URL: https://issues.apache.org/jira/browse/LUCENE-2458
> > Project: Lucene - Java
> >  Issue Type: Bug
> >  Components: QueryParser
> >Reporter: Robert Muir
> >Assignee: Robert Muir
> >Priority: Blocker
> > Fix For: 3.1, 4.0
> >
> > Attachments: LUCENE-2458.patch, LUCENE-2458.patch
> >
> >
> > The current method in the queryparser to generate phrasequeries is wrong:
> > The Query Syntax documentation (
> http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> > {noformat}
> > A Phrase is a group of words surrounded by double quotes such as "hello
> dolly".
> > {noformat}
> > But as we know, this isn't actually true.
> > Instead the terms are first divided on whitespace, then the analyzer term
> count is used as some sort of "heuristic" to determine if its a phrase query
> or not.
> > This assumption is a disaster for languages that don't use whitespace
> separation: CJK, compounding European languages like German, Finnish, etc.
> It also
> > makes it difficult for people to use n-gram analysis techniques. In these
> cases you get bad relevance (MAP improves nearly *10x* if you use a
> PositionFilter at query-time to "turn this off" for chinese).
> > For even english, this undocumented behavior is bad. Perhaps in some
> cases its being abused as some heuristic to "second guess" the tokenizer and
> piece back things it shouldn't have split, but for large collections, doing
> things like generating phrasequeries because StandardTokenizer split a
> compound on a dash can cause serious performance problems. Instead people
> should analyze their text with the appropriate methods, and QueryParser
> should only generate phrase queries when the syntax asks for one.
> > The PositionFilter in contrib can be seen as a workaround, but its pretty
> obscure and people are not familiar with it. The result is we have bad
> out-of-box behavior for many languages, and bad performance for others on
> some inputs.
> > I propose instead that we change the grammar to actually look for double
> quotes to determine when to generate a phrase query, consistent with the
> documentation.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
>
>


Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
Yes, in this issue there are a lot of unsubstantiated claims that this bug
helps with broken English tokenization.  I simply ask you to measure your
claim... the scoring is already broken in this case!

On May 23, 2010 9:42 AM, "Mark Miller"  wrote:

Obnoxiousness has certainly been in the air regarding this issue, I'll
give you that.


On Sunday, May 23, 2010, Robert Muir  wrote:
> I can't tell if you are being obno...

-- 
- Mark

http://www.lucidimagination.com



Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Mark Miller
Obnoxiousness has certainly been in the air regarding this issue, I'll
give you that.

On Sunday, May 23, 2010, Robert Muir  wrote:
> I can't tell if you are being obnoxious or seriously believe what you say.  
> You understand that cjkanalyzer is broke with this? You understand that 
> ngrams themselves capture information about position and it even works nicely 
> with scoring, and helps.
>
> This hack doesn't help english.  If you think otherwise, be a man and show 
> real results
> On May 23, 2010 6:39 AM, "Shai Erera (JIRA)"  wrote:
>
>
> [ 
> https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issue...
>

-- 
- Mark

http://www.lucidimagination.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
I can't tell if you are being obnoxious or seriously believe what you say.
You understand that CJKAnalyzer is broken by this? You understand that
n-grams themselves capture information about position, and that this even
works nicely with scoring, and helps?

This hack doesn't help English.  If you think otherwise, be a man and show
real results.

On May 23, 2010 6:39 AM, "Shai Erera (JIRA)"  wrote:


[
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issue.
..



RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Itamar Syn-Hershko
Again, this is not a hack, and that was exactly my point. As I said:

> resolving this is very simple, by just applying a correct logic 
> (ignore double-quotes followed by a char) which isn't enforced today 
> and once it will be, it won't cause any cases of unexpected behavior.

It is just as valid for English queries as it is for Hebrew ones to ignore a
double quote in mid-word, instead of tokenizing on it, when it is not
followed by whitespace.

Itamar. 

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Thursday, May 13, 2010 3:24 AM
To: dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate
phrasequeries based on term count

Internationalization doesn't work by just piling hacks for language X,
language Y, and language Z on top of each other.

Just like I want the English hack removed, I strongly recommend against
adding any Hebrew hack.

On Wed, May 12, 2010 at 6:55 PM, Itamar Syn-Hershko 
wrote:
> I think we understand each other perfectly well. I still think 
> resolving this is very simple, by just applying a correct logic 
> (ignore double-quotes followed by a char) which isn't enforced today 
> and once it will be, it won't cause any cases of unexpected behavior. 
> This isn't an analysis related task, and I'm not sure what  makes you 
> insist so bad. I will be openning a dedicated JIRA ticket for this 
> discussion if this won't become part of the current one.
>
> Itamar.
>
> -Original Message-
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Thursday, May 13, 2010 1:42 AM
> To: dev@lucene.apache.org
> Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't 
> generate phrasequeries based on term count
>
> On Wed, May 12, 2010 at 6:30 PM, Itamar Syn-Hershko 
> 
> wrote:
>> Never did I request the QP to do Analysis. I simply mentioned this 
>> bug
>> - what this definitely is -
>
> Its definitely not a bug for Hebrew, there is a unicode character for 
> gershayim (U+05F4), so technically this should be used according to
unicode.
>
> Its arguably your responsibility to convert your data to unicode 
> before passing it thru Lucene, and that includes disambiguating when a 
> double quote should be gershayim
>
> --
> Robert Muir
> rcm...@gmail.com
>
>
>



--
Robert Muir
rcm...@gmail.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Robert Muir
Internationalization doesn't work by just piling hacks for language X,
language Y, and language Z on top of each other.

Just like I want the English hack removed, I strongly recommend
against adding any Hebrew hack.

On Wed, May 12, 2010 at 6:55 PM, Itamar Syn-Hershko  wrote:
> I think we understand each other perfectly well. I still think resolving
> this is very simple, by just applying a correct logic (ignore double-quotes
> followed by a char) which isn't enforced today and once it will be, it won't
> cause any cases of unexpected behavior. This isn't an analysis related task,
> and I'm not sure what  makes you insist so bad. I will be openning a
> dedicated JIRA ticket for this discussion if this won't become part of the
> current one.
>
> Itamar.
>
> -Original Message-
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Thursday, May 13, 2010 1:42 AM
> To: dev@lucene.apache.org
> Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate
> phrasequeries based on term count
>
> On Wed, May 12, 2010 at 6:30 PM, Itamar Syn-Hershko 
> wrote:
>> Never did I request the QP to do Analysis. I simply mentioned this bug
>> - what this definitely is -
>
> Its definitely not a bug for Hebrew, there is a unicode character for
> gershayim (U+05F4), so technically this should be used according to unicode.
>
> Its arguably your responsibility to convert your data to unicode before
> passing it thru Lucene, and that includes disambiguating when a double quote
> should be gershayim
>
> --
> Robert Muir
> rcm...@gmail.com
>
>
>



-- 
Robert Muir
rcm...@gmail.com




RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Itamar Syn-Hershko
I think we understand each other perfectly well. I still think resolving
this is very simple: just apply the correct logic (ignore double quotes
followed by a char), which isn't enforced today; once it is, it won't
cause any cases of unexpected behavior. This isn't an analysis-related task,
and I'm not sure what makes you insist so strongly. I will be opening a
dedicated JIRA ticket for this discussion if this won't become part of the
current one.
 
Itamar.

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Thursday, May 13, 2010 1:42 AM
To: dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate
phrasequeries based on term count

On Wed, May 12, 2010 at 6:30 PM, Itamar Syn-Hershko 
wrote:
> Never did I request the QP to do Analysis. I simply mentioned this bug 
> - what this definitely is -

Its definitely not a bug for Hebrew, there is a unicode character for
gershayim (U+05F4), so technically this should be used according to unicode.

Its arguably your responsibility to convert your data to unicode before
passing it thru Lucene, and that includes disambiguating when a double quote
should be gershayim

--
Robert Muir
rcm...@gmail.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Robert Muir
On Wed, May 12, 2010 at 6:30 PM, Itamar Syn-Hershko  wrote:
> Never did I request the QP to do Analysis. I simply mentioned this bug -
> what this definitely is -

It's definitely not a bug for Hebrew: there is a Unicode character for
gershayim (U+05F4), so technically this should be used according to
Unicode.

It's arguably your responsibility to convert your data to Unicode
before passing it through Lucene, and that includes disambiguating when a
double quote should be gershayim.
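
A minimal sketch of that kind of application-side normalization, in plain Python rather than Lucene code. The function name is hypothetical, and the rule used here (an ASCII double quote counts as gershayim only when it has a Hebrew letter, U+05D0 to U+05EA, on both sides) is an assumption made for illustration; real data may need a more careful test. Whether the analyzer further down keeps U+05F4 inside the token is a separate question this sketch does not address.

```python
import re

GERSHAYIM = "\u05F4"  # HEBREW PUNCTUATION GERSHAYIM

# An ASCII double quote with a Hebrew letter immediately before and after it.
_MID_WORD_QUOTE = re.compile(r'(?<=[\u05D0-\u05EA])"(?=[\u05D0-\u05EA])')


def normalize_gershayim(text):
    """Replace mid-word ASCII double quotes in Hebrew words with U+05F4."""
    return _MID_WORD_QUOTE.sub(GERSHAYIM, text)
```

Quotes at the start or end of a word are left alone, so a quoted phrase that contains an acronym keeps its enclosing quotes and only the inner one is rewritten.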

-- 
Robert Muir
rcm...@gmail.com




RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Itamar Syn-Hershko
Never did I request the QP to do Analysis. I simply mentioned this bug -
what this definitely is - so you could tackle it while you're at it. This is
definitely relevant to a discussion about reworking how the QP determines
what is a legit PhraseQuery and what is not.

The fix is quite easy I believe - just make sure you don't identify a
double-quote as a trigger for starting or ending a phrase unless it is
followed by a white-space (or another non-char). An English query like
'Foo"bar"' (with no enclosing quotes...) is invalid anyway (although it is
not handled as such at the moment).
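
To make the proposed rule concrete, here is a rough sketch in plain Python (not the QP grammar, and `find_phrase_quotes` is a hypothetical name). It reads the rule as "a double quote is a phrase delimiter only at a word boundary", i.e. a quote with a letter or digit on both sides is treated as part of the word; that reading is an assumption, since the wording above only mentions what follows the quote.

```python
def find_phrase_quotes(query):
    """Return the indices of double quotes that would act as phrase delimiters."""
    delimiters = []
    for i, ch in enumerate(query):
        if ch != '"':
            continue
        # Treat the start and end of the string like whitespace.
        before = query[i - 1] if i > 0 else " "
        after = query[i + 1] if i + 1 < len(query) else " "
        # Mid-word quote (letter or digit on both sides): part of the word.
        if before.isalnum() and after.isalnum():
            continue
        delimiters.append(i)
    return delimiters
```

Under this reading MNK"L and MNK"LIT contain no delimiters at all, while an ordinary quoted phrase still yields its opening and closing quote. An input like Foo"bar" ends up with a single, unbalanced delimiter, which is where a parser could reject the query as invalid.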

I cannot handle this on the application side, simply because there the
double-quote char is NOT a special character. As I mentioned, for Hebrew it
is part of the word, pretty much like Niqqud is. If the user has entered a
textual query with an acronym, there's no point in me parsing it once just
to escape what I suspect are acronyms and then send it to the core QP, or
just create the queries by myself. All this being valid in light of my
second paragraph in this message - the fix is easy and also correct for the
basic, non-Hebrew, implementation.

Itamar.

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Wednesday, May 12, 2010 4:25 PM
To: dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate
phrasequeries based on term count

On Wed, May 12, 2010 at 6:05 AM, Itamar Syn-Hershko 
wrote:
> The QueryParser also fails to correctly parse Hebrew acronyms; 
> although not being an integral part of the current discussion, I 
> thought this would be the best place to bring that up.
>

Just as I don't think Analysis should do QueryParsing, I don't think
QueryParsing should do Analysis either.
Similar problems to this exist in other languages (I have to escape :
for some, because lucene wants to interpret it as a field name).

But this can be easily remedied on the application side, its documented and
understood that the double-quote is a special character, and there is an
escape mechanism so you can escape the ones you think are acronyms.

This issue is about about a buggy implementation: its not documented and
only internal to how the queryparser determines what is a phrase query or
not (and, contrary to what you would believe from the documentation, the
choice of whether or not to make a PhraseQuery is not based on syntax one
bit!)

--
Robert Muir
rcm...@gmail.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Mark Miller

On 5/12/10 11:24 AM, Robert Muir wrote:

> On Wed, May 12, 2010 at 11:16 AM, Mark Miller  wrote:
>>
>> Thats a major exaggeration - quoting text plays a large role in whether or
>> not you will get a phrase query.
>>
>
> No, it has nothing to do with it in the implementation. It only
> "escapes the whitespace", but is discarded. This is clear from looking
> at the grammar.
>
> The logic then to determine if you get a phrase query is the huge mess
> of code in getFieldQuery, but its not based on the double quotes at
> all.
>
> For example a list of chinese or thai words gets a phrase query, only
> because they don't use whitespace between words.
> But a similar list of english words gets a boolean query.



Quotes play a part, or quoting something would simply not create a 
phrase query - quoting something ensures that it hits the analyzer as 
one chunk, rather than getting meta parsed by the grammar and fed to the 
analyzer a token at a time. This ensures that multiple tokens hit the 
funky logic to create a phrase query. The grammar specifically looks for 
quoted chunks.


--
- Mark

http://www.lucidimagination.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Robert Muir
On Wed, May 12, 2010 at 11:16 AM, Mark Miller  wrote:
>
> Thats a major exaggeration - quoting text plays a large role in whether or
> not you will get a phrase query.
>

No, it has nothing to do with it in the implementation. The quoting only
"escapes the whitespace", and is then discarded. This is clear from looking
at the grammar.

The logic that then determines whether you get a phrase query is the huge
mess of code in getFieldQuery, but it's not based on the double quotes at
all.

For example, a list of Chinese or Thai words gets a phrase query, only
because they don't use whitespace between words.
But a similar list of English words gets a boolean query.
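
The difference can be shown with a toy model. This is plain Python, not the QueryParser source: the two "analyzers" and both parse functions are stand-ins invented for illustration, and the only thing taken from the discussion is the shape of the decision (split on whitespace, analyze each chunk, then decide either from the term count or from whether the chunk was quoted).

```python
import re


def whitespace_analyzer(text):
    """Stand-in analyzer for space-delimited text: one token per word."""
    return text.split()


def unigram_analyzer(text):
    """Stand-in analyzer for unsegmented text: one token per character."""
    return [ch for ch in text if not ch.isspace()]


def split_clauses(query):
    """Split on whitespace, keeping quoted chunks whole: (text, was_quoted)."""
    clauses = []
    for m in re.finditer(r'"([^"]*)"|(\S+)', query):
        quoted = m.group(1) is not None
        clauses.append((m.group(1) if quoted else m.group(2), quoted))
    return clauses


def parse_by_term_count(query, analyzer):
    """Heuristic under discussion: more than one term per chunk means phrase."""
    out = []
    for text, _quoted in split_clauses(query):  # the quoted flag is ignored
        terms = analyzer(text)
        out.append(("phrase" if len(terms) > 1 else "term", terms))
    return out


def parse_by_syntax(query, analyzer):
    """Proposed behavior: only a quoted chunk can become a phrase."""
    out = []
    for text, quoted in split_clauses(query):
        terms = analyzer(text)
        if quoted and len(terms) > 1:
            out.append(("phrase", terms))
        else:
            out.extend(("term", [t]) for t in terms)
    return out
```

In this model an unquoted run of Chinese characters becomes one phrase under `parse_by_term_count` and a set of independent terms under `parse_by_syntax`, while "hello dolly" in quotes is a phrase under both.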

-- 
Robert Muir
rcm...@gmail.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Mark Miller

On 5/12/10 9:25 AM, Robert Muir wrote:

> (and, contrary to what you would believe from the
> documentation, the choice of whether or not to make a PhraseQuery is
> not based on syntax one bit!)



That's a major exaggeration - quoting text plays a large role in whether
or not you will get a phrase query.



--
- Mark

http://www.lucidimagination.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Robert Muir
On Wed, May 12, 2010 at 6:05 AM, Itamar Syn-Hershko  wrote:
> The QueryParser also fails to correctly parse Hebrew acronyms; although not
> being an integral part of the current discussion, I thought this would be
> the best place to bring that up.
>

Just as I don't think Analysis should do QueryParsing, I don't think
QueryParsing should do Analysis either.
Similar problems to this exist in other languages (I have to escape :
for some, because lucene wants to interpret it as a field name).

But this can be easily remedied on the application side: it's
documented and understood that the double-quote is a special
character, and there is an escape mechanism so you can escape the ones
you think are acronyms.
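
As a rough illustration of that application-side escaping, here is a sketch in plain Python. The backslash escape is the one described in the query syntax documentation; the function name and the "letter or digit on both sides" test for spotting a likely acronym are assumptions made for this example, not anything the queryparser provides.

```python
def escape_mid_word_quotes(query):
    """Backslash-escape double quotes that sit between two letters or digits."""
    out = []
    last = len(query) - 1
    for i, ch in enumerate(query):
        mid_word = (
            ch == '"'
            and 0 < i < last
            and query[i - 1].isalnum()
            and query[i + 1].isalnum()
        )
        # Quotes at a word boundary are left as phrase delimiters.
        out.append('\\"' if mid_word else ch)
    return "".join(out)
```

A quote that is already escaped has a backslash in front of it, so it fails the test and running the function twice does not double the escaping.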

This issue is about a buggy implementation: it's not documented
and is only internal to how the queryparser determines what is a phrase
query or not (and, contrary to what you would believe from the
documentation, the choice of whether or not to make a PhraseQuery is
not based on syntax one bit!)

-- 
Robert Muir
rcm...@gmail.com




RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Itamar Syn-Hershko
The QueryParser also fails to correctly parse Hebrew acronyms; although not
being an integral part of the current discussion, I thought this would be
the best place to bring that up.

Hebrew acronyms are composed of letters with a single double-quote char
within, for example MNK"L (Hebrew for CEO). That double-quote char usually
comes at the second-to-last position of the word, but in some cases it can
come earlier (MNK"LIT). Since the QP expects a pair of double quotes
enclosing a phrase, an exception will be thrown if such a word is
passed to it, or an incorrect phrase query will be produced if two acronyms
are used together in a query string. Not sure which is worse.

Perhaps while you're at it you could make sure to only create a phrase query
if a quote is followed by a space - hence is definitely at the end of a
word, and not just assume it to be equivalent to a white space?

Although there's no good open Hebrew analyzer for Lucene yet hence no
motivation for this to be fixed, I'm working on one as we speak and
hopefully will have something to show in the next few weeks/days. It would
be nice to have at least this issue closed within the Lucene core code.

Thanks,

Itamar Syn-Hershko

