RE: DisjunctionMaxQuery and scoring

2012-04-19 Thread Uwe Schindler
Hi,

Ah sorry, I misunderstood, you wanted to score the duplicate match lower! To
achieve this, you have to change the coord function in your
similarity/BooleanWeight used for this query.

Either way: if you want a group of terms that contributes only one score
when at least one of the terms matches (like SQL IN), without adding the
individual scores together, DisjunctionMaxQuery is fine. I think this is
what Benson asked for.
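
To illustrate the difference between summing clause scores and taking only the best one, here is a minimal plain-Java sketch. It does not use the Lucene API; the class name, method names, and per-term scores are hypothetical, and the "max" variant assumes a DisjunctionMaxQuery tie-breaker of 0.

```java
public class DisMaxSketch {
    // BooleanQuery-like combination with coord disabled: matching clause scores add up.
    static double sumScore(double... clauseScores) {
        double total = 0.0;
        for (double s : clauseScores) {
            total += s;
        }
        return total;
    }

    // DisjunctionMaxQuery-like combination with a tie-breaker of 0:
    // only the best-scoring clause contributes.
    static double maxScore(double... clauseScores) {
        double best = 0.0;
        for (double s : clauseScores) {
            best = Math.max(best, s);
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical per-term scores: the shorter field "Rich" is assumed to get
        // a higher per-term score than "Dick Rich" because of length normalization.
        double[] dickRich = {0.4, 0.4};
        double[] rich = {0.6};
        System.out.println("sum: " + sumScore(dickRich) + " vs " + sumScore(rich));
        System.out.println("max: " + maxScore(dickRich) + " vs " + maxScore(rich));
    }
}
```

With these made-up numbers, summing ranks "Dick Rich" first, while taking the maximum ranks "Rich" first.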

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Friday, April 20, 2012 8:16 AM
> To: java-user@lucene.apache.org; david_murgatr...@hotmail.com
> Subject: RE: DisjunctionMaxQuery and scoring
> 
> Hi,
> > I think
> >  BooleanQuery bq = new BooleanQuery(false); doesn't quite accomplish
> > the desired "name IN (dick, rich)" scoring behavior. This is because
> > (name:dick | name:rich) with coord=false would score the 'document'
> > "Dick Rich" higher than "Rich" because the former has two term matches
> > and the latter only one.
> > In contrast, I think the desire is that one and only one of the terms
> > in the document match those in the BooleanQuery so that "Rich" would
> > score higher than "Dick Rich", given document length normalization.
> > It's almost like a desire for
> >   BooleanQuery bq = new BooleanQuery(false);
> >   bq.set*Maximum*NumberShouldMatch(1);
>
> In that case DisjunctionMaxQuery is the way to go: it will only count the
> hit with the highest score and not add scores (coord or not coord doesn't
> matter here).
> 
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Robert Muir
On Thu, Apr 19, 2012 at 8:32 PM, David Murgatroyd  wrote:
> In contrast, I think the desire
> is that one and only one of the terms in the document match those in the
> BooleanQuery so that "Rich" would score higher than "Dick Rich", given
> document length normalization. It's almost like a desire for
> BooleanQuery bq = new BooleanQuery(false);
>  bq.set*Maximum*NumberShouldMatch(1);
>

You can, by returning a customized weight with a coord implementation that
PUNISHES documents that match more than one sub-clause.

Take a look at 
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/queries/src/java/org/apache/lucene/queries/BoostingQuery.java
for some inspiration, especially this part:

BooleanQuery result = new BooleanQuery() {
  @Override
  public Weight createWeight(IndexSearcher searcher) throws IOException {
    return new BooleanWeight(searcher, false) {
      @Override
      public float coord(int overlap, int max) {
        // your logic here when overlap == 1, > 1, etc
        return 1.0f; // placeholder only: replace with your own factor
      }
    };
  }
};
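
As a rough illustration of what such coord logic could look like, here is a plain-Java sketch that only mirrors the coord(int overlap, int max) signature and does not depend on Lucene. The class name and the 1/overlap penalty are assumptions chosen for the example, not something prescribed in this thread.

```java
public class PunishingCoordSketch {
    // Same shape as coord(int overlap, int max): overlap is the number of
    // matching sub-clauses, max is the number of clauses that could match.
    static float coord(int overlap, int max) {
        if (overlap <= 0) {
            return 0.0f;            // nothing matched
        }
        if (overlap == 1) {
            return 1.0f;            // exactly one match: keep the score as-is
        }
        // Hypothetical penalty: dividing by overlap turns the summed score
        // into an average, so extra matches no longer raise the score.
        return 1.0f / overlap;
    }

    public static void main(String[] args) {
        float summedTwoMatches = 0.4f + 0.4f;   // made-up scores for "Dick Rich"
        float singleMatch = 0.6f;               // made-up score for "Rich"
        System.out.println(summedTwoMatches * coord(2, 2));
        System.out.println(singleMatch * coord(1, 2));
    }
}
```

Any other decreasing function of overlap would work as well; the point is only that the factor drops below 1 once more than one sub-clause matches.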

-- 
lucidimagination.com




RE: DisjunctionMaxQuery and scoring

2012-04-19 Thread Uwe Schindler
Hi,
> I think
>  BooleanQuery bq = new BooleanQuery(false); doesn't quite accomplish the
> desired "name IN (dick, rich)" scoring behavior. This is because
> (name:dick | name:rich) with coord=false would score the 'document'
> "Dick Rich" higher than "Rich" because the former has two term matches
> and the latter only one.
> In contrast, I think the desire is that one and only one of the terms in
> the document match those in the BooleanQuery so that "Rich" would score
> higher than "Dick Rich", given document length normalization. It's almost
> like a desire for
>   BooleanQuery bq = new BooleanQuery(false);
>   bq.set*Maximum*NumberShouldMatch(1);

In that case DisjunctionMaxQuery is the way to go: it will only count the hit
with the highest score and not add scores (coord or not coord doesn't matter
here).





Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
FWIW, there seems to be an explain bug in 2.9.1 that is fixed in
3.6.0, so I'm no longer confused about the actual behavior.


On Thu, Apr 19, 2012 at 8:32 PM, David Murgatroyd  wrote:
> [apologies for the earlier errant send]
>
> I think
>  BooleanQuery bq = new BooleanQuery(false);
> doesn't quite accomplish the desired "name IN (dick, rich)" scoring
> behavior. This is because (name:dick | name:rich) with coord=false would
> score the 'document' "Dick Rich" higher than "Rich" because the former has
> two term matches and the latter only one. In contrast, I think the desire
> is that one and only one of the terms in the document match those in the
> BooleanQuery so that "Rich" would score higher than "Dick Rich", given
> document length normalization. It's almost like a desire for
> BooleanQuery bq = new BooleanQuery(false);
>  bq.set*Maximum*NumberShouldMatch(1);
>
> Is there a good way to accomplish this?
>
> On Thu, Apr 19, 2012 at 7:37 PM, Robert Muir  wrote:
>
>> On Thu, Apr 19, 2012 at 6:36 PM, Benson Margulies 
>> wrote:
>> > I see why I'm so confused, but I think I need to construct a simpler
>> test case.
>> >
>> > My top-level BooleanQuery, which has disableCoord=false, has 22
>> > clauses. All but three are ordinary SHOULD TermQueries. the remainder
>> > are a spanNear and a nested BooleanQuery, and an empty PhraseQuery
>> > (that's a bug).
>> >
>> > However, at the end of the explain trace, I see:
>> >
>> > 0.45 = coord(9/20) I think that my nested Boolean, for which I've been
>> > flipping coord on and off to see what happens, is somehow not
>> > participating at all. So switching it's coord on and off has no
>> > effect.
>> >
>> > Why 20? Why not 22? Is this just an explain quirk?
>>
>> I am not sure (also not sure i understand your example totally), but
>> at the same time could be as simple as the fact you have 2 prohibited
>> (MUST_NOT) clauses. These don't count towards coord()
>>
>> I think its hard to tell from your description (just since it doesn't
>> have all the details). an explain or test case or something like that
>> would might be more efficient if its still not making sense...
>>
>> --
>> lucidimagination.com
>>



Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread David Murgatroyd
[apologies for the earlier errant send]

I think
 BooleanQuery bq = new BooleanQuery(false);
doesn't quite accomplish the desired "name IN (dick, rich)" scoring
behavior. This is because (name:dick | name:rich) with coord=false would
score the 'document' "Dick Rich" higher than "Rich" because the former has
two term matches and the latter only one. In contrast, I think the desire
is that one and only one of the terms in the document match those in the
BooleanQuery so that "Rich" would score higher than "Dick Rich", given
document length normalization. It's almost like a desire for
BooleanQuery bq = new BooleanQuery(false);
  bq.set*Maximum*NumberShouldMatch(1);

Is there a good way to accomplish this?

On Thu, Apr 19, 2012 at 7:37 PM, Robert Muir  wrote:

> On Thu, Apr 19, 2012 at 6:36 PM, Benson Margulies 
> wrote:
> > I see why I'm so confused, but I think I need to construct a simpler
> test case.
> >
> > My top-level BooleanQuery, which has disableCoord=false, has 22
> > clauses. All but three are ordinary SHOULD TermQueries. the remainder
> > are a spanNear and a nested BooleanQuery, and an empty PhraseQuery
> > (that's a bug).
> >
> > However, at the end of the explain trace, I see:
> >
> > 0.45 = coord(9/20) I think that my nested Boolean, for which I've been
> > flipping coord on and off to see what happens, is somehow not
> > participating at all. So switching it's coord on and off has no
> > effect.
> >
> > Why 20? Why not 22? Is this just an explain quirk?
>
> I am not sure (also not sure i understand your example totally), but
> at the same time could be as simple as the fact you have 2 prohibited
> (MUST_NOT) clauses. These don't count towards coord()
>
> I think its hard to tell from your description (just since it doesn't
> have all the details). an explain or test case or something like that
> would might be more efficient if its still not making sense...
>
> --
> lucidimagination.com
>
>
>


Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Robert Muir
On Thu, Apr 19, 2012 at 6:36 PM, Benson Margulies  wrote:
> I see why I'm so confused, but I think I need to construct a simpler test 
> case.
>
> My top-level BooleanQuery, which has disableCoord=false, has 22
> clauses. All but three are ordinary SHOULD TermQueries. the remainder
> are a spanNear and a nested BooleanQuery, and an empty PhraseQuery
> (that's a bug).
>
> However, at the end of the explain trace, I see:
>
> 0.45 = coord(9/20) I think that my nested Boolean, for which I've been
> flipping coord on and off to see what happens, is somehow not
> participating at all. So switching it's coord on and off has no
> effect.
>
> Why 20? Why not 22? Is this just an explain quirk?

I am not sure (I'm also not sure I understand your example totally), but
it could be as simple as the fact that you have 2 prohibited
(MUST_NOT) clauses. These don't count towards coord().

I think it's hard to tell from your description (just because it doesn't
have all the details). An explain output or a test case would be more
efficient if it's still not making sense...
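
If that guess is right, the arithmetic would look like the following plain-Java sketch. It is a toy model and not Lucene code: the Occur enum, the class, and the helper names are hypothetical, and the clause mix (20 counted clauses plus 2 MUST_NOT) is an assumption made only to reproduce the 9/20 from the explain output.

```java
import java.util.ArrayList;
import java.util.List;

public class MaxCoordSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // Build a hypothetical clause list: 'counted' SHOULD clauses plus
    // 'prohibited' MUST_NOT clauses.
    static List<Occur> buildClauses(int counted, int prohibited) {
        List<Occur> clauses = new ArrayList<Occur>();
        for (int i = 0; i < counted; i++) {
            clauses.add(Occur.SHOULD);
        }
        for (int i = 0; i < prohibited; i++) {
            clauses.add(Occur.MUST_NOT);
        }
        return clauses;
    }

    // The denominator of coord: prohibited clauses are skipped.
    static int maxCoord(List<Occur> clauses) {
        int max = 0;
        for (Occur occur : clauses) {
            if (occur != Occur.MUST_NOT) {
                max++;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<Occur> clauses = buildClauses(20, 2);   // 22 clauses in total
        int max = maxCoord(clauses);
        System.out.println("coord(9/" + max + ") = " + 9 / (float) max);
    }
}
```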

-- 
lucidimagination.com




Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread David Murgatroyd




On Apr 19, 2012, at 6:36 PM, Benson Margulies  wrote:

> I see why I'm so confused, but I think I need to construct a simpler test 
> case.
> 
> My top-level BooleanQuery, which has disableCoord=false, has 22
> clauses. All but three are ordinary SHOULD TermQueries. the remainder
> are a spanNear and a nested BooleanQuery, and an empty PhraseQuery
> (that's a bug).
> 
> However, at the end of the explain trace, I see:
> 
> 0.45 = coord(9/20) I think that my nested Boolean, for which I've been
> flipping coord on and off to see what happens, is somehow not
> participating at all. So switching it's coord on and off has no
> effect.
> 
> Why 20? Why not 22? Is this just an explain quirk? Should I shove all
> this code up to 3.6 from 2.9.3 before bugging you further?
> 



Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
I see why I'm so confused, but I think I need to construct a simpler test case.

My top-level BooleanQuery, which has disableCoord=false, has 22
clauses. All but three are ordinary SHOULD TermQueries. The remainder
are a spanNear, a nested BooleanQuery, and an empty PhraseQuery
(that's a bug).

However, at the end of the explain trace, I see:

0.45 = coord(9/20)

I think that my nested Boolean, for which I've been
flipping coord on and off to see what happens, is somehow not
participating at all. So switching its coord on and off has no
effect.

Why 20? Why not 22? Is this just an explain quirk? Should I shove all
this code up to 3.6 from 2.9.3 before bugging you further?




Re: Two questions on RussianAnalyzer

2012-04-19 Thread Vladimir Gubarkov
Thank you Steven,

I'll look into this

On Fri, Apr 20, 2012 at 12:43 AM, Steven A Rowe  wrote:
> Hi Vladimir,
>
>> The most uncomfortable in new behaviour to me is that in past I used
>> to search by subdomain like bbb.com: and have displayed results
>> with www.bbb.com:, aaa.bbb.com: and so on. Now I have 0
>> results.
>
> About domain names, see my response to a similar question today on the Solr 
> users list: .
>
> Steve
>




Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
On Thu, Apr 19, 2012 at 5:10 PM, Robert Muir  wrote:
> On Thu, Apr 19, 2012 at 5:05 PM, Benson Margulies  
> wrote:
>> On Thu, Apr 19, 2012 at 4:21 PM, Robert Muir  wrote:
>>> On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies  
>>> wrote:
>>>> On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir wrote:
>>>>> On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies wrote:
>>>>>> I am trying to solve a problem using DisjunctionMaxQuery.
>>>>>>
>>>>>>
>>>>>> Consider a query like:
>>>>>>
>>>>>> a:b OR c:d OR e:f OR ...
>>>>>> name:richard OR name:dick OR name:dickie OR name:rich ...
>>>>>>
>>>>>> At most, one of the richard names matches. So the match score gets
>>>>>> dragged down by the long list of things that don't match, as the list
>>>>>> can get quite long.
>>>>>>
>>>>>> It seemed to me, upon reading the documentation, that I could cure
>>>>>> this problem by creating a query tree that used DisjunctionMaxQuery
>>>>>> around all those nicknames. However, when I built a boolean query that
>>>>>> had, as a clause, a DisjunctionMaxQuery in the place of a pile of
>>>>>> these individual Term queries, the score and the explanation did not
>>>>>> change at all -- in particular, the coord term shows the same number
>>>>>> of total terms. So it looks as if the children of the disjunction
>>>>>> still count.
>>>>>>
>>>>>> Is there a way to control that term? Or a better way to express this?
>>>>>> Thinking SQL for a moment, what I'm trying to express is
>>>>>>
>>>>>>   name IN (richard, dick, dickie, rich)
>>>>>>
>>>>>
>>>>> I think you just want to disable coord() here? You can do this for
>>>>> that particular boolean query by passing true to the ctor:
>>>>>
>>>>>  public BooleanQuery(boolean disableCoord)
>>>>
>>>> Rob,
>>>>
>>>> How do nested queries work with respect to this? If I build a boolean
>>>> query one of whose clauses is a BooleanQuery with coord turned off,
>>>> does just the nested query insides get left out of 'coord'?
>>>>
>>>> If so, then your answer certainly seems to be what the doctor ordered.
>>>>
>>>
>>> it applies only to that query itself. So if this BQ is a clause to
>>> another BQ that has coord enabled,
>>> that would not change the top-level BQ's coord.
>>>
>>> Note: if you don't want coord at all, then you can also plug in a
>>> Similarity that returns 1,
>>> or pick another Similarity like BM25: in trunk only the vector space
>>> impl even does anything for coord()
>>
>> Robert, I'm sorry that my density is approaching lead. My problem is
>> that I want coord, but I want to control which terms are counted and
>> which are not. I suppose I can accomplish this with my own scorer. My
>> hope was that there was a way to express "This group of terms counts
>> as one for coord".
>
> So just structure your boolean query appropriately?
>
> BQ1(coord=true)
>  BQ2(coord=false): 25 terms
>  BQ3(coord=false): 87 terms
>
> BQ1's coord is based on how many subscorers match (out of 2, BQ2 and
> BQ3). If both match its 2/2 otherwise 1/2.
>
> But in this example BQ2 and BQ3 disable coord themselves, hiding the
> fact they accept 25 and 87 terms respectively and appearing as a
> single sub for coord().
>
> Does this make sense? you can extend this idea to control this however
> you want by structuring the BQ appropriately so your BQ's with
> "synonyms" have coord=0

Robert,

This makes perfect sense, it is what I thought you meant to begin
with. I tried it and thought that it did not work. Or, perhaps, I am
misreading the 'explain' output. Or, more likely, I goofed altogether.
I'll go back and recheck my results and post some explain output if I
can't find my mistake.

--benson




>
> --
> lucidimagination.com
>



Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Robert Muir
On Thu, Apr 19, 2012 at 5:05 PM, Benson Margulies  wrote:
> On Thu, Apr 19, 2012 at 4:21 PM, Robert Muir  wrote:
>> On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies  
>> wrote:
>>> On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir  wrote:
>>>> On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies wrote:
>>>>> I am trying to solve a problem using DisjunctionMaxQuery.
>>>>>
>>>>>
>>>>> Consider a query like:
>>>>>
>>>>> a:b OR c:d OR e:f OR ...
>>>>> name:richard OR name:dick OR name:dickie OR name:rich ...
>>>>>
>>>>> At most, one of the richard names matches. So the match score gets
>>>>> dragged down by the long list of things that don't match, as the list
>>>>> can get quite long.
>>>>>
>>>>> It seemed to me, upon reading the documentation, that I could cure
>>>>> this problem by creating a query tree that used DisjunctionMaxQuery
>>>>> around all those nicknames. However, when I built a boolean query that
>>>>> had, as a clause, a DisjunctionMaxQuery in the place of a pile of
>>>>> these individual Term queries, the score and the explanation did not
>>>>> change at all -- in particular, the coord term shows the same number
>>>>> of total terms. So it looks as if the children of the disjunction
>>>>> still count.
>>>>>
>>>>> Is there a way to control that term? Or a better way to express this?
>>>>> Thinking SQL for a moment, what I'm trying to express is
>>>>>
>>>>>   name IN (richard, dick, dickie, rich)
>>>>>
>>>>
>>>> I think you just want to disable coord() here? You can do this for
>>>> that particular boolean query by passing true to the ctor:
>>>>
>>>>  public BooleanQuery(boolean disableCoord)
>>>
>>> Rob,
>>>
>>> How do nested queries work with respect to this? If I build a boolean
>>> query one of whose clauses is a BooleanQuery with coord turned off,
>>> does just the nested query insides get left out of 'coord'?
>>>
>>> If so, then your answer certainly seems to be what the doctor ordered.
>>>
>>
>> it applies only to that query itself. So if this BQ is a clause to
>> another BQ that has coord enabled,
>> that would not change the top-level BQ's coord.
>>
>> Note: if you don't want coord at all, then you can also plug in a
>> Similarity that returns 1,
>> or pick another Similarity like BM25: in trunk only the vector space
>> impl even does anything for coord()
>
> Robert, I'm sorry that my density is approaching lead. My problem is
> that I want coord, but I want to control which terms are counted and
> which are not. I suppose I can accomplish this with my own scorer. My
> hope was that there was a way to express "This group of terms counts
> as one for coord".

So just structure your boolean query appropriately?

BQ1(coord=true)
  BQ2(coord=false): 25 terms
  BQ3(coord=false): 87 terms

BQ1's coord is based on how many subscorers match (out of 2: BQ2 and
BQ3). If both match it's 2/2, otherwise 1/2.

But in this example BQ2 and BQ3 disable coord themselves, hiding the
fact that they accept 25 and 87 terms respectively and appearing as a
single sub for coord().

Does this make sense? You can extend this idea to control this however
you want by structuring the BQ appropriately, so that your BQs with
"synonyms" have coord disabled.
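
The effect of grouping can be sketched in plain Java without Lucene. The coord formula below (overlap divided by the maximum overlap) is the classic default; the class and method names are hypothetical, and the counts of matching terms are made up for illustration.

```java
public class NestedCoordSketch {
    // Classic default coord factor: fraction of counted clauses that matched.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Top-level coord when terms are grouped into sub-queries: each group
    // counts as a single clause and matches if any of its terms matched.
    static float groupedCoord(int[] matchedTermsPerGroup) {
        int overlap = 0;
        for (int matched : matchedTermsPerGroup) {
            if (matched > 0) {
                overlap++;
            }
        }
        return coord(overlap, matchedTermsPerGroup.length);
    }

    public static void main(String[] args) {
        // Flat query: 25 + 87 = 112 term clauses, two of them match.
        System.out.println("flat:    " + coord(2, 25 + 87));
        // Grouped as BQ2 (25 terms) and BQ3 (87 terms), one match in each group.
        System.out.println("grouped: " + groupedCoord(new int[] {1, 1}));
        // Only one of the two groups matches.
        System.out.println("grouped: " + groupedCoord(new int[] {1, 0}));
    }
}
```

The flat layout is penalized for the 110 synonyms that did not match, while the grouped layout only sees two clauses.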

-- 
lucidimagination.com




Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
On Thu, Apr 19, 2012 at 4:21 PM, Robert Muir  wrote:
> On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies  
> wrote:
>> On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir  wrote:
>>> On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies  
>>> wrote:
>>>> I am trying to solve a problem using DisjunctionMaxQuery.
>>>>
>>>>
>>>> Consider a query like:
>>>>
>>>> a:b OR c:d OR e:f OR ...
>>>> name:richard OR name:dick OR name:dickie OR name:rich ...
>>>>
>>>> At most, one of the richard names matches. So the match score gets
>>>> dragged down by the long list of things that don't match, as the list
>>>> can get quite long.
>>>>
>>>> It seemed to me, upon reading the documentation, that I could cure
>>>> this problem by creating a query tree that used DisjunctionMaxQuery
>>>> around all those nicknames. However, when I built a boolean query that
>>>> had, as a clause, a DisjunctionMaxQuery in the place of a pile of
>>>> these individual Term queries, the score and the explanation did not
>>>> change at all -- in particular, the coord term shows the same number
>>>> of total terms. So it looks as if the children of the disjunction
>>>> still count.
>>>>
>>>> Is there a way to control that term? Or a better way to express this?
>>>> Thinking SQL for a moment, what I'm trying to express is
>>>>
>>>>   name IN (richard, dick, dickie, rich)
>>>>
>>>
>>> I think you just want to disable coord() here? You can do this for
>>> that particular boolean query by passing true to the ctor:
>>>
>>>  public BooleanQuery(boolean disableCoord)
>>
>> Rob,
>>
>> How do nested queries work with respect to this? If I build a boolean
>> query one of whose clauses is a BooleanQuery with coord turned off,
>> does just the nested query insides get left out of 'coord'?
>>
>> If so, then your answer certainly seems to be what the doctor ordered.
>>
>
> it applies only to that query itself. So if this BQ is a clause to
> another BQ that has coord enabled,
> that would not change the top-level BQ's coord.
>
> Note: if you don't want coord at all, then you can also plug in a
> Similarity that returns 1,
> or pick another Similarity like BM25: in trunk only the vector space
> impl even does anything for coord()

Robert, I'm sorry that my density is approaching lead. My problem is
that I want coord, but I want to control which terms are counted and
which are not. I suppose I can accomplish this with my own scorer. My
hope was that there was a way to express "This group of terms counts
as one for coord".

In other words, for a subset of fields in the query, I want to scale
the entire score by the fraction of them that match.

Another way to think about this, which might be no use at all, is to
wonder: is there a way to charge a score penalty for failure to match
a particular query term? That would, from another direction, address
the underlying effect I'm trying to get.



>
>
> --
> lucidimagination.com
>



Re: Two questions on RussianAnalyzer

2012-04-19 Thread Robert Muir
On Thu, Apr 19, 2012 at 4:51 PM, Vladimir Gubarkov  wrote:
> So it's now imposible to find this document with query: "site.com".
> I'm having an RSS subscription for that search, and now it's broken.
>

Just to point out, it's not impossible. As I suggested before, if you
were happy with the old tokenizer and you don't like passing LUCENE_30,
then just make your own analyzer, using ClassicTokenizer +  instead.



-- 
lucidimagination.com




Re: Two questions on RussianAnalyzer

2012-04-19 Thread Robert Muir
On Thu, Apr 19, 2012 at 4:51 PM, Vladimir Gubarkov  wrote:

> Hmmm... I know this and I reindexed!
> I'll try to explain the problem (fortunately, already solved by using
> LUCENE_30) ones again:
> When indexing with new analyzer the whole lexeme "some.cool.site.com"
> goes to index, not 4 lexems "some", "cool", "site", "com".
> So it's now imposible to find this document with query: "site.com".
> I'm having an RSS subscription for that search, and now it's broken.
>

You are right: to search for prefixes of that (assuming it's a URL,
which it may or may not be, depending on the domain and use case),
you need something else. So I think Steven Rowe's advice is best here.


-- 
lucidimagination.com




Re: Two questions on RussianAnalyzer

2012-04-19 Thread Vladimir Gubarkov
Thank you, Robert, for the detailed reply.

On Fri, Apr 20, 2012 at 12:37 AM, Robert Muir  wrote:
> On Thu, Apr 19, 2012 at 7:26 AM, Vladimir Gubarkov  wrote:
>> New analyzer:
>> [aaa.bbb.com, , a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q,
>> r, s, t, u, v, z, y, z]
>> Old analyzer:
>> [aaa, bbb, com, , a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p,
>> q, r, s, t, u, v, z, y, z]
>>
>> Please note the differences.
>
> Right, the tokenizer has changed. This is mentioned in the javadocs:
> http://lucene.apache.org/core/3_6_0/api/contrib-analyzers/org/apache/lucene/analysis/ru/RussianAnalyzer.html
>
>>
>> The most uncomfortable in new behaviour to me is that in past I used
>> to search by subdomain like
>> bbb.com:
>> and have displayed results with www.bbb.com:, aaa.bbb.com: and
>> so on. Now I have 0 results.
>
> Don't simply set your version parameter to 3.6 without reindexing.
> This is really important!!!
> Otherwise it defeats the whole purpose.
>

Hmmm... I know this and I reindexed!
I'll try to explain the problem (fortunately, already solved by using
LUCENE_30) once again:
When indexing with the new analyzer, the whole lexeme "some.cool.site.com"
goes into the index, not the 4 lexemes "some", "cool", "site", "com".
So it's now impossible to find this document with the query "site.com".
I have an RSS subscription for that search, and now it's broken.
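
The mismatch can be reproduced with a toy model in plain Java. This is not how the real Lucene tokenizers are implemented: the two split rules below are deliberately crude stand-ins (one breaks on dots, one keeps dotted names whole), the match check ignores token positions, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class TokenSplitSketch {
    // Stand-in for the old behaviour: dots separate tokens.
    static List<String> splitOnDots(String text) {
        return Arrays.asList(text.split("[.\\s]+"));
    }

    // Stand-in for the new behaviour: a dotted name stays one token.
    static List<String> keepDottedNames(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    // A document matches if every query token was indexed.
    static boolean matchesAll(List<String> indexed, List<String> query) {
        return indexed.containsAll(query);
    }

    public static void main(String[] args) {
        String doc = "some.cool.site.com";
        String query = "site.com";
        System.out.println(matchesAll(splitOnDots(doc), splitOnDots(query)));
        System.out.println(matchesAll(keepDottedNames(doc), keepDottedNames(query)));
    }
}
```

In the first case the query tokens "site" and "com" are both in the index; in the second the only indexed token is the full host name, so "site.com" finds nothing.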




RE: Two questions on RussianAnalyzer

2012-04-19 Thread Steven A Rowe
Hi Vladimir,

> The most uncomfortable in new behaviour to me is that in past I used
> to search by subdomain like bbb.com: and have displayed results
> with www.bbb.com:, aaa.bbb.com: and so on. Now I have 0
> results.

About domain names, see my response to a similar question today on the Solr 
users list: . 

Steve



Re: Two questions on RussianAnalyzer

2012-04-19 Thread Robert Muir
On Thu, Apr 19, 2012 at 7:26 AM, Vladimir Gubarkov  wrote:
> New analyzer:
> [aaa.bbb.com, , a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q,
> r, s, t, u, v, z, y, z]
> Old analyzer:
> [aaa, bbb, com, , a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p,
> q, r, s, t, u, v, z, y, z]
>
> Please note the differences.

Right, the tokenizer has changed. This is mentioned in the javadocs:
http://lucene.apache.org/core/3_6_0/api/contrib-analyzers/org/apache/lucene/analysis/ru/RussianAnalyzer.html

>
> The most uncomfortable in new behaviour to me is that in past I used
> to search by subdomain like
> bbb.com:
> and have displayed results with www.bbb.com:, aaa.bbb.com: and
> so on. Now I have 0 results.

Don't simply set your version parameter to 3.6 without reindexing.
This is really important!!!
Otherwise it defeats the whole purpose.

>
> My questions are: 1) is this change by design (not a mistake) and
> 2) is the only option to achieve the old behaviour to use
> Version.LUCENE_30 for creating the analyzer?

No, this analyzer is just an example. You can always easily make your
own analyzer (just extend ReusableAnalyzerBase)
with maybe 6 or 7 lines of code to combine whatever Tokenizer
(ClassicTokenizer, UAX29URLEmailTokenizer, StandardTokenizer, whatever),
along with any combination of filters (such as stemmers, whatever)
that you want.

>
> The other problem with RussionAnalyzer is with the letter Yo
> http://en.wikipedia.org/wiki/Yo_(Cyrillic) which in russian often
> replaced by letter Ye http://en.wikipedia.org/wiki/Ye_(Cyrillic), and
> such words are considered same.
> What I want to achieve is that my search by word with yo also yield
> words with this letter replaced to ye (and vice-versa).
>
> What I'm currently doing is roughly next:
>
> // NOTE: I have to define my class in this package, because method
> russianAnalyzer.createComponents is protected
> package org.apache.lucene.analysis.ru;

Don't try so hard to extend existing analyzers. Just make your own.
They are just examples.

>
> public class YoCharFilter extends CharFilter {

CharFilter is really mostly for cases where you need to correct offsets
(for highlighting and such) before the tokenizer.

But you don't; this is a simple 1-1 mapping that won't affect
tokenization. It's just a trivial normalization. I would use a
TokenFilter instead.
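
The mapping itself is tiny. The plain-Java sketch below shows only the character replacement, not a Lucene TokenFilter; in a real filter the same replacement would presumably be applied to each token's term text. The class and method names are hypothetical.

```java
public class YoNormalizerSketch {
    // Replace Cyrillic yo with ye, in both lower and upper case.
    // The mapping is one-to-one, so the term length does not change.
    static String normalizeYo(String term) {
        return term
            .replace('\u0451', '\u0435')   // small yo -> small ye
            .replace('\u0401', '\u0415');  // capital Yo -> capital Ye
    }

    public static void main(String[] args) {
        String withYo = "\u0451\u043b\u043a\u0430";   // "yolka" spelled with yo
        String withYe = "\u0435\u043b\u043a\u0430";   // same word spelled with ye
        System.out.println(normalizeYo(withYo).equals(normalizeYo(withYe)));
    }
}
```

Applying the same normalization at index time and at query time makes both spellings land on the same term.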

> Maybe it may have sense to add a configuration option to
> RussianAnalyzer itself (distinguish or not yo & ye)?

I don't think so. There are tons of choices here (for example we
provide 2 stemming options for Russian, and more exist), and
even this normalization is complicated: for example some people might
have documents where the Russian ye is actually a Latin 'e' and so on.
I've seen it, so I know it exists.

So we generally just provide XYZ_Analyzer as an example mostly, it
would be a lot for us to add every possible use case as an option
to every possible Analyzer. Instead just make your own!

-- 
lucidimagination.com




Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Robert Muir
On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies  wrote:
> On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir  wrote:
>> On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies  
>> wrote:
>>> I am trying to solve a problem using DisjunctionMaxQuery.
>>>
>>>
>>> Consider a query like:
>>>
>>> a:b OR c:d OR e:f OR ...
>>> name:richard OR name:dick OR name:dickie OR name:rich ...
>>>
>>> At most, one of the richard names matches. So the match score gets
>>> dragged down by the long list of things that don't match, as the list
>>> can get quite long.
>>>
>>> It seemed to me, upon reading the documentation, that I could cure
>>> this problem by creating a query tree that used DisjunctionMaxQuery
>>> around all those nicknames. However, when I built a boolean query that
>>> had, as a clause, a DisjunctionMaxQuery in the place of a pile of
>>> these individual Term queries, the score and the explanation did not
>>> change at all -- in particular, the coord term shows the same number
>>> of total terms. So it looks as if the children of the disjunction
>>> still count.
>>>
>>> Is there a way to control that term? Or a better way to express this?
>>> Thinking SQL for a moment, what I'm trying to express is
>>>
>>>   name IN (richard, dick, dickie, rich)
>>>
>>
>> I think you just want to disable coord() here? You can do this for
>> that particular boolean query by passing true to the ctor:
>>
>>  public BooleanQuery(boolean disableCoord)
>
> Rob,
>
> How do nested queries work with respect to this? If I build a boolean
> query one of whose clauses is a BooleanQuery with coord turned off,
> does just the nested query insides get left out of 'coord'?
>
> If so, then your answer certainly seems to be what the doctor ordered.
>

it applies only to that query itself. So if this BQ is a clause to
another BQ that has coord enabled,
that would not change the top-level BQ's coord.

Note: if you don't want coord at all, then you can also plug in a
Similarity that returns 1,
or pick another Similarity like BM25: in trunk only the vector space
impl even does anything for coord()


-- 
lucidimagination.com




Re: Two questions on RussianAnalyzer

2012-04-19 Thread Vladimir Gubarkov
On Thu, Apr 19, 2012 at 7:57 PM, Uwe Schindler  wrote:
>> My questions are: 1) is this change by design (not a mistake), and
>> 2) is using Version.LUCENE_30 when creating the analyzer the only
>> way to get the old behaviour?
>
> This is why this option is there!

Right, and it's great, but this doesn't actually answer my questions =).





Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
Turning on disableCoord for a nested boolean query does not seem to
change the overall maxCoord term as displayed in explain.




Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir wrote:
> On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies wrote:
>> I am trying to solve a problem using DisjunctionMaxQuery.
>>
>>
>> Consider a query like:
>>
>> a:b OR c:d OR e:f OR ...
>> name:richard OR name:dick OR name:dickie OR name:rich ...
>>
>> At most, one of the richard names matches. So the match score gets
>> dragged down by the long list of things that don't match, as the list
>> can get quite long.
>>
>> It seemed to me, upon reading the documentation, that I could cure
>> this problem by creating a query tree that used DisjunctionMaxQuery
>> around all those nicknames. However, when I built a boolean query that
>> had, as a clause, a DisjunctionMaxQuery in the place of a pile of
>> these individual Term queries, the score and the explanation did not
>> change at all -- in particular, the coord term shows the same number
>> of total terms. So it looks as if the children of the disjunction
>> still count.
>>
>> Is there a way to control that term? Or a better way to express this?
>> Thinking SQL for a moment, what I'm trying to express is
>>
>>   name IN (richard, dick, dickie, rich)
>>
>
> I think you just want to disable coord() here? You can do this for
> that particular boolean query by passing true to the ctor:
>
>  public BooleanQuery(boolean disableCoord)

Rob,

How do nested queries work with respect to this? If I build a boolean
query one of whose clauses is a BooleanQuery with coord turned off,
are only the nested query's own clauses left out of 'coord'?

If so, then your answer certainly seems to be what the doctor ordered.

--benson


>
> --
> lucidimagination.com
>
>




Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Robert Muir
On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies  wrote:
> I am trying to solve a problem using DisjunctionMaxQuery.
>
>
> Consider a query like:
>
> a:b OR c:d OR e:f OR ...
> name:richard OR name:dick OR name:dickie OR name:rich ...
>
> At most, one of the richard names matches. So the match score gets
> dragged down by the long list of things that don't match, as the list
> can get quite long.
>
> It seemed to me, upon reading the documentation, that I could cure
> this problem by creating a query tree that used DisjunctionMaxQuery
> around all those nicknames. However, when I built a boolean query that
> had, as a clause, a DisjunctionMaxQuery in the place of a pile of
> these individual Term queries, the score and the explanation did not
> change at all -- in particular, the coord term shows the same number
> of total terms. So it looks as if the children of the disjunction
> still count.
>
> Is there a way to control that term? Or a better way to express this?
> Thinking SQL for a moment, what I'm trying to express is
>
>   name IN (richard, dick, dickie, rich)
>

I think you just want to disable coord() here? You can do this for
that particular boolean query by passing true to the ctor:

  public BooleanQuery(boolean disableCoord)
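
As a rough illustration of what that flag changes, here is a minimal sketch in plain Java (not Lucene code), assuming the classic coord formula overlap / maxOverlap applied to the sum of the matching clause scores. The class name, method names and score values are made up for the example.

```java
public final class FlatCoordSketch {
    /** Classic coord factor: fraction of the query's clauses that matched. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Sum of the matching clause scores, scaled by coord unless coord is disabled. */
    static float score(float[] matchingScores, int totalClauses, boolean disableCoord) {
        float sum = 0f;
        for (float s : matchingScores) {
            sum += s;
        }
        return disableCoord ? sum : sum * coord(matchingScores.length, totalClauses);
    }

    public static void main(String[] args) {
        // Only name:rich matched out of four nickname clauses (hypothetical score 0.5).
        float[] matched = {0.5f};
        System.out.println(score(matched, 4, false)); // prints 0.125
        System.out.println(score(matched, 4, true));  // prints 0.5
    }
}
```

With coord enabled, every extra non-matching nickname shrinks the score; with it disabled, the list length no longer matters for that query.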

-- 
lucidimagination.com




DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
I am trying to solve a problem using DisjunctionMaxQuery.


Consider a query like:

a:b OR c:d OR e:f OR ...
name:richard OR name:dick OR name:dickie OR name:rich ...

At most, one of the richard names matches. So the match score gets
dragged down by the long list of things that don't match, as the list
can get quite long.

It seemed to me, upon reading the documentation, that I could cure
this problem by creating a query tree that used DisjunctionMaxQuery
around all those nicknames. However, when I built a boolean query that
had, as a clause, a DisjunctionMaxQuery in the place of a pile of
these individual Term queries, the score and the explanation did not
change at all -- in particular, the coord term shows the same number
of total terms. So it looks as if the children of the disjunction
still count.

Is there a way to control that term? Or a better way to express this?
Thinking SQL for a moment, what I'm trying to express is

   name IN (richard, dick, dickie, rich)

as a single term query. Reading the javadoc, I am seeing
MultiTermQuery, and I'm wondering whether it is what we want.
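
For comparison, here is a minimal sketch in plain Java (not Lucene code) of the scoring rule the DisjunctionMaxQuery javadoc describes: the best sub-query score, plus a tie-breaker multiple of the remaining scores. The sub-scores below are made up, and booleanSum is a hypothetical stand-in for a boolean OR with coord left out.

```java
public final class DisMaxSketch {
    /** Max of the matching sub-scores plus tieBreaker times the rest. */
    static float disMax(float[] scores, float tieBreaker) {
        float max = 0f;
        float sum = 0f;
        for (float s : scores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + (sum - max) * tieBreaker;
    }

    /** Plain sum of the matching sub-scores, as a boolean OR without coord would give. */
    static float booleanSum(float[] scores) {
        float sum = 0f;
        for (float s : scores) {
            sum += s;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two nicknames match the same document (hypothetical scores).
        float[] matched = {0.5f, 0.25f};
        System.out.println(booleanSum(matched));   // prints 0.75
        System.out.println(disMax(matched, 0.0f)); // prints 0.5
    }
}
```

With a tie-breaker of 0, a document matching several nicknames scores the same as one matching only the best of them, which is the IN-style behaviour being asked about.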




RE: Two questions on RussianAnalyzer

2012-04-19 Thread Uwe Schindler
> My questions are: 1) is this change by design (not a mistake), and
> 2) is using Version.LUCENE_30 when creating the analyzer the only
> way to get the old behaviour?

This is why this option is there!

