.net lucene doubt

2009-07-15 Thread m.harig

Hello all,

I am using Lucene.Net for my search application. How do I index non-English
pages? Are there any analyzers for this? I am struggling with a UTF-8
problem. Any help would be appreciated.
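
One common source of UTF-8 trouble when indexing is decoding the page bytes with the wrong charset before the text ever reaches an analyzer. The post does not say what the actual failure was, so the following is only a minimal sketch under that assumption. It is written in Java (the language of this list) rather than C#, uses only the JDK, and the class and method names are made up for illustration; none of it is Lucene or Lucene.Net API.

```java
import java.nio.charset.StandardCharsets;

final class Utf8DecodeSketch {
    /** Decodes raw page bytes as UTF-8 before the text is handed to any analyzer. */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Shows the typical failure: the same bytes read as ISO-8859-1. */
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "M\u00fcller" contains one non-ASCII character (u with diaeresis).
        byte[] raw = "M\u00fcller".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(raw));   // the original six characters
        System.out.println(decodeLatin1(raw)); // seven characters, two of them garbage
    }
}
```

If the decoded string already looks wrong at this point, no choice of analyzer will repair it, so it may be worth checking the decoding step first.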
-- 
View this message in context: 
http://www.nabble.com/.net-lucene-doubt-tp24510928p24510928.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev

Well, the QA team is not there, and I am "abusing" the customer's sysadmin, and it
will cost me only a beer if I stop now :)

I will post traces tomorrow, daylight does better... I will have them done on the
trunk version (fixed two bugs)...



- Original Message 
> From: Michael McCandless 
> To: java-user@lucene.apache.org
> Sent: Thursday, 16 July, 2009 1:32:21
> Subject: Re: speed of BooleanQueries on 2.9
> 
> On Wed, Jul 15, 2009 at 7:13 PM, eks dev wrote:
> 
> >> Are you sure when you ran the test you called
> >> setAllowDocsOutOfOrder(true)?
> >
> > right, just a second this is static... we have two indices, something
> > runs first and sets it to false... ouch, I hate statics... they make you
> > believe you can set them during construction...
> 
> Well, that method will be removed soon ;)
> 
> Mike
> 



Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
On Wed, Jul 15, 2009 at 7:13 PM, eks dev wrote:

>>Are you sure when you ran the test you called
>> setAllowDocsOutOfOrder(true)?
>
> right, just a second this is static... we have two indices, something
> runs first and sets it to false... ouch, I hate statics... they make you
> believe you can set them during construction...

Well, that method will be removed soon ;)

Mike




Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev

warmduscher :)

good night 



- Original Message 
> From: Uwe Schindler 
> To: java-user@lucene.apache.org
> Sent: Thursday, 16 July, 2009 1:06:30
> Subject: RE: speed of BooleanQueries on 2.9
> 
> Same here, too late! Good night!
> And the blood glucose level is very low, too - very bad for such problems...
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
> > -Original Message-
> > From: Michael McCandless [mailto:luc...@mikemccandless.com]
> > Sent: Thursday, July 16, 2009 12:59 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: speed of BooleanQueries on 2.9
> > 
> > On Wed, Jul 15, 2009 at 6:52 PM, eks dev wrote:
> > 
> > > Also not really expected, but this query runs over BS2, shouldn't +(
> > > whatever whatever1...) run as BS? What does it mean to have MUST +() at
> > > the top level?
> > 
> > Your query is +(((X Y Z))^2).  In BQ.rewrite, any single-clause query
> > that hasn't had minNrShouldMatch set will return its single sub-query,
> > rewritten.  So your query should (recursively) rewrite to a simple
> > BooleanQuery (ie just OR'd terms), which is eligible for BS.
> > 
> > And we see BS in your hung stack trace ;)
> > 
> > >  it is a bit late here, I am going to bed ...
> > 
> > Good night!
> > 
> > Mike
> > 



Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev

> Are you sure when you ran the test you called
> setAllowDocsOutOfOrder(true)?

Right, just a second, this is static... we have two indices, something runs
first and sets it to false... ouch, I hate statics... they make you believe you
can set them during construction... traces come in a couple of minutes.
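
The pitfall described here can be reproduced without Lucene at all. The following toy sketch assumes a process-wide static switch, modelled loosely on the setAllowDocsOutOfOrder call named in this thread. ToyQuery and ToyIndex are hypothetical names invented for illustration, not Lucene classes.

```java
final class StaticFlagSketch {
    /** Toy stand-in for a query class that exposes a static (process-wide) switch. */
    static final class ToyQuery {
        private static boolean allowDocsOutOfOrder = false;

        static void setAllowDocsOutOfOrder(boolean allow) {
            allowDocsOutOfOrder = allow;
        }

        static boolean getAllowDocsOutOfOrder() {
            return allowDocsOutOfOrder;
        }
    }

    /** Hypothetical per-index component that "configures" the switch in its constructor. */
    static final class ToyIndex {
        final String name;

        ToyIndex(String name, boolean wantOutOfOrder) {
            this.name = name;
            // Looks like per-instance configuration, but it overwrites shared state.
            ToyQuery.setAllowDocsOutOfOrder(wantOutOfOrder);
        }

        boolean scoresOutOfOrder() {
            return ToyQuery.getAllowDocsOutOfOrder();
        }
    }

    public static void main(String[] args) {
        ToyIndex first = new ToyIndex("first", true);
        ToyIndex second = new ToyIndex("second", false);
        // The first index asked for out-of-order scoring, but the last writer wins.
        System.out.println(first.name + " out of order: " + first.scoresOutOfOrder());
        System.out.println(second.name + " out of order: " + second.scoresOutOfOrder());
    }
}
```

Whichever component runs last decides the value for every index in the JVM, which would be consistent with a test silently running with the flag set to false.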




 




- Original Message 
> From: Michael McCandless 
> To: java-user@lucene.apache.org
> Sent: Thursday, 16 July, 2009 0:50:28
> Subject: Re: speed of BooleanQueries on 2.9
> 
> I think that query should rewrite to a BQ that would in turn use BS.
> Are you sure when you ran the test you called
> setAllowDocsOutOfOrder(true)?
> 
> (How else can we explain that BS is in the "hung" stack trace, and
> that setAllowDocsOutOfOrder alters the behavior?)
> 
> Mike
> 
> On Wed, Jul 15, 2009 at 6:03 PM, eks dev wrote:
> >
> > Mike's instrumented version is not printing anything on this query
> > and it works fine with trunk version
> >
> > BS2 gets executed (top Query Required... +((( )))?
> >
> > again the Query:
> > Query: +(((NAME:maria NAME:marae^0.25171682 NAME:marai^0.2365632
> > NAME:marao^0.2365632 NAME:marau^0.2365632 NAME:marea^0.2834352
> > NAME:marei^0.25171682 NAME:mareo^0.25171682 NAME:mareu^0.25171682
> > NAME:marie^0.28577283 NAME:marieh^0.2451648 NAME:mariha^0.2583552
> > NAME:mariu^0.27189124 NAME:marja^0.2834352 NAME:marje^0.2673408
> > NAME:marji^0.25171682 NAME:marjo^0.25171682 NAME:marju^0.25171682
> > NAME:marla^0.2673408 NAME:marle^0.25171682 NAME:marli^0.2365632
> > NAME:marlo^0.2365632 NAME:maroa^0.2673408 NAME:maroe^0.25171682
> > NAME:maroi^0.2365632 NAME:marou^0.2365632 NAME:marua^0.2673408
> > NAME:marue^0.25171682 NAME:marui^0.2365632 NAME:maruo^0.2365632
> > NAME:marye^0.2673408 NAME:maryi^0.25171682 NAME:maryo^0.25171682
> > NAME:meria^0.2787888 NAME:miria^0.25835523 NAME:moria^0.25835523
> > NAME:muria^0.25835523 NAME:naria^0.27648002 NAME:narie^0.25392002
> > NAME:neria^0.25392002) (NAME:piekarski NAME:bekarski^0.1922
> > NAME:beugarski^0.20281483 NAME:blacharski^0.1922
> > NAME:lekarski^0.1922 NAME:pecarski^0.21294187 NAME:peikarski^0.27648002
> > NAME:pekarska^0.20172001 NAME:pekarski^0.22446752 NAME:pekarskj^0.21294187
> > NAME:pekarsky^0.21294187 NAME:pickarske^0.21168004 NAME:pickarski^0.22073482
> > NAME:piekalski^0.23941332 NAME:piekanski^0.23941332 NAME:piekaraka^0.2255
> > NAME:piekarsci^0.29205337 NAME:piekarska^0.28421336
> > NAME:piekarskie^0.25392002 NAME:piekarsky^0.29205337
> > NAME:piekarzcyk^0.23232001 NAME:piekarzki^0.29205337 NAME:piekaski^0.24843001
> > NAME:piekavska^0.2255 NAME:piekorski^0.28421336 NAME:pielarski^0.22997928
> > NAME:pierarski^0.22997928 NAME:pierkarski^0.24661335
> > NAME:piesarski^0.22997928 NAME:pietarski^0.22997928
> > NAME:pietkarski^0.24661335 NAME:pikarski^0.23232001 NAME:piowarski^0.20281483
> > NAME:pirkarski^0.22073482 NAME:plocharski^0.21168004 NAME:pokarski^0.20172001
> > NAME:polikarski^0.20172001 NAME:pukarski^0.20172001 NAME:pyekarska^0.26508
> > NAME:siekarski^0.20281483))^2.0)
> >
> >
> >
> >
> >
> >
> >
> >
> > - Original Message 
> >> From: eks dev 
> >> To: java-user@lucene.apache.org; yo...@lucidimagination.com
> >> Sent: Wednesday, 15 July, 2009 23:57:22
> >> Subject: Re: speed of BooleanQueries on 2.9
> >>
> >>
> >>
> >> it works with current trunk, 10 Minutes ago built?!
> >>
> >> if I put lucene from yesterday, the same symptoms like yesterday...
> >>
> >> Mike's instrumented version is running ...
> >>
> >>
> >>
> >> - Original Message 
> >> > From: Yonik Seeley
> >> > To: java-user@lucene.apache.org
> >> > Sent: Wednesday, 15 July, 2009 23:34:29
> >> > Subject: Re: speed of BooleanQueries on 2.9
> >> >
> >> > On Wed, Jul 15, 2009 at 4:37 PM, Uwe Schindler wrote:
> >> > > And the fix only affects custom DocIdSetIterators.
> >> >
> >> > And custom Queries (via Scorer) since Scorer inherits from DISI.
> >> > But as Mike says, it shouldn't be the issue behind this thread.
> >> >
> >> > -Yonik
> >> > http://www.lucidimagination.com
> >> >





---

RE: speed of BooleanQueries on 2.9

2009-07-15 Thread Uwe Schindler
Same here, too late! Good night!
And the blood glucose level is very low, too - very bad for such problems...

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Thursday, July 16, 2009 12:59 AM
> To: java-user@lucene.apache.org
> Subject: Re: speed of BooleanQueries on 2.9
> 
> On Wed, Jul 15, 2009 at 6:52 PM, eks dev wrote:
> 
> > Also not really expected, but this query runs over BS2, shouldn't +(
> > whatever whatever1...) run as BS? What does it mean to have MUST +() at
> > the top level?
> 
> Your query is +(((X Y Z))^2).  In BQ.rewrite, any single-clause query
> that hasn't had minNrShouldMatch set will return its single sub-query,
> rewritten.  So your query should (recursively) rewrite to a simple
> BooleanQuery (ie just OR'd terms), which is eligible for BS.
> 
> And we see BS in your hung stack trace ;)
> 
> >  it is a bit late here, I am going to bed ...
> 
> Good night!
> 
> Mike
> 



Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
On Wed, Jul 15, 2009 at 6:52 PM, eks dev wrote:

> Also not really expected, but this query runs over BS2, shouldn't +(
> whatever whatever1...) run as BS? What does it mean to have MUST +() at the
> top level?

Your query is +(((X Y Z))^2).  In BQ.rewrite, any single-clause query
that hasn't had minNrShouldMatch set will return its single sub-query,
rewritten.  So your query should (recursively) rewrite to a simple
BooleanQuery (ie just OR'd terms), which is eligible for BS.

And we see BS in your hung stack trace ;)

>  it is a bit late here, I am going to bed ...

Good night!

Mike
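
The rewrite rule described in this message can be illustrated with a toy model. This is only a sketch of the idea, not Lucene's BooleanQuery code: Node, term, bool and rewrite are hypothetical names, clause occurrence (MUST/SHOULD) is ignored, and multiplying the boosts on collapse is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RewriteSketch {
    /** Toy query node: a leaf when term is non-null, otherwise a boolean query. */
    static final class Node {
        final String term;         // non-null only for leaf terms
        final List<Node> clauses;  // sub-queries of a boolean node
        final int minShouldMatch;  // 0 means "not set"
        final float boost;

        Node(String term, List<Node> clauses, int minShouldMatch, float boost) {
            this.term = term;
            this.clauses = clauses;
            this.minShouldMatch = minShouldMatch;
            this.boost = boost;
        }

        static Node term(String text) {
            return new Node(text, new ArrayList<Node>(), 0, 1.0f);
        }

        static Node bool(float boost, Node... clauses) {
            return new Node(null, Arrays.asList(clauses), 0, boost);
        }

        boolean isTerm() {
            return term != null;
        }
    }

    /** Collapses single-clause boolean nodes that have no minShouldMatch set. */
    static Node rewrite(Node q) {
        if (q.isTerm()) {
            return q;
        }
        if (q.clauses.size() == 1 && q.minShouldMatch == 0) {
            Node inner = rewrite(q.clauses.get(0));
            // Return the rewritten sub-query, carrying the wrapper's boost along.
            return new Node(inner.term, inner.clauses, inner.minShouldMatch,
                            inner.boost * q.boost);
        }
        List<Node> rewritten = new ArrayList<Node>();
        for (Node clause : q.clauses) {
            rewritten.add(rewrite(clause));
        }
        return new Node(null, rewritten, q.minShouldMatch, q.boost);
    }

    public static void main(String[] args) {
        // Models +(((X Y Z))^2): two single-clause wrappers around (X Y Z).
        Node xyz = Node.bool(1.0f, Node.term("X"), Node.term("Y"), Node.term("Z"));
        Node query = Node.bool(1.0f, Node.bool(2.0f, xyz));
        Node flat = rewrite(query);
        System.out.println(flat.clauses.size() + " clauses, boost " + flat.boost);
    }
}
```

In this model the two wrapper levels disappear and what remains is a flat three-term boolean node, which corresponds to the "simple BooleanQuery" that would be eligible for BS.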




Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev

I just do not see how...

Also not really expected, but this query runs over BS2. Shouldn't +(whatever
whatever1...) run as BS? What does it mean to have MUST +() at the top level?

It is a bit late here, I am going to bed...

Thanks a lot to all involved! 
Eks



- Original Message 
> From: Uwe Schindler 
> To: java-user@lucene.apache.org; yo...@lucidimagination.com
> Sent: Thursday, 16 July, 2009 0:35:25
> Subject: RE: speed of BooleanQueries on 2.9
> 
> There is also this one: https://issues.apache.org/jira/browse/LUCENE-1744
> 
> Maybe this fixed this for Eks?
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
> > -Original Message-
> > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
> > Seeley
> > Sent: Thursday, July 16, 2009 12:06 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: speed of BooleanQueries on 2.9
> > 
> > On Wed, Jul 15, 2009 at 5:57 PM, eks dev wrote:
> > > it works with current trunk, 10 Minutes ago built?!
> > 
> > Hmmm, OK, maybe it was the DISI bug.
> > Do we have any Scorers in Lucene that forgot to implement advance()
> > and hence got the slow default version???
> > Not sure how to ask the IDE for that info...
> > 
> > -Yonik
> > http://www.lucidimagination.com
> > 
> > 
> > 
> > 
> > >
> > > if I put lucene from yesterday, the same symptoms like yesterday...
> > >
> > > Mike's instrumented version is running ...
> > >
> > >
> > >
> > > - Original Message 
> > >> From: Yonik Seeley 
> > >> To: java-user@lucene.apache.org
> > >> Sent: Wednesday, 15 July, 2009 23:34:29
> > >> Subject: Re: speed of BooleanQueries on 2.9
> > >>
> > >> On Wed, Jul 15, 2009 at 4:37 PM, Uwe Schindler wrote:
> > >> > And the fix only affects custom DocIdSetIterators.
> > >>
> > >> And custom Queries (via Scorer) since Scorer inherits from DISI.
> > >> But as Mike says, it shouldn't be the issue behind this thread.
> > >>
> > >> -Yonik
> > >> http://www.lucidimagination.com
> > >>



Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
I think that query should rewrite to a BQ that would in turn use BS.
Are you sure when you ran the test you called
setAllowDocsOutOfOrder(true)?

(How else can we explain that BS is in the "hung" stack trace, and
that setAllowDocsOutOfOrder alters the behavior?)

Mike

On Wed, Jul 15, 2009 at 6:03 PM, eks dev wrote:
>
> Mike's instrumented version is not printing anything on this query
> and it works fine with trunk version
>
> BS2 gets executed (top Query Required... +((( )))?
>
> again the Query:
> Query: +(((NAME:maria NAME:marae^0.25171682 NAME:marai^0.2365632 
> NAME:marao^0.2365632 NAME:marau^0.2365632 NAME:marea^0.2834352 
> NAME:marei^0.25171682 NAME:mareo^0.25171682 NAME:mareu^0.25171682 
> NAME:marie^0.28577283 NAME:marieh^0.2451648 NAME:mariha^0.2583552 
> NAME:mariu^0.27189124 NAME:marja^0.2834352 NAME:marje^0.2673408 
> NAME:marji^0.25171682 NAME:marjo^0.25171682 NAME:marju^0.25171682 
> NAME:marla^0.2673408 NAME:marle^0.25171682 NAME:marli^0.2365632 
> NAME:marlo^0.2365632 NAME:maroa^0.2673408 NAME:maroe^0.25171682 
> NAME:maroi^0.2365632 NAME:marou^0.2365632 NAME:marua^0.2673408 
> NAME:marue^0.25171682 NAME:marui^0.2365632 NAME:maruo^0.2365632 
> NAME:marye^0.2673408 NAME:maryi^0.25171682 NAME:maryo^0.25171682 
> NAME:meria^0.2787888 NAME:miria^0.25835523 NAME:moria^0.25835523 
> NAME:muria^0.25835523 NAME:naria^0.27648002 NAME:narie^0.25392002 
> NAME:neria^0.25392002) (NAME:piekarski NAME:bekarski^0.1922 
> NAME:beugarski^0.20281483 NAME:blacharski^0.1922
>  NAME:lekarski^0.1922 NAME:pecarski^0.21294187 NAME:peikarski^0.27648002 
> NAME:pekarska^0.20172001 NAME:pekarski^0.22446752 NAME:pekarskj^0.21294187 
> NAME:pekarsky^0.21294187 NAME:pickarske^0.21168004 NAME:pickarski^0.22073482 
> NAME:piekalski^0.23941332 NAME:piekanski^0.23941332 NAME:piekaraka^0.2255 
> NAME:piekarsci^0.29205337 NAME:piekarska^0.28421336 
> NAME:piekarskie^0.25392002 NAME:piekarsky^0.29205337 
> NAME:piekarzcyk^0.23232001 NAME:piekarzki^0.29205337 NAME:piekaski^0.24843001 
> NAME:piekavska^0.2255 NAME:piekorski^0.28421336 NAME:pielarski^0.22997928 
> NAME:pierarski^0.22997928 NAME:pierkarski^0.24661335 
> NAME:piesarski^0.22997928 NAME:pietarski^0.22997928 
> NAME:pietkarski^0.24661335 NAME:pikarski^0.23232001 NAME:piowarski^0.20281483 
> NAME:pirkarski^0.22073482 NAME:plocharski^0.21168004 NAME:pokarski^0.20172001 
> NAME:polikarski^0.20172001 NAME:pukarski^0.20172001 NAME:pyekarska^0.26508 
> NAME:siekarski^0.20281483))^2.0)
>
>
>
>
>
>
>
>
> - Original Message 
>> From: eks dev 
>> To: java-user@lucene.apache.org; yo...@lucidimagination.com
>> Sent: Wednesday, 15 July, 2009 23:57:22
>> Subject: Re: speed of BooleanQueries on 2.9
>>
>>
>>
>> it works with current trunk, 10 Minutes ago built?!
>>
>> if I put lucene from yesterday, the same symptoms like yesterday...
>>
>> Mike's instrumented version is running ...
>>
>>
>>
>> - Original Message 
>> > From: Yonik Seeley
>> > To: java-user@lucene.apache.org
>> > Sent: Wednesday, 15 July, 2009 23:34:29
>> > Subject: Re: speed of BooleanQueries on 2.9
>> >
>> > On Wed, Jul 15, 2009 at 4:37 PM, Uwe Schindler wrote:
>> > > And the fix only affects custom DocIdSetIterators.
>> >
>> > And custom Queries (via Scorer) since Scorer inherits from DISI.
>> > But as Mike says, it shouldn't be the issue behind this thread.
>> >
>> > -Yonik
>> > http://www.lucidimagination.com
>> >



RE: speed of BooleanQueries on 2.9

2009-07-15 Thread Uwe Schindler
There is also this one: https://issues.apache.org/jira/browse/LUCENE-1744

Maybe this fixed this for Eks?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
> Seeley
> Sent: Thursday, July 16, 2009 12:06 AM
> To: java-user@lucene.apache.org
> Subject: Re: speed of BooleanQueries on 2.9
> 
> On Wed, Jul 15, 2009 at 5:57 PM, eks dev wrote:
> > it works with current trunk, 10 Minutes ago built?!
> 
> Hmmm, OK, maybe it was the DISI bug.
> Do we have any Scorers in Lucene that forgot to implement advance()
> and hence got the slow default version???
> Not sure how to ask the IDE for that info...
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 
> 
> 
> >
> > if I put lucene from yesterday, the same symptoms like yesterday...
> >
> > Mike's instrumented version is running ...
> >
> >
> >
> > - Original Message 
> >> From: Yonik Seeley 
> >> To: java-user@lucene.apache.org
> >> Sent: Wednesday, 15 July, 2009 23:34:29
> >> Subject: Re: speed of BooleanQueries on 2.9
> >>
> >> On Wed, Jul 15, 2009 at 4:37 PM, Uwe Schindler wrote:
> >> > And the fix only affects custom DocIdSetIterators.
> >>
> >> And custom Queries (via Scorer) since Scorer inherits from DISI.
> >> But as Mike says, it shouldn't be the issue behind this thread.
> >>
> >> -Yonik
> >> http://www.lucidimagination.com
> >>



RE: speed of BooleanQueries on 2.9

2009-07-15 Thread Uwe Schindler
You can look in the Javadocs, which list all child classes. From there
you can click through them.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
> Seeley
> Sent: Thursday, July 16, 2009 12:06 AM
> To: java-user@lucene.apache.org
> Subject: Re: speed of BooleanQueries on 2.9
> 
> On Wed, Jul 15, 2009 at 5:57 PM, eks dev wrote:
> > it works with current trunk, 10 Minutes ago built?!
> 
> Hmmm, OK, maybe it was the DISI bug.
> Do we have any Scorers in Lucene that forgot to implement advance()
> and hence got the slow default version???
> Not sure how to ask the IDE for that info...
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 
> 
> 
> >
> > if I put lucene from yesterday, the same symptoms like yesterday...
> >
> > Mike's instrumented version is running ...
> >
> >
> >
> > - Original Message 
> >> From: Yonik Seeley 
> >> To: java-user@lucene.apache.org
> >> Sent: Wednesday, 15 July, 2009 23:34:29
> >> Subject: Re: speed of BooleanQueries on 2.9
> >>
> >> On Wed, Jul 15, 2009 at 4:37 PM, Uwe Schindler wrote:
> >> > And the fix only affects custom DocIdSetIterators.
> >>
> >> And custom Queries (via Scorer) since Scorer inherits from DISI.
> >> But as Mike says, it shouldn't be the issue behind this thread.
> >>
> >> -Yonik
> >> http://www.lucidimagination.com
> >>



Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Yonik Seeley
On Wed, Jul 15, 2009 at 5:57 PM, eks dev wrote:
> it works with current trunk, 10 Minutes ago built?!

Hmmm, OK, maybe it was the DISI bug.
Do we have any Scorers in Lucene that forgot to implement advance()
and hence got the slow default version???
Not sure how to ask the IDE for that info...

-Yonik
http://www.lucidimagination.com




>
> if I put lucene from yesterday, the same symptoms like yesterday...
>
> Mike's instrumented version is running ...
>
>
>
> - Original Message 
>> From: Yonik Seeley 
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, 15 July, 2009 23:34:29
>> Subject: Re: speed of BooleanQueries on 2.9
>>
>> On Wed, Jul 15, 2009 at 4:37 PM, Uwe Schindler wrote:
>> > And the fix only affects custom DocIdSetIterators.
>>
>> And custom Queries (via Scorer) since Scorer inherits from DISI.
>> But as Mike says, it shouldn't be the issue behind this thread.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>



Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev

Mike's instrumented version is not printing anything on this query
and it works fine with trunk version

BS2 gets executed (top Query Required... +((( )))? 

again the Query:
Query: +(((NAME:maria NAME:marae^0.25171682 NAME:marai^0.2365632 
NAME:marao^0.2365632 NAME:marau^0.2365632 NAME:marea^0.2834352 
NAME:marei^0.25171682 NAME:mareo^0.25171682 NAME:mareu^0.25171682 
NAME:marie^0.28577283 NAME:marieh^0.2451648 NAME:mariha^0.2583552 
NAME:mariu^0.27189124 NAME:marja^0.2834352 NAME:marje^0.2673408 
NAME:marji^0.25171682 NAME:marjo^0.25171682 NAME:marju^0.25171682 
NAME:marla^0.2673408 NAME:marle^0.25171682 NAME:marli^0.2365632 
NAME:marlo^0.2365632 NAME:maroa^0.2673408 NAME:maroe^0.25171682 
NAME:maroi^0.2365632 NAME:marou^0.2365632 NAME:marua^0.2673408 
NAME:marue^0.25171682 NAME:marui^0.2365632 NAME:maruo^0.2365632 
NAME:marye^0.2673408 NAME:maryi^0.25171682 NAME:maryo^0.25171682 
NAME:meria^0.2787888 NAME:miria^0.25835523 NAME:moria^0.25835523 
NAME:muria^0.25835523 NAME:naria^0.27648002 NAME:narie^0.25392002 
NAME:neria^0.25392002) (NAME:piekarski NAME:bekarski^0.1922 
NAME:beugarski^0.20281483 NAME:blacharski^0.1922
 NAME:lekarski^0.1922 NAME:pecarski^0.21294187 NAME:peikarski^0.27648002 
NAME:pekarska^0.20172001 NAME:pekarski^0.22446752 NAME:pekarskj^0.21294187 
NAME:pekarsky^0.21294187 NAME:pickarske^0.21168004 NAME:pickarski^0.22073482 
NAME:piekalski^0.23941332 NAME:piekanski^0.23941332 NAME:piekaraka^0.2255 
NAME:piekarsci^0.29205337 NAME:piekarska^0.28421336 NAME:piekarskie^0.25392002 
NAME:piekarsky^0.29205337 NAME:piekarzcyk^0.23232001 NAME:piekarzki^0.29205337 
NAME:piekaski^0.24843001 NAME:piekavska^0.2255 NAME:piekorski^0.28421336 
NAME:pielarski^0.22997928 NAME:pierarski^0.22997928 NAME:pierkarski^0.24661335 
NAME:piesarski^0.22997928 NAME:pietarski^0.22997928 NAME:pietkarski^0.24661335 
NAME:pikarski^0.23232001 NAME:piowarski^0.20281483 NAME:pirkarski^0.22073482 
NAME:plocharski^0.21168004 NAME:pokarski^0.20172001 NAME:polikarski^0.20172001 
NAME:pukarski^0.20172001 NAME:pyekarska^0.26508 NAME:siekarski^0.20281483))^2.0)








- Original Message 
> From: eks dev 
> To: java-user@lucene.apache.org; yo...@lucidimagination.com
> Sent: Wednesday, 15 July, 2009 23:57:22
> Subject: Re: speed of BooleanQueries on 2.9
> 
> 
> 
> it works with current trunk, 10 Minutes ago built?!
> 
> if I put lucene from yesterday, the same symptoms like yesterday...  
> 
> Mike's instrumented version is running ...
> 
> 
> 
> - Original Message 
> > From: Yonik Seeley 
> > To: java-user@lucene.apache.org
> > Sent: Wednesday, 15 July, 2009 23:34:29
> > Subject: Re: speed of BooleanQueries on 2.9
> > 
> > On Wed, Jul 15, 2009 at 4:37 PM, Uwe Schindler wrote:
> > > And the fix only affects custom DocIdSetIterators.
> > 
> > And custom Queries (via Scorer) since Scorer inherits from DISI.
> > But as Mike says, it shouldn't be the issue behind this thread.
> > 
> > -Yonik
> > http://www.lucidimagination.com
> > 



Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev


It works with current trunk, built 10 minutes ago?!

If I put in Lucene from yesterday, the same symptoms as yesterday...

Mike's instrumented version is running ...



- Original Message 
> From: Yonik Seeley 
> To: java-user@lucene.apache.org
> Sent: Wednesday, 15 July, 2009 23:34:29
> Subject: Re: speed of BooleanQueries on 2.9
> 
> On Wed, Jul 15, 2009 at 4:37 PM, Uwe Schindler wrote:
> > And the fix only affects custom DocIdSetIterators.
> 
> And custom Queries (via Scorer) since Scorer inherits from DISI.
> But as Mike says, it shouldn't be the issue behind this thread.
> 
> -Yonik
> http://www.lucidimagination.com
> 



Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Yonik Seeley
On Wed, Jul 15, 2009 at 4:37 PM, Uwe Schindler wrote:
> And the fix only affects custom DocIdSetIterators.

And custom Queries (via Scorer) since Scorer inherits from DISI.
But as Mike says, it shouldn't be the issue behind this thread.

-Yonik
http://www.lucidimagination.com
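
To make the inheritance point concrete, here is a toy sketch of how a subclass that only implements nextDoc() ends up with a linear advance(), while one that overrides it can skip. ToyDocIdIterator, ArrayIterator and SkippingIterator are hypothetical classes, not Lucene's DocIdSetIterator or Scorer. The default advance() follows the loop quoted elsewhere in this thread, and the binary search is just one possible faster override.

```java
final class AdvanceSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Toy iterator base class; advance() mirrors the linear default quoted in this thread. */
    abstract static class ToyDocIdIterator {
        abstract int nextDoc();

        int advance(int target) {
            int doc;
            while ((doc = nextDoc()) < target) {
                // step one document at a time until the target is reached
            }
            return doc;
        }
    }

    /** Only implements nextDoc(), so it inherits the linear advance(). */
    static class ArrayIterator extends ToyDocIdIterator {
        final int[] docs;      // sorted, distinct doc ids
        int pos = -1;          // index of the current doc, -1 before the first call
        int nextDocCalls = 0;  // instrumentation for the comparison in main()

        ArrayIterator(int[] sortedDocs) {
            this.docs = sortedDocs;
        }

        @Override
        int nextDoc() {
            nextDocCalls++;
            pos++;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    /** Overrides advance() with a binary search over the docs after the current one. */
    static final class SkippingIterator extends ArrayIterator {
        SkippingIterator(int[] sortedDocs) {
            super(sortedDocs);
        }

        @Override
        int advance(int target) {
            int lo = pos + 1;
            int hi = docs.length;
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (docs[mid] < target) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            pos = lo;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    public static void main(String[] args) {
        int[] docs = new int[1000];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = 2 * i; // even doc ids 0, 2, 4, ...
        }
        ArrayIterator slow = new ArrayIterator(docs);
        SkippingIterator fast = new SkippingIterator(docs);
        System.out.println(slow.advance(1001) + " after " + slow.nextDocCalls + " nextDoc() calls");
        System.out.println(fast.advance(1001) + " after " + fast.nextDocCalls + " nextDoc() calls");
    }
}
```

Both iterators land on the same document; the difference is only in how many documents are visited on the way, which is why a forgotten override shows up as a slowdown rather than as a wrong result.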




Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev

I do, but not on this Query... the same happens when I use Luke





- Original Message 
> From: Uwe Schindler 
> To: java-user@lucene.apache.org
> Sent: Wednesday, 15 July, 2009 22:37:04
> Subject: RE: speed of BooleanQueries on 2.9
> 
> And the fix only affects custom DocIdSetIterators. The ones from Lucene core
> all implement the new API and do it more effectively than the example code :-)
> 
> Or does Eks Dev use custom DocIdSetIterators?
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
> > -Original Message-
> > From: Michael McCandless [mailto:luc...@mikemccandless.com]
> > Sent: Wednesday, July 15, 2009 10:25 PM
> > To: java-user@lucene.apache.org; yo...@lucidimagination.com
> > Subject: Re: speed of BooleanQueries on 2.9
> > 
> > I just committed Uwe's fix for that (thanks Uwe!), but I don't think
> > it's causing eks' slowdown because eks' case is a straight OR query,
> > which doesn't use advance.
> > 
> > Mike
> > 
> > On Wed, Jul 15, 2009 at 3:23 PM, Yonik Seeley wrote:
> > > Could this perhaps have anything to do with the changes to
> > > DocIdSetIterator?
> > > Glancing at the default implementation of advance makes me wince a bit:
> > >
> > >   public int advance(int target) throws IOException {
> > >     while (nextDoc() < target) {}
> > >     return doc;
> > >   }
> > >
> > > IMO, this is a back-compatibility anti-pattern.  It would be better to
> > > throw an exception than quietly slow down some of the users' queries by
> > > an order of magnitude.  Actually, I don't think I would count it as
> > > back compatible because of that.
> > >
> > > -Yonik
> > > http://www.lucidimagination.com
> > >
> > >
> > >
> > > On Wed, Jul 15, 2009 at 2:54 PM, Michael McCandless wrote:
> > >> On Wed, Jul 15, 2009 at 2:30 PM, eks dev wrote:
> > >>>
> > >>>> Weird.  Have you run CheckIndex?
> > >>> nope, I guess it brings nothing: two times built index; Bug provoked
> > >>> by changing one parameter that controls only search caused it => no
> > >>> corrupt index?
> > >>>
> > >>> You think we should give it a try? Hell, why not :)
> > >>
> > >> Yah it's quite a long shot but if it is corrupt, we'll be kicking
> > >> ourselves about 30 emails from now...
> > >>
> > >>> What do you mean by "Can you do a binary search to locate the term(s)
> > >>> that's causing it?"
> > >>>
> > >>> I know exactly which term combination causes it, last Query.toString()
> > >>> I have sent. If I simplify Query by dropping one term with its
> > >>> expansions, it runs fine... or if I replace any of these terms it works
> > >>> fine. We tried with higher freq. terms, lower... everything fine... bizarre
> > >>
> > >> Right I meant try to whittle down the query that tickles the infinite
> > >> loop.  Sounds like any whittling causes the issue to scurry away.
> > >>
> > >> If I make a patch that adds verbosity to what BS is doing, can you run
> > >> it & post the output?
> > >>
> > >> Mike
> > >>
> > >> -
> > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >>
> > >>
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> > 
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org








Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
OK let's start w/ the attached patch?  It'll produce a ridiculous
amount of output (one line for each doc collected).  If that's a
problem you can comment out the "BS collect" line.

Mike

On Wed, Jul 15, 2009 at 4:27 PM, Michael
McCandless wrote:
> OK I'll instrument.
>
> Mike
>
> On Wed, Jul 15, 2009 at 3:28 PM, eks dev wrote:
>>
>>> If I make a patch that adds verbosity to what BS is doing, can you run
>>> it & post the output?
>>
>> can do, it can take some time
>>
>>
>>
>> - Original Message 
>>> From: Michael McCandless 
>>> To: java-user@lucene.apache.org
>>> Sent: Wednesday, 15 July, 2009 20:54:25
>>> Subject: Re: speed of BooleanQueries on 2.9
>>>
>>> On Wed, Jul 15, 2009 at 2:30 PM, eks dev wrote:
>>> >
>>> >> Weird.  Have you run CheckIndex?
>>> > nope, I guess it brings nothing: two times built index; Bug provoked by
>>> changing one parameter  that controls only search caused it => no corrupt 
>>> index?
>>> >
>>> > You think we should give it a try? Hell, why not :)
>>>
>>> Yah it's quite a long shot but if it is corrupt, we'll be kicking
>>> ourselves about 30 emails from now...
>>>
>>> > What do you mean by "Can you do a binary search to locate the term(s) 
>>> > that's
>>> causing it?"
>>> >
>>> > I know exactly which term combination causes it, last Query.toString() I 
>>> > have
>>> sent if I simplify Query by dropping one term with its expansions, it 
>>> runs
>>> fine... or if I replace any of these terms it works fine,We tried with higer
>>> freq. terms, lower... everything fine... bizzar
>>>
>>> Right I meant try to whittle down the query that tickles the infinite
>>> loop.  Sounds like any whittling causes the issue to scurry away.
>>>
>>> If I make a patch that adds verbosity to what BS is doing, can you run
>>> it & post the output?
>>>
>>> Mike
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>



RE: speed of BooleanQueries on 2.9

2009-07-15 Thread Uwe Schindler
And the fix only affects custom DocIdSetIterators. The ones from Lucene core
all implement the new API and do it more efficiently than the example code :-)

Or does Eks Dev use custom DocIdSetIterators?

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Wednesday, July 15, 2009 10:25 PM
> To: java-user@lucene.apache.org; yo...@lucidimagination.com
> Subject: Re: speed of BooleanQueries on 2.9
> 
> I just committed Uwe's fix for that (thanks Uwe!), but I don't think
> it's causing eks' slowdown because eks' case is a straight OR query,
> which doesn't use advance.
> 
> Mike
> 
> On Wed, Jul 15, 2009 at 3:23 PM, Yonik Seeley
> wrote:
> > Could this perhaps have anything to do with the changes to
> DocIdSetIterator?
> > Glancing at the default implementation of advance makes me wince a bit:
> >
> >  public int advance(int target) throws IOException {
> >    while (nextDoc() < target) {}
> >    return doc;
> >  }
> >
> > IMO, this is a back-compatibility anti-pattern.  It would be better to
> > throw an exception then quietly slow down some of the users queries by
> > an order of magnitude.  Actually, I don't think I would count it as
> > back compatible because of that.
> >
> > -Yonik
> > http://www.lucidimagination.com
> >
> >
> >
> > On Wed, Jul 15, 2009 at 2:54 PM, Michael
> > McCandless wrote:
> >> On Wed, Jul 15, 2009 at 2:30 PM, eks dev wrote:
> >>>
> >>>> Weird.  Have you run CheckIndex?
> >>> nope, I guess it brings nothing: two times built index; Bug provoked
> by changing one parameter  that controls only search caused it => no
> corrupt index?
> >>>
> >>> You think we should give it a try? Hell, why not :)
> >>
> >> Yah it's quite a long shot but if it is corrupt, we'll be kicking
> >> ourselves about 30 emails from now...
> >>
> >>> What do you mean by "Can you do a binary search to locate the term(s)
> that's causing it?"
> >>>
> >>> I know exactly which term combination causes it, last Query.toString()
> I have sent if I simplify Query by dropping one term with its
> expansions, it runs fine... or if I replace any of these terms it works
> fine,We tried with higer freq. terms, lower... everything fine... bizzar
> >>
> >> Right I meant try to whittle down the query that tickles the infinite
> >> loop.  Sounds like any whittling causes the issue to scurry away.
> >>
> >> If I make a patch that adds verbosity to what BS is doing, can you run
> >> it & post the output?
> >>
> >> Mike
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org






Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
OK I'll instrument.

Mike

On Wed, Jul 15, 2009 at 3:28 PM, eks dev wrote:
>
>> If I make a patch that adds verbosity to what BS is doing, can you run
>> it & post the output?
>
> can do, it can take some time
>
>
>
> - Original Message 
>> From: Michael McCandless 
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, 15 July, 2009 20:54:25
>> Subject: Re: speed of BooleanQueries on 2.9
>>
>> On Wed, Jul 15, 2009 at 2:30 PM, eks dev wrote:
>> >
>> >> Weird.  Have you run CheckIndex?
>> > nope, I guess it brings nothing: two times built index; Bug provoked by
>> changing one parameter  that controls only search caused it => no corrupt 
>> index?
>> >
>> > You think we should give it a try? Hell, why not :)
>>
>> Yah it's quite a long shot but if it is corrupt, we'll be kicking
>> ourselves about 30 emails from now...
>>
>> > What do you mean by "Can you do a binary search to locate the term(s) 
>> > that's
>> causing it?"
>> >
>> > I know exactly which term combination causes it, last Query.toString() I 
>> > have
>> sent if I simplify Query by dropping one term with its expansions, it 
>> runs
>> fine... or if I replace any of these terms it works fine,We tried with higer
>> freq. terms, lower... everything fine... bizzar
>>
>> Right I meant try to whittle down the query that tickles the infinite
>> loop.  Sounds like any whittling causes the issue to scurry away.
>>
>> If I make a patch that adds verbosity to what BS is doing, can you run
>> it & post the output?
>>
>> Mike
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>




Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
I just committed Uwe's fix for that (thanks Uwe!), but I don't think
it's causing eks' slowdown because eks' case is a straight OR query,
which doesn't use advance.

Mike

On Wed, Jul 15, 2009 at 3:23 PM, Yonik Seeley wrote:
> Could this perhaps have anything to do with the changes to DocIdSetIterator?
> Glancing at the default implementation of advance makes me wince a bit:
>
>  public int advance(int target) throws IOException {
>    while (nextDoc() < target) {}
>    return doc;
>  }
>
> IMO, this is a back-compatibility anti-pattern.  It would be better to
> throw an exception then quietly slow down some of the users queries by
> an order of magnitude.  Actually, I don't think I would count it as
> back compatible because of that.
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> On Wed, Jul 15, 2009 at 2:54 PM, Michael
> McCandless wrote:
>> On Wed, Jul 15, 2009 at 2:30 PM, eks dev wrote:
>>>
>>>> Weird.  Have you run CheckIndex?
>>> nope, I guess it brings nothing: two times built index; Bug provoked by 
>>> changing one parameter  that controls only search caused it => no corrupt 
>>> index?
>>>
>>> You think we should give it a try? Hell, why not :)
>>
>> Yah it's quite a long shot but if it is corrupt, we'll be kicking
>> ourselves about 30 emails from now...
>>
>>> What do you mean by "Can you do a binary search to locate the term(s) 
>>> that's causing it?"
>>>
>>> I know exactly which term combination causes it, last Query.toString() I 
>>> have sent if I simplify Query by dropping one term with its expansions, 
>>> it runs fine... or if I replace any of these terms it works fine,We tried 
>>> with higer freq. terms, lower... everything fine... bizzar
>>
>> Right I meant try to whittle down the query that tickles the infinite
>> loop.  Sounds like any whittling causes the issue to scurry away.
>>
>> If I make a patch that adds verbosity to what BS is doing, can you run
>> it & post the output?
>>
>> Mike
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>




RE: speed of BooleanQueries on 2.9

2009-07-15 Thread Uwe Schindler
To implement the backwards-compatibility pattern correctly, the default
advance() should call skipTo():

  public int advance(int target) throws IOException {
    return doc = skipTo(target) ? doc() : NO_MORE_DOCS;
  }

This is how nextDoc is implemented. New iterators that override advance()
work correctly, and older ones that only implement skipTo() would also work
with this method.

In my opinion this should be changed; the current pattern is wrong.
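
A minimal, self-contained sketch of the delegation described above. The class
names (SketchDocIdSetIterator, LegacyArrayIterator, AdvanceDelegationDemo) are
hypothetical stand-ins and not Lucene classes, and the checked IOException from
the real signatures is left out to keep it short. It only illustrates how a
default advance() that forwards to skipTo() lets an old-style iterator keep
using its own skipping logic:

```java
/** Simplified stand-in for the iterator base class (hypothetical, not the Lucene class). */
abstract class SketchDocIdSetIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    protected int doc = -1;

    // Old-style methods that existing subclasses already implement.
    abstract boolean next();
    abstract boolean skipTo(int target);
    abstract int doc();

    // New-style defaults: forward to the old-style methods instead of looping.
    int nextDoc() {
        return doc = next() ? doc() : NO_MORE_DOCS;
    }

    int advance(int target) {
        return doc = skipTo(target) ? doc() : NO_MORE_DOCS;
    }
}

/** Old-style iterator over a sorted doc-id array; it knows how to skip efficiently. */
final class LegacyArrayIterator extends SketchDocIdSetIterator {
    private final int[] docs;   // doc ids, sorted ascending
    private int pos = -1;       // index of the current doc, -1 before the first call
    int nextCalls = 0;          // counts how often next() was invoked

    LegacyArrayIterator(int[] docs) {
        this.docs = docs;
    }

    @Override
    boolean next() {
        nextCalls++;
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    @Override
    boolean skipTo(int target) {
        // Binary search over the not-yet-visited docs for the first id >= target.
        int lo = Math.min(pos + 1, docs.length);
        int hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        pos = lo;
        return pos < docs.length;
    }

    @Override
    int doc() {
        return docs[pos];
    }
}

class AdvanceDelegationDemo {
    public static void main(String[] args) {
        LegacyArrayIterator it = new LegacyArrayIterator(new int[] {2, 5, 9, 14, 30});
        // advance() reaches the legacy skipTo(), so next() is never called here.
        System.out.println("advance(6) -> " + it.advance(6));
        System.out.println("next() calls: " + it.nextCalls);
    }
}
```

With this shape, an iterator written against the old API pays only for its own
skipTo(), and one written against the new API simply overrides advance().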

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
> Seeley
> Sent: Wednesday, July 15, 2009 9:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: speed of BooleanQueries on 2.9
> 
> Could this perhaps have anything to do with the changes to
> DocIdSetIterator?
> Glancing at the default implementation of advance makes me wince a bit:
> 
>   public int advance(int target) throws IOException {
>     while (nextDoc() < target) {}
>     return doc;
>   }
> 
> IMO, this is a back-compatibility anti-pattern.  It would be better to
> throw an exception then quietly slow down some of the users queries by
> an order of magnitude.  Actually, I don't think I would count it as
> back compatible because of that.
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 
> 
> On Wed, Jul 15, 2009 at 2:54 PM, Michael
> McCandless wrote:
> > On Wed, Jul 15, 2009 at 2:30 PM, eks dev wrote:
> >>
> >>> Weird.  Have you run CheckIndex?
> >> nope, I guess it brings nothing: two times built index; Bug provoked by
> changing one parameter  that controls only search caused it => no corrupt
> index?
> >>
> >> You think we should give it a try? Hell, why not :)
> >
> > Yah it's quite a long shot but if it is corrupt, we'll be kicking
> > ourselves about 30 emails from now...
> >
> >> What do you mean by "Can you do a binary search to locate the term(s)
> that's causing it?"
> >>
> >> I know exactly which term combination causes it, last Query.toString()
> I have sent if I simplify Query by dropping one term with its
> expansions, it runs fine... or if I replace any of these terms it works
> fine,We tried with higer freq. terms, lower... everything fine... bizzar
> >
> > Right I meant try to whittle down the query that tickles the infinite
> > loop.  Sounds like any whittling causes the issue to scurry away.
> >
> > If I make a patch that adds verbosity to what BS is doing, can you run
> > it & post the output?
> >
> > Mike
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org






Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev

> If I make a patch that adds verbosity to what BS is doing, can you run
> it & post the output?

can do, it can take some time



- Original Message 
> From: Michael McCandless 
> To: java-user@lucene.apache.org
> Sent: Wednesday, 15 July, 2009 20:54:25
> Subject: Re: speed of BooleanQueries on 2.9
> 
> On Wed, Jul 15, 2009 at 2:30 PM, eks dev wrote:
> >
> >> Weird.  Have you run CheckIndex?
> > nope, I guess it brings nothing: two times built index; Bug provoked by 
> changing one parameter  that controls only search caused it => no corrupt 
> index?
> >
> > You think we should give it a try? Hell, why not :)
> 
> Yah it's quite a long shot but if it is corrupt, we'll be kicking
> ourselves about 30 emails from now...
> 
> > What do you mean by "Can you do a binary search to locate the term(s) 
> > that's 
> causing it?"
> >
> > I know exactly which term combination causes it, last Query.toString() I 
> > have 
> sent if I simplify Query by dropping one term with its expansions, it 
> runs 
> fine... or if I replace any of these terms it works fine,We tried with higer 
> freq. terms, lower... everything fine... bizzar
> 
> Right I meant try to whittle down the query that tickles the infinite
> loop.  Sounds like any whittling causes the issue to scurry away.
> 
> If I make a patch that adds verbosity to what BS is doing, can you run
> it & post the output?
> 
> Mike
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org








Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Yonik Seeley
Could this perhaps have anything to do with the changes to DocIdSetIterator?
Glancing at the default implementation of advance makes me wince a bit:

  public int advance(int target) throws IOException {
    while (nextDoc() < target) {}
    return doc;
  }

IMO, this is a back-compatibility anti-pattern.  It would be better to
throw an exception than quietly slow down some of the users' queries by
an order of magnitude.  Actually, I don't think I would count it as
back compatible because of that.
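
To put a rough number on that concern, here is a small sketch that counts the
work done by a linear advance of the shape quoted above versus one that skips.
CountingIterator, advanceLinear and advanceBinary are hypothetical names for
illustration only; this is not Lucene code, and real postings are not a plain
int array:

```java
/** Hypothetical sorted-array iterator that counts how many steps advance() takes. */
class CountingIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;   // doc ids, sorted ascending
    private int pos = -1;       // index of the current doc
    int doc = -1;               // current doc id
    long steps = 0;             // work counter: one per doc visited or probe made

    CountingIterator(int[] docs) {
        this.docs = docs;
    }

    int nextDoc() {
        steps++;
        if (pos < docs.length) {
            pos++;
        }
        return doc = pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    /** Linear default: visits every doc between the current one and the target. */
    int advanceLinear(int target) {
        while (nextDoc() < target) {
            // intentionally empty, all the work happens in nextDoc()
        }
        return doc;
    }

    /** Skipping variant: binary search over the remaining docs. */
    int advanceBinary(int target) {
        int lo = Math.min(pos + 1, docs.length);
        int hi = docs.length;
        while (lo < hi) {
            steps++;
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        pos = lo;
        return doc = pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
}

class AdvanceCostDemo {
    public static void main(String[] args) {
        int[] docs = new int[1_000_000];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = 2 * i;    // every even doc id
        }
        CountingIterator linear = new CountingIterator(docs);
        CountingIterator binary = new CountingIterator(docs);
        linear.advanceLinear(1_500_000);
        binary.advanceBinary(1_500_000);
        System.out.println("linear steps: " + linear.steps);
        System.out.println("binary steps: " + binary.steps);
    }
}
```

Both variants land on the same doc; only the amount of work differs, which is
why a silent fall-back to the linear loop can be hard to notice.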

-Yonik
http://www.lucidimagination.com



On Wed, Jul 15, 2009 at 2:54 PM, Michael
McCandless wrote:
> On Wed, Jul 15, 2009 at 2:30 PM, eks dev wrote:
>>
>>> Weird.  Have you run CheckIndex?
>> nope, I guess it brings nothing: two times built index; Bug provoked by 
>> changing one parameter  that controls only search caused it => no corrupt 
>> index?
>>
>> You think we should give it a try? Hell, why not :)
>
> Yah it's quite a long shot but if it is corrupt, we'll be kicking
> ourselves about 30 emails from now...
>
>> What do you mean by "Can you do a binary search to locate the term(s) that's 
>> causing it?"
>>
>> I know exactly which term combination causes it, last Query.toString() I 
>> have sent if I simplify Query by dropping one term with its expansions, 
>> it runs fine... or if I replace any of these terms it works fine,We tried 
>> with higer freq. terms, lower... everything fine... bizzar
>
> Right I meant try to whittle down the query that tickles the infinite
> loop.  Sounds like any whittling causes the issue to scurry away.
>
> If I make a patch that adds verbosity to what BS is doing, can you run
> it & post the output?
>
> Mike
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>




Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
On Wed, Jul 15, 2009 at 2:30 PM, eks dev wrote:
>
>> Weird.  Have you run CheckIndex?
> nope, I guess it brings nothing: two times built index; Bug provoked by 
> changing one parameter  that controls only search caused it => no corrupt 
> index?
>
> You think we should give it a try? Hell, why not :)

Yah it's quite a long shot but if it is corrupt, we'll be kicking
ourselves about 30 emails from now...

> What do you mean by "Can you do a binary search to locate the term(s) that's 
> causing it?"
>
> I know exactly which term combination causes it, last Query.toString() I have 
> sent if I simplify Query by dropping one term with its expansions, it 
> runs fine... or if I replace any of these terms it works fine,We tried with 
> higer freq. terms, lower... everything fine... bizzar

Right I meant try to whittle down the query that tickles the infinite
loop.  Sounds like any whittling causes the issue to scurry away.

If I make a patch that adds verbosity to what BS is doing, can you run
it & post the output?

Mike




Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
Well, skipTo does in fact throw UOE.

And BS.next() does in fact work, which is interesting, but it will
next() through docs out-of-order, which BS2 won't like.  Does anyone
know of any cases where BS.next() is in fact used?

Mike

On Wed, Jul 15, 2009 at 2:15 PM, Paul Elschot wrote:
> As long as next(), skipTo(), doc() and score() on a Scorer work,
> the search will be done. I hope the results are correct in this
> case, but I'm not sure.
>
> Regards,
> Paul Elschot
>
> On Wednesday 15 July 2009 19:08:00 Michael McCandless wrote:
>> I don't think a toplevel BS2 is able to use BS as sub-scorers?  BS2
>> needs to do doc-at-once, for all sub-scorers, but BS can't do that.  I
>> think?
>>
>> Mike
>>
>> On Wed, Jul 15, 2009 at 12:10 PM, Paul Elschot wrote:
>> > On Wednesday 15 July 2009 17:16:23 Michael McCandless wrote:
>> >> So now I'm confused.  Since your query has required (+) clauses, the
>> >> setAllowDocsOutOfOrder should have no effect, on either 2.4 or trunk.
>> >
>> > Probably the top level BQ is using BS2 because of the required clauses,
>> > but the nested BQ's are using BS because the docs are allowed out of order.
>> >
>> > In that case BS2 will use skipTo() on BS, and the BS.skipTo() 
>> > implementation
>> > could well be the culprit for performance. A long time ago BS.skipTo() 
>> > used to
>> > throw an unsupported operation exception, but that does not seem to
>> > be happening.
>> >
>> > Eks, could you try a toString() on the top level scorer for one of the 
>> > affected
>> > queries to see whether it shows BS2 on top level and BS for the inner 
>> > scorers?
>> >
>> > Regards,
>> > Paul Elschot
>> >
>> >
>> >>
>> >> BooleanQuery only uses BooleanScorer when there are no required terms,
>> >> and allowDocsOutOfOrder is true.  So I can't explain why you see this
>> >> setting changing anything on this query...
>> >>
>> >> Mike
>> >>
>> >> On Tue, Jul 14, 2009 at 7:04 PM, eks dev wrote:
>> >> >
>> >> > I do not know exactly why, but
>> >> > when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, 
>> >> > but with setAllowDocsOutOfOrder(false);  no problems whatsoever
>> >> >
>> >> > not really scientific method to find such bug, but does the job and 
>> >> > makes me happy.
>> >> >
>> >> > Empirical, "deprecated methods are not to be taken as thoroughly 
>> >> > tested, as they have short life expectancy"
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > - Original Message 
>> >> >> From: eks dev 
>> >> >> To: java-user@lucene.apache.org
>> >> >> Sent: Wednesday, 15 July, 2009 0:24:43
>> >> >> Subject: Re: speed of BooleanQueries on 2.9
>> >> >>
>> >> >>
>> >> >> Mike, we are definitely hitting something with this one!
>> >> >>
>> >> >> we had report from our QA chaps that our servers got stuck (limit is 
>> >> >> on 180
>> >> >> Seconds Request)... We are on average 14 Requsts per second has 
>> >> >> nothing to
>> >> >> do with gc() as
>> >> >> we can repeat it with freshly restarted searcher.
>> >> >>
>> >> >> - it happens on a less than 0.1% of queries, not much of a  pattern, 
>> >> >> repeatable
>> >> >> on our index...
>> >> >> it is always combination of two expanded tokens (we use
>> >> >> minimumNooShouldMatch)...
>> >> >>
>> >> >> (+(t1 [up to 40 expansions]) +(t2 [up to 40 expansions of t2]))
>> >> >> all tokens are with set boost, and  minNumShouldMatch is set to two
>> >> >>
>> >> >> I cannot provide self-contained test, nor index (contains sensitive 
>> >> >> data and is
>> >> >> rather big, ~5G)
>> >> >>
>> >> >> I can repeat this test on t1 and t2 with 40 expansions each. even if I 
>> >> >> take the
>> >> >> most frequent tokens in collection it runs well under one second...but 
>> >> >> these two
>> >> >> particular tokens with their "expansions" are making it run forever...
>> >> >>
>> >> >> and yes, if I run t1 plus expansions only, it runs super fast, the 
>> >> >> same for t2
>> >> >>
>> >> >> java 1.4U14, tried wit 1.6U6, no changes...
>> >> >>
>> >> >> will report if I dig something out
>> >> >>
>> >> >> partial stack trace while "stuck", cpu is on max:
>> >> >>
>> >> >> org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(Unknown
>> >> >> Source)
>> >> >> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
>> >> >> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
>> >> >> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
>> >> >> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
>> >> >> org.apache.lucene.search.Searcher.search(Unknown Source)
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> - Original Message 
>> >> >> > From: eks dev
>> >> >> > To: java-user@lucene.apache.org
>> >> >> > Sent: Monday, 13 July, 2009 13:28:45
>> >> >> > Subject: Re: speed of BooleanQueries on 2.9
>> >> >> >
>> >> >> > Hi Mike,
>> >> >> >
>> >> >> > getMaxNumOfCandidates() in test was 200, Index is optimised and 
>> >> >> > read-only
>> >> >> >
>> >> >> > We found (due to

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev

> Weird.  Have you run CheckIndex?
Nope, I guess it would bring nothing: we built the index twice, and the bug is
provoked by changing one parameter that controls only search => no corrupt index?

You think we should give it a try? Hell, why not :)

What do you mean by "Can you do a binary search to locate the term(s) that's 
causing it?"

I know exactly which term combination causes it: the last Query.toString() I
sent. If I simplify the Query by dropping one term with its expansions, it runs
fine... and if I replace any of these terms it works fine. We tried with
higher-freq. terms, lower... everything fine... bizarre

 




- Original Message 
> From: Michael McCandless 
> To: java-user@lucene.apache.org
> Sent: Wednesday, 15 July, 2009 19:57:09
> Subject: Re: speed of BooleanQueries on 2.9
> 
> OK thanks for the updates.  Yes, we are on the hunt now ;)  Something
> nasty is lurking...
> 
> Weird.  Have you run CheckIndex?
> 
> Can you do a binary search to locate the term(s) that's causing it?
> 
> It's great you see 10% speedup in searching overall (excluding these ones...)!
> 
> Mike
> 
> On Wed, Jul 15, 2009 at 1:49 PM, eks dev wrote:
> >
> >
> > 1. pls forget minNumberShould match, it is NOT set on this particular query 
> (minNumberShouldMatch is determined dynamically, depending on semantics of 
> user 
> query... sometimes triggers, sometimes not...).
> > This Exact Query here causes search to take longer than 180 Seconds with 
>  allowDocsOutOfOrder = true, and less than 70mS with false. Repeatable?!? No 
> gc() effects involved... on 2.4 it does not happen, it works fine with both 
> true/false for allowDocsOutOfOrder
> >
> > 2. re your test, That is exactly what makes me wonder, we also see average 
> performance almost 10% better on 2.9 (even on this index when we exclude 
> these 
> stuck searches),  but on this particular index our customer's QA managed to 
> find 
> these "stuck requests".
> >
> > 3. If I change tokens involved, in exactly same-structured Query, it runs 
> > fine 
> => The problem is somehow term-dependent (bah!)
> >
> > Please understand that I do not have direct access to this index and it 
> > makes 
> debug cycles slightly longer. Typically I give them some jar-s and they run 
> it 
> ans send me logs back... Sorry for inaccuracies in description, but I am sure 
> there is a problem in lucene... We tried it with Luke as well, freshly built 
> index, we see exactly the same behavior (no bugs in our app that could cause 
> it, 
> except maybe wrong lucene usage somewhere)
> >
> >
> > Hard, but please stay with me, we will fix one ugly bug :)
> >
> >
> >
> >
> >
> >
> >
> > - Original Message 
> >> From: Michael McCandless 
> >> To: java-user@lucene.apache.org
> >> Sent: Wednesday, 15 July, 2009 19:27:24
> >> Subject: Re: speed of BooleanQueries on 2.9
> >>
> >> But, that query can't accept a minNumberShouldMatch -- are you really
> >> setting that?  (You get 0 results if you set it, because the top
> >> boolean query has a single required clause).  Maybe you set it only on
> >> the inner large OR-query?  (But then I don't see the ~2 on that inner
> >> clause).
> >>
> >> I've tested a 21 term OR query, with allowDocsOutOfOrder true,
> >> numHits=200 on a Wikpedia index that matches 10M docs and I'm seeing
> >> the same perf on trunk & 2.4.
> >>
> >> Mike
> >>
> >> On Wed, Jul 15, 2009 at 11:41 AM, eks dev wrote:
> >> >
> >> > sorry for confusion, here is exact query that runs forever with
> >> setAllowDocsOutOfOrder:
> >> > You see it on stack trace taken while "stuck"
> >> 
> o.a.l.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(UnknownSource)
> >> >
> >> >
> >> > Query: +(((NAME:maria NAME:marae^0.25171682 NAME:marai^0.2365632
> >> NAME:marao^0.2365632 NAME:marau^0.2365632 NAME:marea^0.2834352
> >> NAME:marei^0.25171682 NAME:mareo^0.25171682 NAME:mareu^0.25171682
> >> NAME:marie^0.28577283 NAME:marieh^0.2451648 NAME:mariha^0.2583552
> >> NAME:mariu^0.27189124 NAME:marja^0.2834352 NAME:marje^0.2673408
> >> NAME:marji^0.25171682 NAME:marjo^0.25171682 NAME:marju^0.25171682
> >> NAME:marla^0.2673408 NAME:marle^0.25171682 NAME:marli^0.2365632
> >> NAME:marlo^0.2365632 NAME:maroa^0.2673408 NAME:maroe^0.25171682
> >> NAME:maroi^0.2365632 NAME:marou^0.2365632 NAME:marua^0.2673408
> >> NAME:marue^0.25171682 NAME:marui^0.2365632 NAME:maruo^0.2365632
> >> NAME:marye^0.2673408 NAME:maryi^0.25171682 NAME:maryo^0.25171682
> >> NAME:meria^0.2787888 NAME:miria^0.25835523 NAME:moria^0.25835523
> >> NAME:muria^0.25835523 NAME:naria^0.27648002 NAME:narie^0.25392002
> >> NAME:neria^0.25392002) (NAME:piekarski NAME:bekarski^0.1922
> >> NAME:beugarski^0.20281483 NAME:blacharski^0.1922
> >> >  NAME:lekarski^0.1922 NAME:pecarski^0.21294187 
> NAME:peikarski^0.27648002
> >> NAME:pekarska^0.20172001 NAME:pekarski^0.22446752 NAME:pekarskj^0.21294187
> >> NAME:pekarsky^0.21294187 NAME:pickarske^0.21168004 
> >> NAME:pickarski^0.22073482
> >> NAME:piekalski^0.

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Paul Elschot
As long as next(), skipTo(), doc() and score() on a Scorer work,
the search will be done. I hope the results are correct in this
case, but I'm not sure.

Regards,
Paul Elschot

On Wednesday 15 July 2009 19:08:00 Michael McCandless wrote:
> I don't think a toplevel BS2 is able to use BS as sub-scorers?  BS2
> needs to do doc-at-once, for all sub-scorers, but BS can't do that.  I
> think?
> 
> Mike
> 
> On Wed, Jul 15, 2009 at 12:10 PM, Paul Elschot wrote:
> > On Wednesday 15 July 2009 17:16:23 Michael McCandless wrote:
> >> So now I'm confused.  Since your query has required (+) clauses, the
> >> setAllowDocsOutOfOrder should have no effect, on either 2.4 or trunk.
> >
> > Probably the top level BQ is using BS2 because of the required clauses,
> > but the nested BQ's are using BS because the docs are allowed out of order.
> >
> > In that case BS2 will use skipTo() on BS, and the BS.skipTo() implementation
> > could well be the culprit for performance. A long time ago BS.skipTo() used 
> > to
> > throw an unsupported operation exception, but that does not seem to
> > be happening.
> >
> > Eks, could you try a toString() on the top level scorer for one of the 
> > affected
> > queries to see whether it shows BS2 on top level and BS for the inner 
> > scorers?
> >
> > Regards,
> > Paul Elschot
> >
> >
> >>
> >> BooleanQuery only uses BooleanScorer when there are no required terms,
> >> and allowDocsOutOfOrder is true.  So I can't explain why you see this
> >> setting changing anything on this query...
> >>
> >> Mike
> >>
> >> On Tue, Jul 14, 2009 at 7:04 PM, eks dev wrote:
> >> >
> >> > I do not know exactly why, but
> >> > when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, 
> >> > but with setAllowDocsOutOfOrder(false);  no problems whatsoever
> >> >
> >> > not really scientific method to find such bug, but does the job and 
> >> > makes me happy.
> >> >
> >> > Empirical, "deprecated methods are not to be taken as thoroughly tested, 
> >> > as they have short life expectancy"
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > - Original Message 
> >> >> From: eks dev 
> >> >> To: java-user@lucene.apache.org
> >> >> Sent: Wednesday, 15 July, 2009 0:24:43
> >> >> Subject: Re: speed of BooleanQueries on 2.9
> >> >>
> >> >>
> >> >> Mike, we are definitely hitting something with this one!
> >> >>
> >> >> we had report from our QA chaps that our servers got stuck (limit is on 
> >> >> 180
> >> >> Seconds Request)... We are on average 14 Requsts per second has 
> >> >> nothing to
> >> >> do with gc() as
> >> >> we can repeat it with freshly restarted searcher.
> >> >>
> >> >> - it happens on a less than 0.1% of queries, not much of a  pattern, 
> >> >> repeatable
> >> >> on our index...
> >> >> it is always combination of two expanded tokens (we use
> >> >> minimumNooShouldMatch)...
> >> >>
> >> >> (+(t1 [up to 40 expansions]) +(t2 [up to 40 expansions of t2]))
> >> >> all tokens are with set boost, and  minNumShouldMatch is set to two
> >> >>
> >> >> I cannot provide self-contained test, nor index (contains sensitive 
> >> >> data and is
> >> >> rather big, ~5G)
> >> >>
> >> >> I can repeat this test on t1 and t2 with 40 expansions each. even if I 
> >> >> take the
> >> >> most frequent tokens in collection it runs well under one second...but 
> >> >> these two
> >> >> particular tokens with their "expansions" are making it run forever...
> >> >>
> >> >> and yes, if I run t1 plus expansions only, it runs super fast, the same 
> >> >> for t2
> >> >>
> > >> >> java 1.4U14, tried with 1.6U6, no changes...
> >> >>
> >> >> will report if I dig something out
> >> >>
> >> >> partial stack trace while "stuck", cpu is on max:
> >> >>
> >> >> org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(Unknown
> >> >> Source)
> >> >> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
> >> >> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
> >> >> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
> >> >> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
> >> >> org.apache.lucene.search.Searcher.search(Unknown Source)
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> - Original Message 
> >> >> > From: eks dev
> >> >> > To: java-user@lucene.apache.org
> >> >> > Sent: Monday, 13 July, 2009 13:28:45
> >> >> > Subject: Re: speed of BooleanQueries on 2.9
> >> >> >
> >> >> > Hi Mike,
> >> >> >
> >> >> > getMaxNumOfCandidates() in test was 200, Index is optimised and 
> >> >> > read-only
> >> >> >
> >> >> > We found (due to an error in our warm-up code, funny) that only this 
> >> >> > Query
> >> >> runs
> >> >> > slower on 2.9.
> >> >> >
> > >> >> > A hint where to look could be that this Query contains the two most
> > >> >> > frequent tokens in two particular fields
> >> >> > NAME:hans and ZIPS:berlin (index has ca 80Mio very short documents, 
> >> >> > 3Mio
> >> >> unique
> >> >> > terms)
> >> >> >
> >> >> > But all o

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
OK thanks for the updates.  Yes, we are on the hunt now ;)  Something
nasty is lurking...

Weird.  Have you run CheckIndex?

Can you do a binary search to locate the term(s) that's causing it?

It's great you see 10% speedup in searching overall (excluding these ones...)!

Mike
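
A rough sketch of the binary search suggested above, assuming a single term is enough to trigger the slowdown no matter which other terms are present. `TermBisect`, `findCulprit`, and the `isSlow` predicate are made-up names, not Lucene API; in practice `isSlow` would build the query from the given terms, run it against the index, and compare the elapsed time to a threshold.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper, not part of Lucene: bisects a term list to find one slow term. */
class TermBisect {

    /**
     * Returns a single term for which isSlow still reports true, or null if the
     * full list is not slow or the problem disappears in both halves.
     * Assumes one culprit term alone is enough to reproduce the slowdown.
     */
    static String findCulprit(List<String> terms, Predicate<List<String>> isSlow) {
        if (terms.isEmpty() || !isSlow.test(terms)) {
            return null;
        }
        List<String> current = terms;
        while (current.size() > 1) {
            int mid = current.size() / 2;
            List<String> left = current.subList(0, mid);
            List<String> right = current.subList(mid, current.size());
            // Keep whichever half still reproduces the problem.
            if (isSlow.test(left)) {
                current = left;
            } else if (isSlow.test(right)) {
                current = right;
            } else {
                return null; // neither half alone is slow: the terms interact
            }
        }
        return current.get(0);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("maria", "marae", "marai", "marie", "meria");
        // Stand-in predicate: pretend "marie" is the term that makes the query slow.
        System.out.println(findCulprit(terms, ts -> ts.contains("marie")));
    }
}
```

Since the thread reports that each expansion group is fast on its own, one way to apply this would be to keep one group fixed inside `isSlow` and bisect only the other group.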

On Wed, Jul 15, 2009 at 1:49 PM, eks dev wrote:
>
>
> 1. Please forget minNumberShouldMatch; it is NOT set on this particular query
> (minNumberShouldMatch is determined dynamically, depending on the semantics of
> the user query... sometimes it triggers, sometimes not...).
> This exact query here causes the search to take longer than 180 seconds with
> allowDocsOutOfOrder = true, and less than 70 ms with false. Repeatable?!? No
> gc() effects involved... On 2.4 it does not happen; it works fine with both
> true and false for allowDocsOutOfOrder.
>
> 2. Re your test: that is exactly what makes me wonder. We also see average
> performance almost 10% better on 2.9 (even on this index, when we exclude
> these stuck searches), but on this particular index our customer's QA
> managed to find these "stuck requests".
>
> 3. If I change the tokens involved, in an exactly same-structured query, it
> runs fine => the problem is somehow term-dependent (bah!)
>
> Please understand that I do not have direct access to this index, which makes
> debug cycles slightly longer. Typically I give them some jars and they run
> them and send me logs back... Sorry for the inaccuracies in the description,
> but I am sure there is a problem in Lucene... We tried it with Luke as well,
> on a freshly built index, and we see exactly the same behavior (no bugs in our
> app that could cause it, except maybe wrong Lucene usage somewhere).
>
>
> Hard, but please stay with me, we will fix one ugly bug :)
>
>
>
>
>
>
>
> - Original Message 
>> From: Michael McCandless 
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, 15 July, 2009 19:27:24
>> Subject: Re: speed of BooleanQueries on 2.9
>>
>> But, that query can't accept a minNumberShouldMatch -- are you really
>> setting that?  (You get 0 results if you set it, because the top
>> boolean query has a single required clause).  Maybe you set it only on
>> the inner large OR-query?  (But then I don't see the ~2 on that inner
>> clause).
>>
>> I've tested a 21 term OR query, with allowDocsOutOfOrder true,
>> numHits=200 on a Wikipedia index that matches 10M docs and I'm seeing
>> the same perf on trunk & 2.4.
>>
>> Mike
>>
>> On Wed, Jul 15, 2009 at 11:41 AM, eks devwrote:
>> >
>> > sorry for confusion, here is exact query that runs forever with
>> setAllowDocsOutOfOrder:
>> > You see it on stack trace taken while "stuck"
>> o.a.l.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(UnknownSource)
>> >
>> >
>> > Query: +(((NAME:maria NAME:marae^0.25171682 NAME:marai^0.2365632
>> NAME:marao^0.2365632 NAME:marau^0.2365632 NAME:marea^0.2834352
>> NAME:marei^0.25171682 NAME:mareo^0.25171682 NAME:mareu^0.25171682
>> NAME:marie^0.28577283 NAME:marieh^0.2451648 NAME:mariha^0.2583552
>> NAME:mariu^0.27189124 NAME:marja^0.2834352 NAME:marje^0.2673408
>> NAME:marji^0.25171682 NAME:marjo^0.25171682 NAME:marju^0.25171682
>> NAME:marla^0.2673408 NAME:marle^0.25171682 NAME:marli^0.2365632
>> NAME:marlo^0.2365632 NAME:maroa^0.2673408 NAME:maroe^0.25171682
>> NAME:maroi^0.2365632 NAME:marou^0.2365632 NAME:marua^0.2673408
>> NAME:marue^0.25171682 NAME:marui^0.2365632 NAME:maruo^0.2365632
>> NAME:marye^0.2673408 NAME:maryi^0.25171682 NAME:maryo^0.25171682
>> NAME:meria^0.2787888 NAME:miria^0.25835523 NAME:moria^0.25835523
>> NAME:muria^0.25835523 NAME:naria^0.27648002 NAME:narie^0.25392002
>> NAME:neria^0.25392002) (NAME:piekarski NAME:bekarski^0.1922
>> NAME:beugarski^0.20281483 NAME:blacharski^0.1922
>> >  NAME:lekarski^0.1922 NAME:pecarski^0.21294187 
>> > NAME:peikarski^0.27648002
>> NAME:pekarska^0.20172001 NAME:pekarski^0.22446752 NAME:pekarskj^0.21294187
>> NAME:pekarsky^0.21294187 NAME:pickarske^0.21168004 NAME:pickarski^0.22073482
>> NAME:piekalski^0.23941332 NAME:piekanski^0.23941332 NAME:piekaraka^0.2255
>> NAME:piekarsci^0.29205337 NAME:piekarska^0.28421336 
>> NAME:piekarskie^0.25392002
>> NAME:piekarsky^0.29205337 NAME:piekarzcyk^0.23232001 
>> NAME:piekarzki^0.29205337
>> NAME:piekaski^0.24843001 NAME:piekavska^0.2255 NAME:piekorski^0.28421336
>> NAME:pielarski^0.22997928 NAME:pierarski^0.22997928 
>> NAME:pierkarski^0.24661335
>> NAME:piesarski^0.22997928 NAME:pietarski^0.22997928 
>> NAME:pietkarski^0.24661335
>> NAME:pikarski^0.23232001 NAME:piowarski^0.20281483 NAME:pirkarski^0.22073482
>> NAME:plocharski^0.21168004 NAME:pokarski^0.20172001 
>> NAME:polikarski^0.20172001
>> NAME:pukarski^0.20172001 NAME:pyekarska^0.26508 
>> NAME:siekarski^0.20281483))^2.0)
>> >
>> >
>> >
>> >
>> >
>> > - Original Message 
>> >> From: Michael McCandless
>> >> To: java-user@lucene.apache.org
>> >> Sent: Wednesday, 15 July, 2009 17:16:23
>> >> Subject: Re: speed of BooleanQueries on 2.9
>> >>
>> >> So now I'm confused

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev


> Is it possible for you to make the problem happen such that we get
> line numbers in this traceback?

Sure, I will build Lucene trunk with debug info/line numbers enabled and ask
the customer's QA to run it again...

> Is CPU pegged when it's stuck?
Yes! One core was 100% hot.



- Original Message 
> From: Michael McCandless 
> To: java-user@lucene.apache.org
> Sent: Wednesday, 15 July, 2009 19:30:42
> Subject: Re: speed of BooleanQueries on 2.9
> 
> On Wed, Jul 15, 2009 at 11:41 AM, eks devwrote:
> 
> > You see it on stack trace taken while "stuck" 
> o.a.l.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(UnknownSource)
> 
> Is it possible for you to make the problem happen such that we get
> line numbers in this traceback?
> 
> Is CPU pegged when it's stuck?
> 
> Mike
> 



Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev


1. Please forget minNumberShouldMatch; it is NOT set on this particular query
(minNumberShouldMatch is determined dynamically, depending on the semantics of
the user query... sometimes it triggers, sometimes not...).
This exact query here causes the search to take longer than 180 seconds with
allowDocsOutOfOrder = true, and less than 70 ms with false. Repeatable?!? No
gc() effects involved... On 2.4 it does not happen; it works fine with both
true and false for allowDocsOutOfOrder.

2. Re your test: that is exactly what makes me wonder. We also see average
performance almost 10% better on 2.9 (even on this index, when we exclude these
stuck searches), but on this particular index our customer's QA managed to
find these "stuck requests".

3. If I change the tokens involved, in an exactly same-structured query, it runs
fine => the problem is somehow term-dependent (bah!)

Please understand that I do not have direct access to this index, which makes
debug cycles slightly longer. Typically I give them some jars and they run them
and send me logs back... Sorry for the inaccuracies in the description, but I am
sure there is a problem in Lucene... We tried it with Luke as well, on a freshly
built index, and we see exactly the same behavior (no bugs in our app that could
cause it, except maybe wrong Lucene usage somewhere).
  

Hard, but please stay with me, we will fix one ugly bug :)

 





- Original Message 
> From: Michael McCandless 
> To: java-user@lucene.apache.org
> Sent: Wednesday, 15 July, 2009 19:27:24
> Subject: Re: speed of BooleanQueries on 2.9
> 
> But, that query can't accept a minNumberShouldMatch -- are you really
> setting that?  (You get 0 results if you set it, because the top
> boolean query has a single required clause).  Maybe you set it only on
> the inner large OR-query?  (But then I don't see the ~2 on that inner
> clause).
> 
> I've tested a 21 term OR query, with allowDocsOutOfOrder true,
> numHits=200 on a Wikipedia index that matches 10M docs and I'm seeing
> the same perf on trunk & 2.4.
> 
> Mike
> 
> On Wed, Jul 15, 2009 at 11:41 AM, eks devwrote:
> >
> > sorry for confusion, here is exact query that runs forever with 
> setAllowDocsOutOfOrder:
> > You see it on stack trace taken while "stuck" 
> o.a.l.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(UnknownSource)
> >
> >
> > Query: +(((NAME:maria NAME:marae^0.25171682 NAME:marai^0.2365632 
> NAME:marao^0.2365632 NAME:marau^0.2365632 NAME:marea^0.2834352 
> NAME:marei^0.25171682 NAME:mareo^0.25171682 NAME:mareu^0.25171682 
> NAME:marie^0.28577283 NAME:marieh^0.2451648 NAME:mariha^0.2583552 
> NAME:mariu^0.27189124 NAME:marja^0.2834352 NAME:marje^0.2673408 
> NAME:marji^0.25171682 NAME:marjo^0.25171682 NAME:marju^0.25171682 
> NAME:marla^0.2673408 NAME:marle^0.25171682 NAME:marli^0.2365632 
> NAME:marlo^0.2365632 NAME:maroa^0.2673408 NAME:maroe^0.25171682 
> NAME:maroi^0.2365632 NAME:marou^0.2365632 NAME:marua^0.2673408 
> NAME:marue^0.25171682 NAME:marui^0.2365632 NAME:maruo^0.2365632 
> NAME:marye^0.2673408 NAME:maryi^0.25171682 NAME:maryo^0.25171682 
> NAME:meria^0.2787888 NAME:miria^0.25835523 NAME:moria^0.25835523 
> NAME:muria^0.25835523 NAME:naria^0.27648002 NAME:narie^0.25392002 
> NAME:neria^0.25392002) (NAME:piekarski NAME:bekarski^0.1922 
> NAME:beugarski^0.20281483 NAME:blacharski^0.1922
> >  NAME:lekarski^0.1922 NAME:pecarski^0.21294187 
> > NAME:peikarski^0.27648002 
> NAME:pekarska^0.20172001 NAME:pekarski^0.22446752 NAME:pekarskj^0.21294187 
> NAME:pekarsky^0.21294187 NAME:pickarske^0.21168004 NAME:pickarski^0.22073482 
> NAME:piekalski^0.23941332 NAME:piekanski^0.23941332 NAME:piekaraka^0.2255 
> NAME:piekarsci^0.29205337 NAME:piekarska^0.28421336 
> NAME:piekarskie^0.25392002 
> NAME:piekarsky^0.29205337 NAME:piekarzcyk^0.23232001 
> NAME:piekarzki^0.29205337 
> NAME:piekaski^0.24843001 NAME:piekavska^0.2255 NAME:piekorski^0.28421336 
> NAME:pielarski^0.22997928 NAME:pierarski^0.22997928 
> NAME:pierkarski^0.24661335 
> NAME:piesarski^0.22997928 NAME:pietarski^0.22997928 
> NAME:pietkarski^0.24661335 
> NAME:pikarski^0.23232001 NAME:piowarski^0.20281483 NAME:pirkarski^0.22073482 
> NAME:plocharski^0.21168004 NAME:pokarski^0.20172001 
> NAME:polikarski^0.20172001 
> NAME:pukarski^0.20172001 NAME:pyekarska^0.26508 
> NAME:siekarski^0.20281483))^2.0)
> >
> >
> >
> >
> >
> > - Original Message 
> >> From: Michael McCandless 
> >> To: java-user@lucene.apache.org
> >> Sent: Wednesday, 15 July, 2009 17:16:23
> >> Subject: Re: speed of BooleanQueries on 2.9
> >>
> >> So now I'm confused.  Since your query has required (+) clauses, the
> >> setAllowDocsOutOfOrder should have no effect, on either 2.4 or trunk.
> >>
> >> BooleanQuery only uses BooleanScorer when there are no required terms,
> >> and allowDocsOutOfOrder is true.  So I can't explain why you see this
> >> setting changing anything on this query...
> >>
> >> Mike
> >>
> >> On Tue, Jul 14, 2009 at 7:04 PM, eks devwrote:
> >> >
> >> 

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
On Wed, Jul 15, 2009 at 11:41 AM, eks dev wrote:

> You see it on stack trace taken while "stuck" 
> o.a.l.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(UnknownSource)

Is it possible for you to make the problem happen such that we get
line numbers in this traceback?

Is CPU pegged when it's stuck?

Mike




Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
But, that query can't accept a minNumberShouldMatch -- are you really
setting that?  (You get 0 results if you set it, because the top
boolean query has a single required clause).  Maybe you set it only on
the inner large OR-query?  (But then I don't see the ~2 on that inner
clause).

I've tested a 21 term OR query, with allowDocsOutOfOrder true,
numHits=200 on a Wikipedia index that matches 10M docs and I'm seeing
the same perf on trunk & 2.4.

Mike

On Wed, Jul 15, 2009 at 11:41 AM, eks dev wrote:
>
> sorry for confusion, here is exact query that runs forever with 
> setAllowDocsOutOfOrder:
> You see it on stack trace taken while "stuck" 
> o.a.l.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(UnknownSource)
>
>
> Query: +(((NAME:maria NAME:marae^0.25171682 NAME:marai^0.2365632 
> NAME:marao^0.2365632 NAME:marau^0.2365632 NAME:marea^0.2834352 
> NAME:marei^0.25171682 NAME:mareo^0.25171682 NAME:mareu^0.25171682 
> NAME:marie^0.28577283 NAME:marieh^0.2451648 NAME:mariha^0.2583552 
> NAME:mariu^0.27189124 NAME:marja^0.2834352 NAME:marje^0.2673408 
> NAME:marji^0.25171682 NAME:marjo^0.25171682 NAME:marju^0.25171682 
> NAME:marla^0.2673408 NAME:marle^0.25171682 NAME:marli^0.2365632 
> NAME:marlo^0.2365632 NAME:maroa^0.2673408 NAME:maroe^0.25171682 
> NAME:maroi^0.2365632 NAME:marou^0.2365632 NAME:marua^0.2673408 
> NAME:marue^0.25171682 NAME:marui^0.2365632 NAME:maruo^0.2365632 
> NAME:marye^0.2673408 NAME:maryi^0.25171682 NAME:maryo^0.25171682 
> NAME:meria^0.2787888 NAME:miria^0.25835523 NAME:moria^0.25835523 
> NAME:muria^0.25835523 NAME:naria^0.27648002 NAME:narie^0.25392002 
> NAME:neria^0.25392002) (NAME:piekarski NAME:bekarski^0.1922 
> NAME:beugarski^0.20281483 NAME:blacharski^0.1922
>  NAME:lekarski^0.1922 NAME:pecarski^0.21294187 NAME:peikarski^0.27648002 
> NAME:pekarska^0.20172001 NAME:pekarski^0.22446752 NAME:pekarskj^0.21294187 
> NAME:pekarsky^0.21294187 NAME:pickarske^0.21168004 NAME:pickarski^0.22073482 
> NAME:piekalski^0.23941332 NAME:piekanski^0.23941332 NAME:piekaraka^0.2255 
> NAME:piekarsci^0.29205337 NAME:piekarska^0.28421336 
> NAME:piekarskie^0.25392002 NAME:piekarsky^0.29205337 
> NAME:piekarzcyk^0.23232001 NAME:piekarzki^0.29205337 NAME:piekaski^0.24843001 
> NAME:piekavska^0.2255 NAME:piekorski^0.28421336 NAME:pielarski^0.22997928 
> NAME:pierarski^0.22997928 NAME:pierkarski^0.24661335 
> NAME:piesarski^0.22997928 NAME:pietarski^0.22997928 
> NAME:pietkarski^0.24661335 NAME:pikarski^0.23232001 NAME:piowarski^0.20281483 
> NAME:pirkarski^0.22073482 NAME:plocharski^0.21168004 NAME:pokarski^0.20172001 
> NAME:polikarski^0.20172001 NAME:pukarski^0.20172001 NAME:pyekarska^0.26508 
> NAME:siekarski^0.20281483))^2.0)
>
>
>
>
>
> - Original Message 
>> From: Michael McCandless 
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, 15 July, 2009 17:16:23
>> Subject: Re: speed of BooleanQueries on 2.9
>>
>> So now I'm confused.  Since your query has required (+) clauses, the
>> setAllowDocsOutOfOrder should have no effect, on either 2.4 or trunk.
>>
>> BooleanQuery only uses BooleanScorer when there are no required terms,
>> and allowDocsOutOfOrder is true.  So I can't explain why you see this
>> setting changing anything on this query...
>>
>> Mike
>>
>> On Tue, Jul 14, 2009 at 7:04 PM, eks devwrote:
>> >
>> > I do not know exactly why, but
>> > when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, but 
>> > with
>> setAllowDocsOutOfOrder(false);  no problems whatsoever
>> >
>> > not really scientific method to find such bug, but does the job and makes 
>> > me
>> happy.
>> >
>> > Empirical, "deprecated methods are not to be taken as thoroughly tested, as
>> they have short life expectancy"
>> >
>> >
>> >
>> >
>> >
>> > - Original Message 
>> >> From: eks dev
>> >> To: java-user@lucene.apache.org
>> >> Sent: Wednesday, 15 July, 2009 0:24:43
>> >> Subject: Re: speed of BooleanQueries on 2.9
>> >>
>> >>
>> >> Mike, we are definitely hitting something with this one!
>> >>
>> >> We had a report from our QA chaps that our servers got stuck (the limit
>> >> is 180 seconds per request)... We average 14 requests per second. It has
>> >> nothing to do with gc(), as we can repeat it with a freshly restarted
>> >> searcher.
>> >>
>> >> - It happens on less than 0.1% of queries, without much of a pattern, but
>> >> it is repeatable on our index...
>> >> It is always a combination of two expanded tokens (we use
>> >> minimumNumberShouldMatch)...
>> >>
>> >> (+(t1 [up to 40 expansions]) +(t2 [up to 40 expansions of t2]))
>> >> all tokens are with set boost, and  minNumShouldMatch is set to two
>> >>
>> >> I cannot provide self-contained test, nor index (contains sensitive data 
>> >> and
>> is
>> >> rather big, ~5G)
>> >>
>> >> I can repeat this test on t1 and t2 with 40 expansions each. even if I 
>> >> take
>> the
>> >> most frequent tokens in collection it runs well under one second...but 
>> >> these
>> two
>> >> particular token

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
I don't think a toplevel BS2 is able to use BS as sub-scorers?  BS2
needs to do doc-at-once, for all sub-scorers, but BS can't do that.  I
think?

Mike
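
As a toy illustration of why an out-of-order scorer cannot serve as a doc-at-a-time sub-scorer, the sketch below walks one posting list at a time and accumulates matches per document. It is a deliberately simplified model and not Lucene's BooleanScorer; `TermAtATimeOr`, `collect`, and the sample postings are all invented for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of out-of-order OR scoring; not Lucene's BooleanScorer. */
class TermAtATimeOr {

    /**
     * Walks one posting list at a time and counts matching clauses per doc.
     * Returns doc ids in first-seen order, which is generally not doc id order.
     */
    static List<Integer> collect(int[][] postingsPerTerm) {
        Map<Integer, Integer> matchCounts = new LinkedHashMap<>();
        for (int[] postings : postingsPerTerm) {
            for (int doc : postings) {
                // Accumulate one "point" per matching clause for this doc.
                matchCounts.merge(doc, 1, Integer::sum);
            }
        }
        return new ArrayList<>(matchCounts.keySet());
    }

    public static void main(String[] args) {
        int[][] postings = { {5, 9}, {2, 5} };
        System.out.println(collect(postings)); // prints [5, 9, 2]
    }
}
```

Doc 2 is reported after doc 9 here, so a caller that expects "give me the next doc at or after N" has nothing well-defined to ask for.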

On Wed, Jul 15, 2009 at 12:10 PM, Paul Elschot wrote:
> On Wednesday 15 July 2009 17:16:23 Michael McCandless wrote:
>> So now I'm confused.  Since your query has required (+) clauses, the
>> setAllowDocsOutOfOrder should have no effect, on either 2.4 or trunk.
>
> Probably the top level BQ is using BS2 because of the required clauses,
> but the nested BQ's are using BS because the docs are allowed out of order.
>
> In that case BS2 will use skipTo() on BS, and the BS.skipTo() implementation
> could well be the culprit for performance. A long time ago BS.skipTo() used to
> throw an unsupported operation exception, but that does not seem to
> be happening.
>
> Eks, could you try a toString() on the top level scorer for one of the 
> affected
> queries to see whether it shows BS2 on top level and BS for the inner scorers?
>
> Regards,
> Paul Elschot
>
>
>>
>> BooleanQuery only uses BooleanScorer when there are no required terms,
>> and allowDocsOutOfOrder is true.  So I can't explain why you see this
>> setting changing anything on this query...
>>
>> Mike
>>
>> On Tue, Jul 14, 2009 at 7:04 PM, eks dev wrote:
>> >
>> > I do not know exactly why, but
>> > when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, but 
>> > with setAllowDocsOutOfOrder(false);  no problems whatsoever
>> >
>> > not really scientific method to find such bug, but does the job and makes 
>> > me happy.
>> >
>> > Empirical, "deprecated methods are not to be taken as thoroughly tested, 
>> > as they have short life expectancy"
>> >
>> >
>> >
>> >
>> >
>> > - Original Message 
>> >> From: eks dev 
>> >> To: java-user@lucene.apache.org
>> >> Sent: Wednesday, 15 July, 2009 0:24:43
>> >> Subject: Re: speed of BooleanQueries on 2.9
>> >>
>> >>
>> >> Mike, we are definitely hitting something with this one!
>> >>
>> >> We had a report from our QA chaps that our servers got stuck (the limit
>> >> is 180 seconds per request)... We average 14 requests per second. It has
>> >> nothing to do with gc(), as we can repeat it with a freshly restarted
>> >> searcher.
>> >>
>> >> - It happens on less than 0.1% of queries, without much of a pattern, but
>> >> it is repeatable on our index...
>> >> It is always a combination of two expanded tokens (we use
>> >> minimumNumberShouldMatch)...
>> >>
>> >> (+(t1 [up to 40 expansions]) +(t2 [up to 40 expansions of t2]))
>> >> all tokens are with set boost, and  minNumShouldMatch is set to two
>> >>
>> >> I cannot provide self-contained test, nor index (contains sensitive data 
>> >> and is
>> >> rather big, ~5G)
>> >>
>> >> I can repeat this test on t1 and t2 with 40 expansions each. even if I 
>> >> take the
>> >> most frequent tokens in collection it runs well under one second...but 
>> >> these two
>> >> particular tokens with their "expansions" are making it run forever...
>> >>
>> >> and yes, if I run t1 plus expansions only, it runs super fast, the same 
>> >> for t2
>> >>
>> >> java 1.4U14, tried with 1.6U6, no changes...
>> >>
>> >> will report if I dig something out
>> >>
>> >> partial stack trace while "stuck", cpu is on max:
>> >>
>> >> org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(Unknown
>> >> Source)
>> >> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
>> >> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
>> >> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
>> >> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
>> >> org.apache.lucene.search.Searcher.search(Unknown Source)
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> - Original Message 
>> >> > From: eks dev
>> >> > To: java-user@lucene.apache.org
>> >> > Sent: Monday, 13 July, 2009 13:28:45
>> >> > Subject: Re: speed of BooleanQueries on 2.9
>> >> >
>> >> > Hi Mike,
>> >> >
>> >> > getMaxNumOfCandidates() in test was 200, Index is optimised and 
>> >> > read-only
>> >> >
>> >> > We found (due to an error in our warm-up code, funny) that only this 
>> >> > Query
>> >> runs
>> >> > slower on 2.9.
>> >> >
>> >> > A hint where to look could be that this Query contains the two most
>> >> > frequent tokens in two particular fields
>> >> > NAME:hans and ZIPS:berlin (index has ca 80Mio very short documents, 3Mio
>> >> unique
>> >> > terms)
>> >> >
>> >> > But all of this *could be just wrong measurement*, I just could not 
>> >> > spend more
>> >>
>> >> > time to get to the bottom of this. We moved forward as we got overall 
>> >> > better
>> >> > average performance (sweet 10% in average) on much bigger real query 
>> >> > log from
>> >> > our regression test.
>> >> >
>> >> > Anyhow I just wanted to throw it out, maybe it triggers some synapses 
>> >> > :) If
>> >> > false alarm, sorry.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > - Original Message 
>> >> > > Fro

searching for c++, c#, etc...

2009-07-15 Thread Chris Salem
Hello,
I'm trying to search for terms like c++, but the parser is stripping off the
++.  I tried escaping the ++ with slashes, but it's still being stripped off.  I
could replace + with "plus"; is that the best way to do it?  Why isn't escaping
working?
thanks
Sincerely,
Chris Salem 
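
One possible workaround, sketched under the assumption that the analyzer (rather than the query parser) is what drops the `+` and `#` characters, in which case escaping alone would not help. The idea is to rewrite such tokens into letter-only forms before the text reaches the analyzer, at both index time and query time. `SymbolTerms`, `rewrite`, and the mapping table are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical pre-processing step; apply it to both indexed text and query text. */
class SymbolTerms {

    // Example mapping only; extend it with whatever symbol-bearing terms matter.
    private static final Map<String, String> REWRITES = new LinkedHashMap<>();
    static {
        REWRITES.put("c++", "cplusplus");
        REWRITES.put("c#", "csharp");
    }

    /** Replaces whole whitespace-separated tokens that appear in the mapping. */
    static String rewrite(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // leading whitespace yields one empty token
            }
            String replacement = REWRITES.get(token.toLowerCase(Locale.ROOT));
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(replacement != null ? replacement : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("senior C++ developer"));
    }
}
```

This sketch only matches tokens delimited by whitespace, so something like `C++,` with trailing punctuation would pass through unchanged; a real implementation would need to handle that.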


Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Paul Elschot
On Wednesday 15 July 2009 17:16:23 Michael McCandless wrote:
> So now I'm confused.  Since your query has required (+) clauses, the
> setAllowDocsOutOfOrder should have no effect, on either 2.4 or trunk.

Probably the top level BQ is using BS2 because of the required clauses,
but the nested BQ's are using BS because the docs are allowed out of order.

In that case BS2 will use skipTo() on BS, and the BS.skipTo() implementation
could well be the culprit for performance. A long time ago BS.skipTo() used to
throw an unsupported operation exception, but that does not seem to
be happening.

Eks, could you try a toString() on the top level scorer for one of the affected
queries to see whether it shows BS2 on top level and BS for the inner scorers?

Regards,
Paul Elschot
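
To make the skipTo() dependency concrete, here is a simplified sketch of a doc-at-a-time conjunction over two required clauses. It is not Lucene's BooleanScorer2 code; `DocIterator`, `skipTo`, and `intersect` are stand-in names loosely modelled on the scorer API, and each iterator is assumed to return doc ids in increasing order.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a scorer over a sorted doc id list; not Lucene code. */
class DocIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs; // must be sorted ascending
    private int pos = 0;

    DocIterator(int... sortedDocs) {
        this.docs = sortedDocs;
    }

    /** Advances to the first doc >= target and returns it, or NO_MORE_DOCS. */
    int skipTo(int target) {
        while (pos < docs.length && docs[pos] < target) {
            pos++;
        }
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    /** Doc-at-a-time intersection: each side repeatedly skips to the other's doc. */
    static List<Integer> intersect(DocIterator a, DocIterator b) {
        List<Integer> hits = new ArrayList<>();
        int da = a.skipTo(0);
        int db = b.skipTo(0);
        while (da != NO_MORE_DOCS && db != NO_MORE_DOCS) {
            if (da == db) {
                hits.add(da); // both required clauses match this doc
                da = a.skipTo(da + 1);
                db = b.skipTo(db + 1);
            } else if (da < db) {
                da = a.skipTo(db);
            } else {
                db = b.skipTo(da);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        DocIterator t1 = new DocIterator(1, 3, 5, 8, 13);
        DocIterator t2 = new DocIterator(2, 3, 8, 9, 13, 21);
        System.out.println(intersect(t1, t2)); // prints [3, 8, 13]
    }
}
```

The loop is only correct and cheap if every `skipTo` call moves forward over sorted doc ids; a sub-scorer that emits docs out of order, or that emulates `skipTo` expensively, would break or slow down exactly this step.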


> 
> BooleanQuery only uses BooleanScorer when there are no required terms,
> and allowDocsOutOfOrder is true.  So I can't explain why you see this
> setting changing anything on this query...
> 
> Mike
> 
> On Tue, Jul 14, 2009 at 7:04 PM, eks dev wrote:
> >
> > I do not know exactly why, but
> > when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, but 
> > with setAllowDocsOutOfOrder(false);  no problems whatsoever
> >
> > not really scientific method to find such bug, but does the job and makes 
> > me happy.
> >
> > Empirical, "deprecated methods are not to be taken as thoroughly tested, as 
> > they have short life expectancy"
> >
> >
> >
> >
> >
> > - Original Message 
> >> From: eks dev 
> >> To: java-user@lucene.apache.org
> >> Sent: Wednesday, 15 July, 2009 0:24:43
> >> Subject: Re: speed of BooleanQueries on 2.9
> >>
> >>
> >> Mike, we are definitely hitting something with this one!
> >>
> >> We had a report from our QA chaps that our servers got stuck (the limit is
> >> 180 seconds per request)... We average 14 requests per second. It has
> >> nothing to do with gc(), as we can repeat it with a freshly restarted
> >> searcher.
> >>
> >> - It happens on less than 0.1% of queries, without much of a pattern, but
> >> it is repeatable on our index...
> >> It is always a combination of two expanded tokens (we use
> >> minimumNumberShouldMatch)...
> >>
> >> (+(t1 [up to 40 expansions]) +(t2 [up to 40 expansions of t2]))
> >> all tokens are with set boost, and  minNumShouldMatch is set to two
> >>
> >> I cannot provide self-contained test, nor index (contains sensitive data 
> >> and is
> >> rather big, ~5G)
> >>
> >> I can repeat this test on t1 and t2 with 40 expansions each. even if I 
> >> take the
> >> most frequent tokens in collection it runs well under one second...but 
> >> these two
> >> particular tokens with their "expansions" are making it run forever...
> >>
> >> and yes, if I run t1 plus expansions only, it runs super fast, the same 
> >> for t2
> >>
> >> java 1.4U14, tried with 1.6U6, no changes...
> >>
> >> will report if I dig something out
> >>
> >> partial stack trace while "stuck", cpu is on max:
> >>
> >> org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(Unknown
> >> Source)
> >> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
> >> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
> >> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
> >> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
> >> org.apache.lucene.search.Searcher.search(Unknown Source)
> >>
> >>
> >>
> >>
> >>
> >> - Original Message 
> >> > From: eks dev
> >> > To: java-user@lucene.apache.org
> >> > Sent: Monday, 13 July, 2009 13:28:45
> >> > Subject: Re: speed of BooleanQueries on 2.9
> >> >
> >> > Hi Mike,
> >> >
> >> > getMaxNumOfCandidates() in test was 200, Index is optimised and read-only
> >> >
> >> > We found (due to an error in our warm-up code, funny) that only this 
> >> > Query
> >> runs
> >> > slower on 2.9.
> >> >
> >> > A hint where to look could be that this Query contains the two most
> >> > frequent tokens in two particular fields
> >> > NAME:hans and ZIPS:berlin (index has ca 80Mio very short documents, 3Mio
> >> unique
> >> > terms)
> >> >
> >> > But all of this *could be just wrong measurement*, I just could not 
> >> > spend more
> >>
> >> > time to get to the bottom of this. We moved forward as we got overall 
> >> > better
> >> > average performance (sweet 10% in average) on much bigger real query log 
> >> > from
> >> > our regression test.
> >> >
> >> > Anyhow I just wanted to throw it out, maybe it triggers some synapses :) 
> >> > If
> >> > false alarm, sorry.
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > - Original Message 
> >> > > From: Michael McCandless
> >> > > To: java-user@lucene.apache.org
> >> > > Sent: Monday, 13 July, 2009 11:50:48
> >> > > Subject: Re: speed of BooleanQueries on 2.9
> >> > >
> >> > > This is not expected; 2.9 has had a number of changes that ought to
> >> > > reduce CPU cost of searching.  If this holds up we definitely need to
> >> > > get to the root cause.
> >> > >
> >> 

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev

Sorry for the confusion; here is the exact query that runs forever with
setAllowDocsOutOfOrder(true).
You can see it in the stack trace taken while "stuck":
o.a.l.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(UnknownSource)


Query: +(((NAME:maria NAME:marae^0.25171682 NAME:marai^0.2365632 
NAME:marao^0.2365632 NAME:marau^0.2365632 NAME:marea^0.2834352 
NAME:marei^0.25171682 NAME:mareo^0.25171682 NAME:mareu^0.25171682 
NAME:marie^0.28577283 NAME:marieh^0.2451648 NAME:mariha^0.2583552 
NAME:mariu^0.27189124 NAME:marja^0.2834352 NAME:marje^0.2673408 
NAME:marji^0.25171682 NAME:marjo^0.25171682 NAME:marju^0.25171682 
NAME:marla^0.2673408 NAME:marle^0.25171682 NAME:marli^0.2365632 
NAME:marlo^0.2365632 NAME:maroa^0.2673408 NAME:maroe^0.25171682 
NAME:maroi^0.2365632 NAME:marou^0.2365632 NAME:marua^0.2673408 
NAME:marue^0.25171682 NAME:marui^0.2365632 NAME:maruo^0.2365632 
NAME:marye^0.2673408 NAME:maryi^0.25171682 NAME:maryo^0.25171682 
NAME:meria^0.2787888 NAME:miria^0.25835523 NAME:moria^0.25835523 
NAME:muria^0.25835523 NAME:naria^0.27648002 NAME:narie^0.25392002 
NAME:neria^0.25392002) (NAME:piekarski NAME:bekarski^0.1922 
NAME:beugarski^0.20281483 NAME:blacharski^0.1922
 NAME:lekarski^0.1922 NAME:pecarski^0.21294187 NAME:peikarski^0.27648002 
NAME:pekarska^0.20172001 NAME:pekarski^0.22446752 NAME:pekarskj^0.21294187 
NAME:pekarsky^0.21294187 NAME:pickarske^0.21168004 NAME:pickarski^0.22073482 
NAME:piekalski^0.23941332 NAME:piekanski^0.23941332 NAME:piekaraka^0.2255 
NAME:piekarsci^0.29205337 NAME:piekarska^0.28421336 NAME:piekarskie^0.25392002 
NAME:piekarsky^0.29205337 NAME:piekarzcyk^0.23232001 NAME:piekarzki^0.29205337 
NAME:piekaski^0.24843001 NAME:piekavska^0.2255 NAME:piekorski^0.28421336 
NAME:pielarski^0.22997928 NAME:pierarski^0.22997928 NAME:pierkarski^0.24661335 
NAME:piesarski^0.22997928 NAME:pietarski^0.22997928 NAME:pietkarski^0.24661335 
NAME:pikarski^0.23232001 NAME:piowarski^0.20281483 NAME:pirkarski^0.22073482 
NAME:plocharski^0.21168004 NAME:pokarski^0.20172001 NAME:polikarski^0.20172001 
NAME:pukarski^0.20172001 NAME:pyekarska^0.26508 NAME:siekarski^0.20281483))^2.0)





- Original Message 
> From: Michael McCandless 
> To: java-user@lucene.apache.org
> Sent: Wednesday, 15 July, 2009 17:16:23
> Subject: Re: speed of BooleanQueries on 2.9
> 
> So now I'm confused.  Since your query has required (+) clauses, the
> setAllowDocsOutOfOrder should have no effect, on either 2.4 or trunk.
> 
> BooleanQuery only uses BooleanScorer when there are no required terms,
> and allowDocsOutOfOrder is true.  So I can't explain why you see this
> setting changing anything on this query...
> 
> Mike
> 
> On Tue, Jul 14, 2009 at 7:04 PM, eks dev wrote:
> >
> > I do not know exactly why, but
> > when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, but
> > with setAllowDocsOutOfOrder(false);  no problems whatsoever
> >
> > not really scientific method to find such bug, but does the job and makes
> > me happy.
> >
> > Empirical, "deprecated methods are not to be taken as thoroughly tested, as
> > they have short life expectancy"
> >
> >
> >
> >
> >
> > - Original Message 
> >> From: eks dev 
> >> To: java-user@lucene.apache.org
> >> Sent: Wednesday, 15 July, 2009 0:24:43
> >> Subject: Re: speed of BooleanQueries on 2.9
> >>
> >>
> >> Mike, we are definitely hitting something with this one!
> >>
> >> we had report from our QA chaps that our servers got stuck (limit is on 180
> >> Seconds Request)... We are on average 14 Requsts per second has nothing to
> >> do with gc() as
> >> we can repeat it with freshly restarted searcher.
> >>
> >> - it happens on a less than 0.1% of queries, not much of a  pattern,
> >> repeatable on our index...
> >> it is always combination of two expanded tokens (we use
> >> minimumNooShouldMatch)...
> >>
> >> (+(t1 [up to 40 expansions]) +(t2 [up to 40 expansions of t2]))
> >> all tokens are with set boost, and  minNumShouldMatch is set to two
> >>
> >> I cannot provide self-contained test, nor index (contains sensitive data
> >> and is rather big, ~5G)
> >>
> >> I can repeat this test on t1 and t2 with 40 expansions each. even if I
> >> take the most frequent tokens in collection it runs well under one
> >> second...but these two
> >> particular tokens with their "expansions" are making it run forever...
> >>
> >> and yes, if I run t1 plus expansions only, it runs super fast, the same
> >> for t2
> >>
> >> java 1.4U14, tried wit 1.6U6, no changes...
> >>
> >> will report if I dig something out
> >>
> >> partial stack trace while "stuck", cpu is on max:
> >>
> >> org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(Unknown Source)
> >> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
> >> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
> >> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
> >> org.apache

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
On Tue, Jul 14, 2009 at 6:24 PM, eks dev wrote:

> org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(Unknown Source)
> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
> org.apache.lucene.search.Searcher.search(Unknown Source)

This stack trace also confirms you are somehow using BooleanScorer,
but I don't see how that query can do that.  Hmm.

Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



[REMINDER] NYC Meetup July 22nd

2009-07-15 Thread Grant Ingersoll

For those in NYC, there will be a Lucene ecosystem (Lucene/Solr/Mahout/
Nutch/Tika/Droids/Lucene ports) Meetup on July 22, hosted by MTV
Networks and co-sponsored with Lucid Imagination.

For more info and to RSVP, see
http://www.meetup.com/NYC-Apache-Lucene-Solr-Meetup/

There is limited seating, so get your spot early.  Note, you must
register with your first and last name so that security badges can be
printed ahead of time for access.

Cheers,
Grant

Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
So now I'm confused.  Since your query has required (+) clauses, the
setAllowDocsOutOfOrder should have no effect, on either 2.4 or trunk.

BooleanQuery only uses BooleanScorer when there are no required terms,
and allowDocsOutOfOrder is true.  So I can't explain why you see this
setting changing anything on this query...
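
To make the rule above concrete, here is a tiny stand-alone sketch in plain
Java. It is not Lucene's code: ScorerChoice, Kind and choose are made-up names
that only encode the two conditions stated above (no required clauses, and
allowDocsOutOfOrder set to true).

```java
// Hypothetical illustration only: none of these names are part of Lucene.
final class ScorerChoice {
    enum Kind { BOOLEAN_SCORER, BOOLEAN_SCORER2 }

    // Encodes the rule as stated in this mail: the out-of-order BooleanScorer
    // is picked only when out-of-order collection is allowed AND the query has
    // no required (+) clauses. In every other case BooleanScorer2 is used.
    static Kind choose(boolean allowDocsOutOfOrder, int requiredClauseCount) {
        if (allowDocsOutOfOrder && requiredClauseCount == 0) {
            return Kind.BOOLEAN_SCORER;
        }
        return Kind.BOOLEAN_SCORER2;
    }
}
```

Under this reading, a query with two required clauses, choose(true, 2), would
give BOOLEAN_SCORER2 whatever the flag says, which is why a change in
behaviour is hard to explain.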

Mike

On Tue, Jul 14, 2009 at 7:04 PM, eks dev wrote:
>
> I do not know exactly why, but
> when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, but 
> with setAllowDocsOutOfOrder(false);  no problems whatsoever
>
> not really scientific method to find such bug, but does the job and makes me 
> happy.
>
> Empirical, "deprecated methods are not to be taken as thoroughly tested, as 
> they have short life expectancy"
>
>
>
>
>
> - Original Message 
>> From: eks dev 
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, 15 July, 2009 0:24:43
>> Subject: Re: speed of BooleanQueries on 2.9
>>
>>
>> Mike, we are definitely hitting something with this one!
>>
>> we had report from our QA chaps that our servers got stuck (limit is on 180
>> Seconds Request)... We are on average 14 Requsts per second has nothing 
>> to
>> do with gc() as
>> we can repeat it with freshly restarted searcher.
>>
>> - it happens on a less than 0.1% of queries, not much of a  pattern, 
>> repeatable
>> on our index...
>> it is always combination of two expanded tokens (we use
>> minimumNooShouldMatch)...
>>
>> (+(t1 [up to 40 expansions]) +(t2 [up to 40 expansions of t2]))
>> all tokens are with set boost, and  minNumShouldMatch is set to two
>>
>> I cannot provide self-contained test, nor index (contains sensitive data and 
>> is
>> rather big, ~5G)
>>
>> I can repeat this test on t1 and t2 with 40 expansions each. even if I take 
>> the
>> most frequent tokens in collection it runs well under one second...but these 
>> two
>> particular tokens with their "expansions" are making it run forever...
>>
>> and yes, if I run t1 plus expansions only, it runs super fast, the same for 
>> t2
>>
>> java 1.4U14, tried wit 1.6U6, no changes...
>>
>> will report if I dig something out
>>
>> partial stack trace while "stuck", cpu is on max:
>>
>> org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(Unknown
>> Source)
>> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
>> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
>> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
>> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
>> org.apache.lucene.search.Searcher.search(Unknown Source)
>>
>>
>>
>>
>>
>> - Original Message 
>> > From: eks dev
>> > To: java-user@lucene.apache.org
>> > Sent: Monday, 13 July, 2009 13:28:45
>> > Subject: Re: speed of BooleanQueries on 2.9
>> >
>> > Hi Mike,
>> >
>> > getMaxNumOfCandidates() in test was 200, Index is optimised and read-only
>> >
>> > We found (due to an error in our warm-up code, funny) that only this Query
>> > runs slower on 2.9.
>> >
>> > A hint where to look could be that this Query cointains two, the most
>> > frequent tokens in two particular fields
>> > NAME:hans and ZIPS:berlin (index has ca 80Mio very short documents, 3Mio
>> > unique terms)
>> >
>> > But all of this *could be just wrong measurement*, I just could not spend
>> > more time to get to the bottom of this. We moved forward as we got overall
>> > better average performance (sweet 10% in average) on much bigger real
>> > query log from our regression test.
>> >
>> > Anyhow I just wanted to throw it out, maybe it triggers some synapses :) If
>> > false alarm, sorry.
>> >
>> >
>> >
>> >
>> >
>> > - Original Message 
>> > > From: Michael McCandless
>> > > To: java-user@lucene.apache.org
>> > > Sent: Monday, 13 July, 2009 11:50:48
>> > > Subject: Re: speed of BooleanQueries on 2.9
>> > >
>> > > This is not expected; 2.9 has had a number of changes that ought to
>> > > reduce CPU cost of searching.  If this holds up we definitely need to
>> > > get to the root cause.
>> > >
>> > > Did your test exclude the warmup query for both 2.4.1 & 2.9?  How many
>> > > segments in the index?  What is the actual value of
>> > > getMaxNumOfCandidates()?  If you simplify the query down (eg just do
>> > > the NAME clause or the ZIPSS clause, alone) are those also 4X slower?
>> > >
>> > > Mike
>> > >
>> > > On Sun, Jul 12, 2009 at 12:53 PM, eks dev wrote:
>> > > >
>> > > > Is it possible that the same BooleanQuery on 2.9 runs significantly
>> > > > slower than on 2.4?
>> > > >
>> > > > we have some strange effects where the following query runs approx
>> > > > 4(ouch!) times slower on 2.9, test done by 1000 times executing the
>> > > > same Query... But! if I run test from some real Query log with mixed
>> > > > Queries, I get almost the same results (?!), even slightly faster
>> > > > on 2.9 !?
>> > > >
>> > > >
>> > > > Query:
>> > > > +((NAME:hans NAME:hahns^0.23232001 NAM

Re: Stream field values

2009-07-15 Thread Günter Ladwig

Hi,

thanks for your answer. I know about lazy loading fields, but my  
question is whether fields are always loaded as a whole or if it is  
possible in some way to stream a field's contents.
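
For reference, "reading X at a time" from any java.io.Reader looks like the
sketch below. It only shows the generic pattern: ChunkedRead and consume are
made-up names, and the sketch does not assume that a stored Lucene field can
actually be consumed this way, which is exactly the open question. Note that a
Reader hands out chars rather than bytes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of Lucene.
final class ChunkedRead {
    // Reads the reader to the end, at most chunkSize chars per call,
    // and returns the total number of chars seen.
    static long consume(Reader reader, int chunkSize) throws IOException {
        char[] buf = new char[chunkSize];
        long total = 0;
        int n;
        while ((n = reader.read(buf, 0, buf.length)) != -1) {
            // buf[0..n) holds the current chunk; process it here
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for whatever Reader the field would provide.
        System.out.println(consume(new StringReader("abcdefghij"), 4));
    }
}
```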


Regards,
Günter
--
Dipl.-Inform. Günter Ladwig
Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
Phone +49 (0)721 608 7946 Building 11.40, Room 250
g...@aifb.uni-karlsruhe.de www.aifb.uni-karlsruhe.de



On 14.07.2009, at 19:21, Grant Ingersoll wrote:

Have a look at the FieldSelector and the Lazy load capability.  See
http://www.lucidimagination.com/search/?q=FieldSelector for some pointers.


-Grant
On Jul 14, 2009, at 11:12 AM, Günter Ladwig wrote:


Hi,

I have a situation, where stored, un-indexed fields can contain  
potentially large amounts of data. Is it possibly to read the  
contents of a field incrementally? That is, do not load the  
complete contents from disk, but read X bytes at a time. Does the  
Reader returned by Field.readerValue() work that way? Or is this  
only possible with tokenized fields?


Thanks!

Regards,
Günter
--
Dipl.-Inform. Günter Ladwig
Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
Phone +49 (0)721 608 7946 Building 11.40, Room 250
g...@aifb.uni-karlsruhe.de www.aifb.uni-karlsruhe.de







--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search









Re: speed of BooleanQueries on 2.9

2009-07-15 Thread eks dev

something weird happening w/ BooleanScorer...

indeed, my first impression was a JVM bug triggered under some rare conditions...
but we tried an old JVM (1.5), the latest 1.6 U14, and -client instead of -Xbatch
-server... no changes

We never managed to wait long enough to see it finish, so I am not sure if we are
talking about some dead loop, or just extremely slow

the good thing is that we can reproduce it, I asked our QA to keep an exact copy
of this index. if we can help somehow just let me know





- Original Message 
> From: Michael McCandless 
> To: java-user@lucene.apache.org
> Sent: Wednesday, 15 July, 2009 13:30:22
> Subject: Re: speed of BooleanQueries on 2.9
> 
> On Tue, Jul 14, 2009 at 7:04 PM, eks dev wrote:
> >
> > I do not know exactly why, but
> > when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, but
> > with setAllowDocsOutOfOrder(false);  no problems whatsoever
>
> That toggles between using BooleanScorer vs BooleanScorer2.
>
> The odd thing is it's especially queries like yours (many OR'd terms)
> that BooleanScorer's performance should shine compared to
> BooleanScorer2.
> 
> Yet you're seeing something weird happening w/ BooleanScorer.
> 
> Mike
> 








Re: Custom FieldComparator and incorrect sort order

2009-07-15 Thread Michael McCandless
On Wed, Jul 15, 2009 at 7:51 AM, Shalin Shekhar Mangar wrote:
> On Wed, Jul 15, 2009 at 4:49 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> OK I opened & fixed https://issues.apache.org/jira/browse/LUCENE-1744.
>>
>>
> Wow, that was fast! Thanks!

Well, I had the easy part ;)  You had the hard part!  Thanks.

Mike




Re: Custom FieldComparator and incorrect sort order

2009-07-15 Thread Shalin Shekhar Mangar
On Wed, Jul 15, 2009 at 4:49 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> OK I opened & fixed https://issues.apache.org/jira/browse/LUCENE-1744.
>
>
Wow, that was fast! Thanks!

-- 
Regards,
Shalin Shekhar Mangar.


Re: Index doubling in size when adding extra terms

2009-07-15 Thread Michael McCandless
It looks like your "text_substrings" field will have many more unique
terms than the original text, right?  And, since it's indexed (I
assume), the docIDs will in fact be stored twice (once in the postings
for your orig text and once in the postings for text_substrings).  So
I think it's expected that the postings (*.frq/.prx/.tis/.tii) would
at least double in size.
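
To see how many extra terms such a field carries, the expansion described in
the quoted mail can be sketched in plain Java. This is only a guess at the
approach: SuffixTerms, expand and the minimum suffix length of 3 are
assumptions inferred from the example ("ick", "own", "ogs"), not code from the
original post.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the original post.
final class SuffixTerms {
    // Assumed minimum suffix length, inferred from the quoted example.
    static final int MIN_LEN = 3;

    // Emits every whitespace-separated token followed by each of its proper
    // suffixes that is at least MIN_LEN characters long.
    static String expand(String text) {
        List<String> out = new ArrayList<String>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            out.add(token); // the full token is always kept
            for (int start = 1; token.length() - start >= MIN_LEN; start++) {
                out.add(token.substring(start));
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(expand("the quick brown fox jumped over the lazy dogs"));
    }
}
```

For the sample sentence the 9 original tokens become 19 terms, each of which
gets its own postings entry for the document.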

Mike

On Wed, Jul 15, 2009 at 6:48 AM, Gregory Tarr wrote:
> I have added a new field to each document in my index containing
> substrings of another field to speed up initial-wildcard searches.
>
> Each document has a field "text" which might contain "the quick brown
> fox jumped over the lazy dogs"
> The new field - "text_substrings" would then contain "the quick uick ick
> brown rown own fox jumped umped mped ped over ver the lazy azy dogs ogs"
>
> This allows me to convert initial wildcard queries "*own" into a term
> query "own".
>
> However adding this field has exactly doubled the size of the index.
> Given that the term list is a small fraction of the index (?), I find
> this strange. I think it might be storing the documents twice.
>
> Is there any way to stop this from happening?
>
> Thanks
>
> Greg Tarr
>
>
>
>
>
>




Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
On Tue, Jul 14, 2009 at 7:04 PM, eks dev wrote:
>
> I do not know exactly why, but
> when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, but 
> with setAllowDocsOutOfOrder(false);  no problems whatsoever

That toggles between using BooleanScorer vs BooleanScorer2.

The odd thing is it's especially queries like yours (many OR'd terms)
that BooleanScorer's performance should shine compared to
BooleanScorer2.

Yet you're seeing something weird happening w/ BooleanScorer.

Mike




Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Michael McCandless
OK I'll dig on this one.  Maybe I can repro w/ a Wikipedia index.

Mike

On Tue, Jul 14, 2009 at 7:04 PM, eks dev wrote:
>
> I do not know exactly why, but
> when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, but 
> with setAllowDocsOutOfOrder(false);  no problems whatsoever
>
> not really scientific method to find such bug, but does the job and makes me 
> happy.
>
> Empirical, "deprecated methods are not to be taken as thoroughly tested, as 
> they have short life expectancy"
>
>
>
>
>
> - Original Message 
>> From: eks dev 
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, 15 July, 2009 0:24:43
>> Subject: Re: speed of BooleanQueries on 2.9
>>
>>
>> Mike, we are definitely hitting something with this one!
>>
>> we had report from our QA chaps that our servers got stuck (limit is on 180
>> Seconds Request)... We are on average 14 Requsts per second has nothing 
>> to
>> do with gc() as
>> we can repeat it with freshly restarted searcher.
>>
>> - it happens on a less than 0.1% of queries, not much of a  pattern, 
>> repeatable
>> on our index...
>> it is always combination of two expanded tokens (we use
>> minimumNooShouldMatch)...
>>
>> (+(t1 [up to 40 expansions]) +(t2 [up to 40 expansions of t2]))
>> all tokens are with set boost, and  minNumShouldMatch is set to two
>>
>> I cannot provide self-contained test, nor index (contains sensitive data and 
>> is
>> rather big, ~5G)
>>
>> I can repeat this test on t1 and t2 with 40 expansions each. even if I take 
>> the
>> most frequent tokens in collection it runs well under one second...but these 
>> two
>> particular tokens with their "expansions" are making it run forever...
>>
>> and yes, if I run t1 plus expansions only, it runs super fast, the same for 
>> t2
>>
>> java 1.4U14, tried wit 1.6U6, no changes...
>>
>> will report if I dig something out
>>
>> partial stack trace while "stuck", cpu is on max:
>>
>> org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(Unknown
>> Source)
>> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
>> org.apache.lucene.search.BooleanScorer.score(Unknown Source)
>> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
>> org.apache.lucene.search.IndexSearcher.search(Unknown Source)
>> org.apache.lucene.search.Searcher.search(Unknown Source)
>>
>>
>>
>>
>>
>> - Original Message 
>> > From: eks dev
>> > To: java-user@lucene.apache.org
>> > Sent: Monday, 13 July, 2009 13:28:45
>> > Subject: Re: speed of BooleanQueries on 2.9
>> >
>> > Hi Mike,
>> >
>> > getMaxNumOfCandidates() in test was 200, Index is optimised and read-only
>> >
>> > We found (due to an error in our warm-up code, funny) that only this Query
>> > runs slower on 2.9.
>> >
>> > A hint where to look could be that this Query cointains two, the most
>> > frequent tokens in two particular fields
>> > NAME:hans and ZIPS:berlin (index has ca 80Mio very short documents, 3Mio
>> > unique terms)
>> >
>> > But all of this *could be just wrong measurement*, I just could not spend
>> > more time to get to the bottom of this. We moved forward as we got overall
>> > better average performance (sweet 10% in average) on much bigger real
>> > query log from our regression test.
>> >
>> > Anyhow I just wanted to throw it out, maybe it triggers some synapses :) If
>> > false alarm, sorry.
>> >
>> >
>> >
>> >
>> >
>> > - Original Message 
>> > > From: Michael McCandless
>> > > To: java-user@lucene.apache.org
>> > > Sent: Monday, 13 July, 2009 11:50:48
>> > > Subject: Re: speed of BooleanQueries on 2.9
>> > >
>> > > This is not expected; 2.9 has had a number of changes that ought to
>> > > reduce CPU cost of searching.  If this holds up we definitely need to
>> > > get to the root cause.
>> > >
>> > > Did your test exclude the warmup query for both 2.4.1 & 2.9?  How many
>> > > segments in the index?  What is the actual value of
>> > > getMaxNumOfCandidates()?  If you simplify the query down (eg just do
>> > > the NAME clause or the ZIPSS clause, alone) are those also 4X slower?
>> > >
>> > > Mike
>> > >
>> > > On Sun, Jul 12, 2009 at 12:53 PM, eks dev wrote:
>> > > >
>> > > > Is it possible that the same BooleanQuery on 2.9 runs significantly
>> > > > slower than on 2.4?
>> > > >
>> > > > we have some strange effects where the following query runs approx
>> > > > 4(ouch!) times slower on 2.9, test done by 1000 times executing the
>> > > > same Query... But! if I run test from some real Query log with mixed
>> > > > Queries, I get almost the same results (?!), even slightly faster
>> > > > on 2.9 !?
>> > > >
>> > > >
>> > > > Query:
>> > > > +((NAME:hans NAME:hahns^0.23232001 NAME:hams^0.27648002 
>> > > > NAME:hamz^0.25392
>> > > NAME:hanas^0.18722998 NAME:hanbs^0.18722998 NAME:hanfs^0.18722998
>> > > NAME:hangs^0.18722998 NAME:hanhs^0.24030754 NAME:hanis^0.18722998
>> > > NAME:hanjs^0.18722998 NAME:hanks^0.18722998 NAME:hanms^0.18

Re: Custom FieldComparator and incorrect sort order

2009-07-15 Thread Michael McCandless
OK I opened & fixed https://issues.apache.org/jira/browse/LUCENE-1744.

Thanks Shalin!

Mike

On Wed, Jul 15, 2009 at 7:04 AM, Michael
McCandless wrote:
> OK this is a bug in BooleanScorer2!  I'll open it shortly... thanks Shalin!
>
> Mike
>
> On Wed, Jul 15, 2009 at 6:32 AM, Michael
> McCandless wrote:
>> I'll look into this...
>>
>> Mike
>>
>> On Wed, Jul 15, 2009 at 3:55 AM, Shalin Shekhar
>> Mangar wrote:
>>> Hello,
>>>
>>> Over in Solr land, I'm facing a problem while upgrading the lucene version
>>> to trunk. Solr has a QueryElevationComponent which is used to boost certain
>>> documents to the top. It pre-processes the query to add a few boolean
>>> clauses of its own and uses a FieldComparator for the sorting part. This
>>> worked fine before the upgrade. There's a test which fixes the position of
>>> two docs and then sorts on score ascending. After the upgrade, the score asc
>>> does not seem to take effect and documents are sorted by score descending.
>>>
>>> I've tried to remove the solr baggage in the following code. Changing the
>>> score sort to ascending/descending gives the exact same order of the
>>> results. Any ideas on what may be the problem?
>>>
>>> package org.apache.solr;
>>>
>>> import org.apache.lucene.analysis.WhitespaceAnalyzer;
>>> import org.apache.lucene.document.Document;
>>> import org.apache.lucene.document.Field;
>>> import org.apache.lucene.index.IndexReader;
>>> import org.apache.lucene.index.IndexWriter;
>>> import org.apache.lucene.index.Term;
>>> import org.apache.lucene.search.*;
>>> import org.apache.lucene.store.RAMDirectory;
>>> import org.junit.Test;
>>>
>>> import java.io.IOException;
>>> import java.util.HashMap;
>>> import java.util.Map;
>>>
>>> public class TestSort {
>>>
>>>  private final Map<String, Integer> priority = new HashMap<String, Integer>();
>>>
>>>  @Test
>>>  public void testSorting() throws IOException {
>>>    RAMDirectory directory = new RAMDirectory();
>>>    IndexWriter writer = new IndexWriter(directory, new
>>> WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
>>>    writer.setMaxBufferedDocs(2);
>>>    writer.setMergeFactor(1000);
>>>    writer.addDocument(adoc("id", "a", "title", "ipod", "str_s", "a"));
>>>    writer.addDocument(adoc("id", "b", "title", "ipod ipod", "str_s", "b"));
>>>    writer.addDocument(adoc("id", "c", "title", "ipod ipod ipod", "str_s",
>>> "c"));
>>>    writer.addDocument(adoc("id", "x", "title", "boosted", "str_s", "x"));
>>>    writer.addDocument(adoc("id", "y", "title", "boosted boosted", "str_s",
>>> "y"));
>>>    writer.addDocument(adoc("id", "z", "title", "boosted boosted boosted",
>>> "str_s", "z"));
>>>    writer.close();
>>>
>>>    IndexSearcher searcher = new IndexSearcher(directory, true);
>>>    BooleanQuery newq = new BooleanQuery(false);
>>>    TermQuery query = new TermQuery(new Term("title", "ipod"));
>>>
>>>    newq.add(query, BooleanClause.Occur.SHOULD);
>>>    newq.add(getElevatedQuery("id", "a", "id", "x"),
>>> BooleanClause.Occur.SHOULD);
>>>
>>>    Sort sort = new Sort(new SortField[]{
>>>            new SortField("id", new ElevationComparatorSource(priority),
>>> false),
>>>            new SortField(null, SortField.SCORE, true)
>>>    });
>>>    TopDocsCollector topCollector = TopFieldCollector.create(sort, 50,
>>> false, true, true, true);
>>>    searcher.search(newq, null, topCollector);
>>>
>>>    TopDocs topDocs = topCollector.topDocs(0, 10);
>>>    int nDocsReturned = topDocs.scoreDocs.length;
>>>
>>>    int[] ids = new int[nDocsReturned];
>>>    float[] scores = new float[nDocsReturned];
>>>    Document[] documents = new Document[nDocsReturned];
>>>    for (int i = 0; i < nDocsReturned; i++) {
>>>      ScoreDoc scoreDoc = topDocs.scoreDocs[i];
>>>      ids[i] = scoreDoc.doc;
>>>      scores[i] = scoreDoc.score;
>>>      documents[i] = searcher.doc(ids[i]);
>>>      System.out.println("documents[i] = " + documents[i]);
>>>      System.out.println("scores[i] = " + scores[i]);
>>>    }
>>>
>>>    searcher.close();
>>>  }
>>>
>>>  private Query getElevatedQuery(String... vals) {
>>>    BooleanQuery q = new BooleanQuery(false);
>>>    q.setBoost(0);
>>>    int max = (vals.length / 2) + 5;
>>>    for (int i = 0; i < vals.length - 1; i += 2) {
>>>      q.add(new TermQuery(new Term(vals[i], vals[i + 1])),
>>> BooleanClause.Occur.SHOULD);
>>>      priority.put(vals[i + 1], max--);
>>>    }
>>>    return q;
>>>  }
>>>
>>>  private Document adoc(String... vals) {
>>>    Document doc = new Document();
>>>    for (int i = 0; i < vals.length - 2; i += 2) {
>>>      doc.add(new Field(vals[i], vals[i + 1], Field.Store.YES,
>>> Field.Index.ANALYZED));
>>>    }
>>>    return doc;
>>>  }
>>> }
>>>
>>> class ElevationComparatorSource extends FieldComparatorSource {
>>>  private final Map<String, Integer> priority;
>>>
>>>  public ElevationComparatorSource(final Map<String, Integer> boosts) {
>>>    this.priority = boosts;
>>>  }
>>>
>>>  public FieldComparator newComparator(final String fieldname, final int
>>> numHits, int sortPos, boolean 

Re: Custom FieldComparator and incorrect sort order

2009-07-15 Thread Michael McCandless
OK this is a bug in BooleanScorer2!  I'll open it shortly... thanks Shalin!

Mike

On Wed, Jul 15, 2009 at 6:32 AM, Michael
McCandless wrote:
> I'll look into this...
>
> Mike
>
> On Wed, Jul 15, 2009 at 3:55 AM, Shalin Shekhar
> Mangar wrote:
>> Hello,
>>
>> Over in Solr land, I'm facing a problem while upgrading the lucene version
>> to trunk. Solr has a QueryElevationComponent which is used to boost certain
>> documents to the top. It pre-processes the query to add a few boolean
>> clauses of its own and uses a FieldComparator for the sorting part. This
>> worked fine before the upgrade. There's a test which fixes the position of
>> two docs and then sorts on score ascending. After the upgrade, the score asc
>> does not seem to take effect and documents are sorted by score descending.
>>
>> I've tried to remove the solr baggage in the following code. Changing the
>> score sort to ascending/descending gives the exact same order of the
>> results. Any ideas on what may be the problem?
>>
>> package org.apache.solr;
>>
>> import org.apache.lucene.analysis.WhitespaceAnalyzer;
>> import org.apache.lucene.document.Document;
>> import org.apache.lucene.document.Field;
>> import org.apache.lucene.index.IndexReader;
>> import org.apache.lucene.index.IndexWriter;
>> import org.apache.lucene.index.Term;
>> import org.apache.lucene.search.*;
>> import org.apache.lucene.store.RAMDirectory;
>> import org.junit.Test;
>>
>> import java.io.IOException;
>> import java.util.HashMap;
>> import java.util.Map;
>>
>> public class TestSort {
>>
>>  private final Map<String, Integer> priority = new HashMap<String, Integer>();
>>
>>  @Test
>>  public void testSorting() throws IOException {
>>    RAMDirectory directory = new RAMDirectory();
>>    IndexWriter writer = new IndexWriter(directory, new
>> WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
>>    writer.setMaxBufferedDocs(2);
>>    writer.setMergeFactor(1000);
>>    writer.addDocument(adoc("id", "a", "title", "ipod", "str_s", "a"));
>>    writer.addDocument(adoc("id", "b", "title", "ipod ipod", "str_s", "b"));
>>    writer.addDocument(adoc("id", "c", "title", "ipod ipod ipod", "str_s",
>> "c"));
>>    writer.addDocument(adoc("id", "x", "title", "boosted", "str_s", "x"));
>>    writer.addDocument(adoc("id", "y", "title", "boosted boosted", "str_s",
>> "y"));
>>    writer.addDocument(adoc("id", "z", "title", "boosted boosted boosted",
>> "str_s", "z"));
>>    writer.close();
>>
>>    IndexSearcher searcher = new IndexSearcher(directory, true);
>>    BooleanQuery newq = new BooleanQuery(false);
>>    TermQuery query = new TermQuery(new Term("title", "ipod"));
>>
>>    newq.add(query, BooleanClause.Occur.SHOULD);
>>    newq.add(getElevatedQuery("id", "a", "id", "x"),
>> BooleanClause.Occur.SHOULD);
>>
>>    Sort sort = new Sort(new SortField[]{
>>            new SortField("id", new ElevationComparatorSource(priority),
>> false),
>>            new SortField(null, SortField.SCORE, true)
>>    });
>>    TopDocsCollector topCollector = TopFieldCollector.create(sort, 50,
>> false, true, true, true);
>>    searcher.search(newq, null, topCollector);
>>
>>    TopDocs topDocs = topCollector.topDocs(0, 10);
>>    int nDocsReturned = topDocs.scoreDocs.length;
>>
>>    int[] ids = new int[nDocsReturned];
>>    float[] scores = new float[nDocsReturned];
>>    Document[] documents = new Document[nDocsReturned];
>>    for (int i = 0; i < nDocsReturned; i++) {
>>      ScoreDoc scoreDoc = topDocs.scoreDocs[i];
>>      ids[i] = scoreDoc.doc;
>>      scores[i] = scoreDoc.score;
>>      documents[i] = searcher.doc(ids[i]);
>>      System.out.println("documents[i] = " + documents[i]);
>>      System.out.println("scores[i] = " + scores[i]);
>>    }
>>
>>    searcher.close();
>>  }
>>
>>  private Query getElevatedQuery(String... vals) {
>>    BooleanQuery q = new BooleanQuery(false);
>>    q.setBoost(0);
>>    int max = (vals.length / 2) + 5;
>>    for (int i = 0; i < vals.length - 1; i += 2) {
>>      q.add(new TermQuery(new Term(vals[i], vals[i + 1])),
>> BooleanClause.Occur.SHOULD);
>>      priority.put(vals[i + 1], max--);
>>    }
>>    return q;
>>  }
>>
>>  private Document adoc(String... vals) {
>>    Document doc = new Document();
>>    for (int i = 0; i < vals.length - 2; i += 2) {
>>      doc.add(new Field(vals[i], vals[i + 1], Field.Store.YES,
>> Field.Index.ANALYZED));
>>    }
>>    return doc;
>>  }
>> }
>>
>> class ElevationComparatorSource extends FieldComparatorSource {
>>  private final Map<String, Integer> priority;
>>
>>  public ElevationComparatorSource(final Map<String, Integer> boosts) {
>>    this.priority = boosts;
>>  }
>>
>>  public FieldComparator newComparator(final String fieldname, final int
>> numHits, int sortPos, boolean reversed) throws IOException {
>>    return new FieldComparator() {
>>
>>      FieldCache.StringIndex idIndex;
>>      private final int[] values = new int[numHits];
>>      int bottomVal;
>>
>>      public int compare(int slot1, int slot2) {
>>        return values[slot2] - values[slot1

RE: Index doubling in size when adding extra terms

2009-07-15 Thread Uwe Schindler
Is Field.Store.NO used for the text_substrings field? :-) A stored field
keeps a full copy of its text in the index, on top of the indexed terms.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Gregory Tarr [mailto:gregory.t...@detica.com]
> Sent: Wednesday, July 15, 2009 12:49 PM
> To: java-user@lucene.apache.org
> Subject: Index doubling in size when adding extra terms
> 
> I have added a new field to each document in my index containing
> substrings of another field to speed up initial-wildcard searches.
> 
> Each document has a field "text" which might contain "the quick brown
> fox jumped over the lazy dogs"
> The new field - "text_substrings" would then contain "the quick uick ick
> brown rown own fox jumped umped mped ped over ver the lazy azy dogs ogs"
> 
> This allows me to convert initial wildcard queries "*own" into a term
> query "own".
> 
> However adding this field has exactly doubled the size of the index.
> Given that the term list is a small fraction of the index (?), I find
> this strange. I think it might be storing the documents twice.
> 
> Is there any way to stop this from happening?
> 
> Thanks
> 
> Greg Tarr
> 
> 
> 
> 
> This message should be regarded as confidential. If you have received this
> email in error please notify the sender and destroy it immediately.
> Statements of intent shall only become binding when confirmed in hard copy
> by an authorised signatory.  The contents of this email may relate to
> dealings with other companies within the Detica Limited group of
> companies.
> 
> Detica Limited is registered in England under No: 1337451.
> 
> Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP,
> England.






Index doubling in size when adding extra terms

2009-07-15 Thread Gregory Tarr
I have added a new field to each document in my index containing
substrings of another field to speed up initial-wildcard searches.

Each document has a field "text" which might contain "the quick brown
fox jumped over the lazy dogs"
The new field - "text_substrings" would then contain "the quick uick ick
brown rown own fox jumped umped mped ped over ver the lazy azy dogs ogs"

This allows me to convert initial wildcard queries "*own" into a term
query "own".
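
For illustration, here is a minimal sketch of how the "text_substrings"
value could be built. It assumes suffixes are kept down to a minimum length
of 3 characters, which is inferred from the example above rather than stated.
The class and method names are hypothetical, and the code uses only the JDK,
not Lucene.

```java
// Hypothetical helper; not part of Lucene.
final class SuffixTerms {

    // Assumed minimum suffix length, inferred from the example text.
    static final int MIN_LEN = 3;

    // Expands every whitespace-separated word into its suffixes of at
    // least MIN_LEN characters; shorter words are kept whole.
    static String suffixes(String text) {
        StringBuilder sb = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            int last = Math.max(0, word.length() - MIN_LEN);
            for (int start = 0; start <= last; start++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(word.substring(start));
            }
        }
        return sb.toString();
    }

    // Turns a pure leading-wildcard query such as "*own" into the plain
    // term "own"; returns null for any other kind of query.
    static String leadingWildcardToTerm(String query) {
        if (query.length() > 1 && query.charAt(0) == '*') {
            String rest = query.substring(1);
            if (rest.indexOf('*') < 0 && rest.indexOf('?') < 0) {
                return rest;
            }
        }
        return null;
    }
}
```

With this scheme a query whose suffix is shorter than the minimum length
(for example "*wn") would not match any suffix term, so such queries would
still need the ordinary wildcard path.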

However adding this field has exactly doubled the size of the index.
Given that the term list is a small fraction of the index (?), I find
this strange. I think it might be storing the documents twice.

Is there any way to stop this from happening?

Thanks

Greg Tarr







Re: re-ranking ....

2009-07-15 Thread KK
Fetch all the search results along with the values of all the terms used
for scoring, then work with those values and re-rank the results to your
heart's content.
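
As a rough sketch of that idea, assuming the hits have already been copied
out of the result set into plain objects: the Hit class, its "featured"
attribute and the example rule below are all made up for illustration, and
nothing here calls into Lucene.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for one search hit; the fields are illustrative only.
final class Hit {
    final int doc;          // document number from the original result set
    final float score;      // score assigned by the search engine
    final boolean featured; // made-up attribute that a custom rule might use

    Hit(int doc, float score, boolean featured) {
        this.doc = doc;
        this.score = score;
        this.featured = featured;
    }
}

final class ReRanker {

    // Returns a re-ordered copy; the original list is left untouched.
    static List<Hit> reRank(List<Hit> hits, Comparator<Hit> rule) {
        List<Hit> copy = new ArrayList<Hit>(hits);
        Collections.sort(copy, rule);
        return copy;
    }

    // Example rule: featured hits first, then higher score first.
    static Comparator<Hit> featuredThenScore() {
        return new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                if (a.featured != b.featured) {
                    return a.featured ? -1 : 1;
                }
                return Float.compare(b.score, a.score);
            }
        };
    }
}
```

Any other rule can be plugged in by passing a different Comparator to
reRank.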

--kk

On Wed, Jul 15, 2009 at 11:28 AM, henok sahilu wrote:

> What I want to do is re-rank the Lucene result set based on an algorithm
> that I will write.
> I have some rules, and based on these rules I want the Lucene result set to
> be reordered.
> Thanks


Re: Custom FieldComparator and incorrect sort order

2009-07-15 Thread Michael McCandless
I'll look into this...

Mike

On Wed, Jul 15, 2009 at 3:55 AM, Shalin Shekhar
Mangar wrote:
> Hello,
>
> Over in Solr land, I'm facing a problem while upgrading the lucene version
> to trunk. Solr has a QueryElevationComponent which is used to boost certain
> documents to the top. It pre-processes the query to add a few boolean
> clauses of its own and uses a FieldComparator for the sorting part. This
> worked fine before the upgrade. There's a test which fixes the position of
> two docs and then sorts on score ascending. After the upgrade, the score asc
> does not seem to take effect and documents are sorted by score descending.
>
> I've tried to remove the solr baggage in the following code. Changing the
> score sort to ascending/descending gives the exact same order of the
> results. Any ideas on what may be the problem?
>
> package org.apache.solr;
>
> import org.apache.lucene.analysis.WhitespaceAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.*;
> import org.apache.lucene.store.RAMDirectory;
> import org.junit.Test;
>
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.Map;
>
> public class TestSort {
>
>  private final Map<String, Integer> priority = new HashMap<String, Integer>();
>
>  @Test
>  public void testSorting() throws IOException {
>    RAMDirectory directory = new RAMDirectory();
>    IndexWriter writer = new IndexWriter(directory, new
> WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
>    writer.setMaxBufferedDocs(2);
>    writer.setMergeFactor(1000);
>    writer.addDocument(adoc("id", "a", "title", "ipod", "str_s", "a"));
>    writer.addDocument(adoc("id", "b", "title", "ipod ipod", "str_s", "b"));
>    writer.addDocument(adoc("id", "c", "title", "ipod ipod ipod", "str_s",
> "c"));
>    writer.addDocument(adoc("id", "x", "title", "boosted", "str_s", "x"));
>    writer.addDocument(adoc("id", "y", "title", "boosted boosted", "str_s",
> "y"));
>    writer.addDocument(adoc("id", "z", "title", "boosted boosted boosted",
> "str_s", "z"));
>    writer.close();
>
>    IndexSearcher searcher = new IndexSearcher(directory, true);
>    BooleanQuery newq = new BooleanQuery(false);
>    TermQuery query = new TermQuery(new Term("title", "ipod"));
>
>    newq.add(query, BooleanClause.Occur.SHOULD);
>    newq.add(getElevatedQuery("id", "a", "id", "x"),
> BooleanClause.Occur.SHOULD);
>
>    Sort sort = new Sort(new SortField[]{
>            new SortField("id", new ElevationComparatorSource(priority),
> false),
>            new SortField(null, SortField.SCORE, true)
>    });
>    TopDocsCollector topCollector = TopFieldCollector.create(sort, 50,
> false, true, true, true);
>    searcher.search(newq, null, topCollector);
>
>    TopDocs topDocs = topCollector.topDocs(0, 10);
>    int nDocsReturned = topDocs.scoreDocs.length;
>
>    int[] ids = new int[nDocsReturned];
>    float[] scores = new float[nDocsReturned];
>    Document[] documents = new Document[nDocsReturned];
>    for (int i = 0; i < nDocsReturned; i++) {
>      ScoreDoc scoreDoc = topDocs.scoreDocs[i];
>      ids[i] = scoreDoc.doc;
>      scores[i] = scoreDoc.score;
>      documents[i] = searcher.doc(ids[i]);
>      System.out.println("documents[i] = " + documents[i]);
>      System.out.println("scores[i] = " + scores[i]);
>    }
>
>    searcher.close();
>  }
>
>  private Query getElevatedQuery(String... vals) {
>    BooleanQuery q = new BooleanQuery(false);
>    q.setBoost(0);
>    int max = (vals.length / 2) + 5;
>    for (int i = 0; i < vals.length - 1; i += 2) {
>      q.add(new TermQuery(new Term(vals[i], vals[i + 1])),
> BooleanClause.Occur.SHOULD);
>      priority.put(vals[i + 1], max--);
>    }
>    return q;
>  }
>
>  private Document adoc(String... vals) {
>    Document doc = new Document();
>    for (int i = 0; i < vals.length - 2; i += 2) {
>      doc.add(new Field(vals[i], vals[i + 1], Field.Store.YES,
> Field.Index.ANALYZED));
>    }
>    return doc;
>  }
> }
>
> class ElevationComparatorSource extends FieldComparatorSource {
>  private final Map<String, Integer> priority;
>
>  public ElevationComparatorSource(final Map<String, Integer> boosts) {
>    this.priority = boosts;
>  }
>
>  public FieldComparator newComparator(final String fieldname, final int
> numHits, int sortPos, boolean reversed) throws IOException {
>    return new FieldComparator() {
>
>      FieldCache.StringIndex idIndex;
>      private final int[] values = new int[numHits];
>      int bottomVal;
>
>      public int compare(int slot1, int slot2) {
>        return values[slot2] - values[slot1];  // values will be small
> enough that there is no overflow concern
>      }
>
>      public void setBottom(int slot) {
>        bottomVal = values[slot];
>      }
>
>      private int docVal(int doc) throws IOException {
>        String id = idIndex.lookup[idIndex.order[doc]

Custom FieldComparator and incorrect sort order

2009-07-15 Thread Shalin Shekhar Mangar
Hello,

Over in Solr land, I'm facing a problem while upgrading the Lucene version
to trunk. Solr has a QueryElevationComponent which is used to boost certain
documents to the top. It pre-processes the query to add a few boolean
clauses of its own and uses a FieldComparator for the sorting part. This
worked fine before the upgrade. There's a test which fixes the position of
two docs and then sorts on score ascending. After the upgrade, the score asc
does not seem to take effect and documents are sorted by score descending.

I've tried to remove the Solr baggage in the following code. Changing the
score sort to ascending/descending gives the exact same order of
results. Any ideas on what may be the problem?
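
To make the expected ordering concrete, here is a small model of it in
plain java.util terms: elevated ids first by priority (higher first), then
the rest by score in the requested direction. This is only a sketch of the
intended result, not of how Lucene's collector applies the comparators; the
class name is hypothetical and any scores fed to it are invented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical model of the two-level sort; no Lucene classes involved.
final class ElevationOrderSketch {

    // Orders the ids found in 'scores' by priority (higher first, missing
    // priority counts as 0), breaking ties by score in the given direction.
    static List<String> order(final Map<String, Integer> priority,
                              final Map<String, Float> scores,
                              final boolean scoreAscending) {
        List<String> ids = new ArrayList<String>(scores.keySet());
        Collections.sort(ids, new Comparator<String>() {
            public int compare(String a, String b) {
                int pa = priority.containsKey(a) ? priority.get(a) : 0;
                int pb = priority.containsKey(b) ? priority.get(b) : 0;
                if (pa != pb) {
                    return pa > pb ? -1 : 1;
                }
                int byScore = Float.compare(scores.get(a), scores.get(b));
                return scoreAscending ? byScore : -byScore;
            }
        });
        return ids;
    }
}
```

With priorities a=7 and x=6 (the values getElevatedQuery assigns below) and
two non-elevated docs b and c, flipping the score direction should swap b
and c while a and x stay on top. That swap is what no longer happens after
the upgrade.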

package org.apache.solr;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Test;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class TestSort {

  private final Map<String, Integer> priority = new HashMap<String, Integer>();

  @Test
  public void testSorting() throws IOException {
    RAMDirectory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
        true, IndexWriter.MaxFieldLength.LIMITED);
    writer.setMaxBufferedDocs(2);
    writer.setMergeFactor(1000);
    writer.addDocument(adoc("id", "a", "title", "ipod", "str_s", "a"));
    writer.addDocument(adoc("id", "b", "title", "ipod ipod", "str_s", "b"));
    writer.addDocument(adoc("id", "c", "title", "ipod ipod ipod", "str_s", "c"));
    writer.addDocument(adoc("id", "x", "title", "boosted", "str_s", "x"));
    writer.addDocument(adoc("id", "y", "title", "boosted boosted", "str_s", "y"));
    writer.addDocument(adoc("id", "z", "title", "boosted boosted boosted",
        "str_s", "z"));
    writer.close();

    IndexSearcher searcher = new IndexSearcher(directory, true);
    BooleanQuery newq = new BooleanQuery(false);
    TermQuery query = new TermQuery(new Term("title", "ipod"));

    newq.add(query, BooleanClause.Occur.SHOULD);
    newq.add(getElevatedQuery("id", "a", "id", "x"), BooleanClause.Occur.SHOULD);

    // Primary sort: elevation priority; secondary sort: score, reversed.
    Sort sort = new Sort(new SortField[]{
        new SortField("id", new ElevationComparatorSource(priority), false),
        new SortField(null, SortField.SCORE, true)
    });
    TopDocsCollector topCollector = TopFieldCollector.create(sort, 50,
        false, true, true, true);
    searcher.search(newq, null, topCollector);

    TopDocs topDocs = topCollector.topDocs(0, 10);
    int nDocsReturned = topDocs.scoreDocs.length;

    int[] ids = new int[nDocsReturned];
    float[] scores = new float[nDocsReturned];
    Document[] documents = new Document[nDocsReturned];
    for (int i = 0; i < nDocsReturned; i++) {
      ScoreDoc scoreDoc = topDocs.scoreDocs[i];
      ids[i] = scoreDoc.doc;
      scores[i] = scoreDoc.score;
      documents[i] = searcher.doc(ids[i]);
      System.out.println("documents[i] = " + documents[i]);
      System.out.println("scores[i] = " + scores[i]);
    }

    searcher.close();
  }

  private Query getElevatedQuery(String... vals) {
    BooleanQuery q = new BooleanQuery(false);
    q.setBoost(0);
    int max = (vals.length / 2) + 5;
    for (int i = 0; i < vals.length - 1; i += 2) {
      q.add(new TermQuery(new Term(vals[i], vals[i + 1])),
          BooleanClause.Occur.SHOULD);
      priority.put(vals[i + 1], max--);
    }
    return q;
  }

  private Document adoc(String... vals) {
    Document doc = new Document();
    for (int i = 0; i < vals.length - 2; i += 2) {
      doc.add(new Field(vals[i], vals[i + 1], Field.Store.YES,
          Field.Index.ANALYZED));
    }
    return doc;
  }
}

class ElevationComparatorSource extends FieldComparatorSource {
  private final Map<String, Integer> priority;

  public ElevationComparatorSource(final Map<String, Integer> boosts) {
    this.priority = boosts;
  }

  public FieldComparator newComparator(final String fieldname,
      final int numHits, int sortPos, boolean reversed) throws IOException {
    return new FieldComparator() {

      FieldCache.StringIndex idIndex;
      private final int[] values = new int[numHits];
      int bottomVal;

      public int compare(int slot1, int slot2) {
        // values will be small enough that there is no overflow concern
        return values[slot2] - values[slot1];
      }

      public void setBottom(int slot) {
        bottomVal = values[slot];
      }

      private int docVal(int doc) throws IOException {
        String id = idIndex.lookup[idIndex.order[doc]];
        Integer prio = priority.get(id);
        return prio == null ? 0 : prio.intValue();
      }

      public int compareBottom(int doc) throws IOException {
        return docVal(doc) - bottomVal;
      }

      public void copy(int slot, int doc) throws IOException {