Re: Performance Improvement for Search using PriorityQueue

2007-12-12 Thread Shai Erera
Done (PriorityQueue-2.patch) Shai On Dec 12, 2007 1:46 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > I think it's fine to include those changes with this issue. > > Mike > > Shai Erera wrote: > > > Hi > > > > I created https://issues.apache.org/jira/browse/LUCENE-1089 and > > added a > >

Re: Performance Improvement for Search using PriorityQueue

2007-12-12 Thread Michael McCandless
I think it's fine to include those changes with this issue. Mike Shai Erera wrote: Hi I created https://issues.apache.org/jira/browse/LUCENE-1089 and added a patch. I noticed that we can replace the calls to insert() with insertWithOverflow() in several other places, like QualityQueries

Re: Performance Improvement for Search using PriorityQueue

2007-12-12 Thread Shai Erera
Hi I created https://issues.apache.org/jira/browse/LUCENE-1089 and added a patch. I noticed that we can replace the calls to insert() with insertWithOverflow() in several other places, like QualityQueriesFinder, FuzzyQuery and TopFieldDocCollector. I wasn't sure if that should be handled as part o

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Yonik Seeley
On Dec 11, 2007 1:21 PM, Timo Nentwig <[EMAIL PROTECTED]> wrote: > On Tuesday 11 December 2007 14:32:12 Shai Erera wrote: > > For (1) - I can't explain it but I've run into documents with 0.0f scores. > > For (2) - this is simple logic - if the lowest score in the queue is 'x' > > and you want the

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Timo Nentwig
On Tuesday 11 December 2007 14:32:12 Shai Erera wrote: > For (1) - I can't explain it but I've run into documents with 0.0f scores. > For (2) - this is simple logic - if the lowest score in the queue is 'x' > and you want the top docs only, then there's no point in attempting to > insert a documen

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Michael McCandless
2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> wrote: On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for Search using PriorityQueue": Hi Lucene's PQ implements two methods: put (assumes the PQ has room for the object) and insert (checks whether

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Shai Erera
last March: > > http://www.nabble.com/FieldSortedHitQueue-enhancement- > > to9733550.html#a9733550 > > > > Peter > > > > On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> > > wrote: > > > >> On Mon, Dec 10, 2007, Shai Erera wr

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Michael McCandless
eue-enhancement- to9733550.html#a9733550 Peter On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> wrote: On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for Search using PriorityQueue": Hi Lucene's PQ implements two methods: put

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Michael McCandless
tected not private, plus the 2 tiny performance gains Nadav suggests below? Shai can you open a Jira issue & attach a patch for these changes? Thanks! Mike Nadav Har'El wrote: On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for Search using Priority

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Peter Keegan
See my similar request from last March: http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550 Peter On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> wrote: > On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for > Sear

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Nadav Har'El
On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for Search using PriorityQueue": > Hi > > Lucene's PQ implements two methods: put (assumes the PQ has room for the > object) and insert (checks whether the object can be inserted etc.). The

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Shai Erera
For (1) - I can't explain it but I've run into documents with 0.0f scores. For (2) - this is simple logic - if the lowest score in the queue is 'x' and you want the top docs only, then there's no point in attempting to insert a document with a score lower than 'x' (it will not be added). Maybe I did
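The check described for (2) can be sketched as follows. This is only an illustration under stated assumptions, not the code from the patch: the class name MinScoreGate and its fields are hypothetical, it uses java.util.PriorityQueue rather than Lucene's own queue, and it lets ties with 'x' through to the queue because the message only rules out scores lower than 'x'.

```java
import java.util.PriorityQueue;

/** Hypothetical top-N score collector; not Lucene's TopDocCollector. */
final class MinScoreGate {
    // java.util.PriorityQueue is a min-heap by natural order,
    // so peek() is the lowest score currently kept ('x' in the message).
    private final PriorityQueue<Float> queue = new PriorityQueue<>();
    private final int n;
    int skipped; // how many scores were rejected without touching the heap

    MinScoreGate(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
    }

    void collect(float score) {
        if (queue.size() == n) {
            if (score < queue.peek()) {
                skipped++;      // lower than 'x': it could never be added
                return;
            }
            queue.poll();       // make room by dropping the current lowest
        }
        queue.add(score);
    }

    /** Lowest score currently kept; only meaningful once something was collected. */
    float lowest() {
        return queue.peek();
    }

    public static void main(String[] args) {
        MinScoreGate gate = new MinScoreGate(3);
        for (float s : new float[] {5f, 1f, 9f, 3f, 7f, 2f, 8f}) {
            gate.collect(s);
        }
        System.out.println("lowest kept = " + gate.lowest() + ", skipped = " + gate.skipped);
    }
}
```

The gain from the check is that a rejected score costs one comparison and no heap operation or allocation.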

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Timo Nentwig
On Monday 10 December 2007 09:15:12 Paul Elschot wrote: > The current TopDocCollector only allocates a ScoreDoc when the given > score causes a new ScoreDoc to be added into the queue, but it does I actually wrote my own HitCollector and now wonder about TopDocCollector: public void collect(int

Re: Performance Improvement for Search using PriorityQueue

2007-12-11 Thread Shai Erera
Hi Back from the experiments lab with more results. I've used two indexes (1 and 10 million documents) and ran 2000 queries over both. Each run was executed 4 times and I paste here the average of the last 3 (to eliminate any caching that is done by the OS and to mimic systems that are alread
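The measurement protocol described here (run 4 times, average the last 3) can be sketched like this. The class and method names are hypothetical and the workload is a placeholder; the actual benchmark harness used in the thread is not shown in the message.

```java
/** Hypothetical timing helper; names are illustrative only. */
final class WarmRunAverage {

    /** Runs the workload the given number of times and records each duration in ms. */
    static long[] timeRuns(Runnable workload, int runs) {
        long[] millis = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000L;
        }
        return millis;
    }

    /** Averages the runs that follow the first warmupRuns entries. */
    static double averageAfterWarmup(long[] runMillis, int warmupRuns) {
        if (warmupRuns < 0 || warmupRuns >= runMillis.length) {
            throw new IllegalArgumentException("need at least one measured run");
        }
        long sum = 0;
        for (int i = warmupRuns; i < runMillis.length; i++) {
            sum += runMillis[i];
        }
        return (double) sum / (runMillis.length - warmupRuns);
    }

    public static void main(String[] args) {
        // Placeholder workload; a real test would execute the 2000 queries here.
        long[] runs = timeRuns(() -> { }, 4);
        System.out.println("average of last 3 runs (ms): " + averageAfterWarmup(runs, 1));
    }
}
```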

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Mike Klaas
On 10-Dec-07, at 1:20 PM, Shai Erera wrote: Thanks for the info. Too bad I use Windows ... Just allocate a bunch of memory and free it. This is Linux, but something similar should work on Windows: $ vmstat -S M procs ---memory-- r b swpd free buff cache 0 0 0

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Daniel Naber
On Montag, 10. Dezember 2007, Michael Busch wrote: > Reboot your machine ;-) That's what I usually do - if there's another > way I'd like to know as well! On Linux (kernel 2.6.16 and later), call: sync ; echo 3 > /proc/sys/vm/drop_caches Regards Daniel -- http://www.danielnaber.de -

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
Thanks for the info. Too bad I use Windows ... On Dec 10, 2007 11:16 PM, robert engels <[EMAIL PROTECTED]> wrote: > On linux, use "drop_caches", see http://linux-mm.org/Drop_Caches > On "OS X", use "purge", which is part of the CHUD tools. > On Windows, I think you're hosed. > > On Dec 10, 2007,

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread robert engels
On linux, use "drop_caches", see http://linux-mm.org/Drop_Caches On "OS X", use "purge", which is part of the CHUD tools. On Windows, I think you're hosed. On Dec 10, 2007, at 3:06 PM, Michael Busch wrote: Shai Erera wrote: No I haven't done that (to be honest, I don't know how to do that ...

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
I see :-). It's just that "clear up the disk cache" sounds so professional, I assumed there's a way to do it that I don't know of ... :-) Thanks, I'll report again with a larger index measurement. On Dec 10, 2007 11:06 PM, Michael Busch <[EMAIL PROTECTED]> wrote: > Shai Erera wrote: > > No I have

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Michael Busch
Shai Erera wrote: > No I haven't done that (to be honest, I don't know how to do that ... :-) ). > That's the reason I ran both tests multiple times and reported the last run. > Reboot your machine ;-) That's what I usually do - if there's another way I'd like to know as well! -Michael

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
No I haven't done that (to be honest, I don't know how to do that ... :-) ). That's the reason I ran both tests multiple times and reported the last run. On Dec 10, 2007 10:24 PM, Mike Klaas <[EMAIL PROTECTED]> wrote: > On 10-Dec-07, at 12:11 PM, Shai Erera wrote: > > > Actually, queries on large

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Mike Klaas
On 10-Dec-07, at 12:11 PM, Shai Erera wrote: Actually, queries on large indexes are not necessarily I/O bound. It depends on how much of the posting list is being read into memory at once. I'm not that familiar with the internals of Lucene, but let's assume a posting element takes 4 bytes

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread robert engels
I think you might be underestimating the IO cost a bit. A modern drive does 40-70 MB per second, but that is for sequential reads. The seek time (9 ms is good) can kill you. Because of disk fragmentation, there is no guarantee that the posting is sequential on the disk. Obviously the OS and dr

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
Well .. I suspect this behavior is due to the nature of the index - 100K docs duplicated 10 times. Therefore at some point it hits the same documents (and scores). Like I said, tomorrow I'll re-run the test on a 10M unique docs index. I agree that 80 allocations are not much, but that's per query. The

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
Actually, queries on large indexes are not necessarily I/O bound. It depends on how much of the posting list is being read into memory at once. I'm not that familiar with the internals of Lucene, but let's assume a posting element takes 4 bytes for docId and 2 more bytes per position in a document

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Michael McCandless
Shai Erera wrote: Hi Well, I have results from a 1M index for now (the index contains 100K documents duplicated 10 times, so it's not the final test I'll run, but it still shows something). I ran 2000 short queries (2.4 keywords on average) on a 1M docs index, after 50 queries warm-up. Fol

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread robert engels
As has been pointed out on many threads, a modern JVM really doesn't need you to recycle objects, especially for small short lived objects. It is actually less efficient in many cases (since the variables need to be reinitialized). Using object pools (except when pooling external resources)

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Mike Klaas
On 10-Dec-07, at 11:31 AM, Shai Erera wrote: As you can see, the actual allocation time is really negligible and there isn't much difference in the avg. running times of the queries. However, the *current* runs performed a lot worse at the beginning, before the OS cache warmed up. This s

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
Hi Well, I have results from a 1M index for now (the index contains 100K documents duplicated 10 times, so it's not the final test I'll run, but it still shows something). I ran 2000 short queries (2.4 keywords on average) on a 1M docs index, after 50 queries warm-up. Following are the results: C

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Paul Elschot
On Monday 10 December 2007 09:19:43 Shai Erera wrote: > I agree w.r.t the current implementation, however in the worst case (as we > tend to consider when talking about computer algorithms), it will allocate a > ScoreDoc per result. With the overflow reuse, it will not allocate those > objects, no

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Michael McCandless
OK, sounds like a plan, thanks! Yes, contrib/benchmark has EnwikiDocMaker to generate docs off the Wikipedia XML export file. Mike On Dec 10, 2007, at 7:03 AM, Shai Erera wrote: I have access to TREC. I can try this. W.r.t the large indexes - I don't have access to the data, just scenar

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Doron Cohen
I have a TREC index of 25M docs that I can try this with. Shai, if you can create an issue and upload a patch that I (and others) can experiment with, I will try a few queries on this index with and without your patch... Doron Michael McCandless <[EMAIL PROTECTED]> wrote on 10/12/2007 13:50:53:

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
I have access to TREC. I can try this. W.r.t the large indexes - I don't have access to the data, just scenarios our customers ran into in the past. Does the benchmark package include code to crawl Wikipedia? If not, do you have such code? I don't want to write it from scratch for this specific task.

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Michael McCandless
I don't offhand. Working on the indexing side is so much easier :) You mentioned your experience with large indices & large result sets -- is that something you could draw on? There have also been discussions lately about finding real search logs we could use for exactly this reason, thou

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
Do you have a dataset and queries I can test on? On Dec 10, 2007 1:16 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Shai Erera wrote: > > > No - I didn't try to populate an index with real data and run real > > queries > > (what is "real" after all?). I know from my experience of indexes wi

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Michael McCandless
Shai Erera wrote: No - I didn't try to populate an index with real data and run real queries (what is "real" after all?). I know from my experience of indexes with several million documents where there are queries with several hundred thousand results (one query even hit 2.5 M documents

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
No - I didn't try to populate an index with real data and run real queries (what is "real" after all?). I know from my experience of indexes with several million documents where there are queries with several hundred thousand results (one query even hit 2.5 M documents). This is typical in sea

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Michael McCandless
Have you done any tests on real queries to see what impact this improvement has in practice? Or, to measure how many ScoreDocs are "typically" allocated? Mike Shai Erera wrote: I agree w.r.t the current implementation, however in the worst case (as we tend to consider when talking abou

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
I agree w.r.t the current implementation, however in the worst case (as we tend to consider when talking about computer algorithms), it will allocate a ScoreDoc per result. With the overflow reuse, it will not allocate those objects, no matter what input it gets. Also, notice that there is a

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Paul Elschot
The current TopDocCollector only allocates a ScoreDoc when the given score causes a new ScoreDoc to be added into the queue, but it does not reuse anything that overflows out of the queue. So, reusing the overflow out of the queue should reduce object allocations. especially for indexes that tend t

Performance Improvement for Search using PriorityQueue

2007-12-09 Thread Shai Erera
Hi Lucene's PQ implements two methods: put (assumes the PQ has room for the object) and insert (checks whether the object can be inserted etc.). The implementation of insert() requires the application that uses it to allocate a new object every time it calls insert. Specifically, it cannot reuse t
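A minimal sketch of the reuse idea raised in this message, assuming a fixed-size min-heap whose insert method hands back whichever object did not end up in the queue. This is not Lucene's PriorityQueue code or the attached patch: insertWithOverflow is the method name used elsewhere in this thread, while ReusingCollector, Hit, and the field names are hypothetical.

```java
/** Hypothetical top-N collector built on a 1-based binary min-heap. */
final class ReusingCollector {
    /** Mutable hit, so an instance that falls out of the queue can be refilled. */
    static final class Hit {
        int doc;
        float score;
    }

    private final Hit[] heap;  // heap[1] holds the lowest-scoring hit
    int size;
    int allocations;           // counts how many Hit objects were created
    private Hit spare;         // last object returned by insertWithOverflow

    ReusingCollector(int capacity) {
        heap = new Hit[capacity + 1];
    }

    Hit top() {
        return size > 0 ? heap[1] : null;
    }

    /** Returns null if there was room, else the hit that is not in the queue. */
    Hit insertWithOverflow(Hit hit) {
        if (size < heap.length - 1) {
            heap[++size] = hit;
            upHeap();
            return null;
        }
        if (size > 0 && hit.score > heap[1].score) {
            Hit displaced = heap[1];
            heap[1] = hit;
            downHeap();
            return displaced;
        }
        return hit; // rejected: the caller gets its own object back
    }

    void collect(int doc, float score) {
        Hit hit = spare;
        if (hit == null) {     // nothing to reuse yet, allocate
            hit = new Hit();
            allocations++;
        }
        hit.doc = doc;
        hit.score = score;
        spare = insertWithOverflow(hit);
    }

    private void upHeap() {
        int i = size;
        Hit node = heap[i];
        int parent = i >>> 1;
        while (parent > 0 && node.score < heap[parent].score) {
            heap[i] = heap[parent];
            i = parent;
            parent = i >>> 1;
        }
        heap[i] = node;
    }

    private void downHeap() {
        int i = 1;
        Hit node = heap[i];
        int child = 2;
        if (child + 1 <= size && heap[child + 1].score < heap[child].score) {
            child++;
        }
        while (child <= size && heap[child].score < node.score) {
            heap[i] = heap[child];
            i = child;
            child = i << 1;
            if (child + 1 <= size && heap[child + 1].score < heap[child].score) {
                child++;
            }
        }
        heap[i] = node;
    }
}
```

With this shape, a collector that keeps the top N hits allocates at most N + 1 Hit objects regardless of how many documents match, because every object that leaves or misses the queue is refilled for the next document.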