Done (PriorityQueue-2.patch)
Shai
On Dec 12, 2007 1:46 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
> I think it's fine to include those changes with this issue.
>
> Mike
>
> Shai Erera wrote:
>
> > Hi
> >
> > I created https://issues.apache.org/jira/browse/LUCENE-1089 and
> > added a patch.
Hi
I created https://issues.apache.org/jira/browse/LUCENE-1089 and added a
patch.
I noticed that we can replace the calls to insert() with
insertWithOverflow() in several other places, like QualityQueriesFinder,
FuzzyQuery and TopFieldDocCollector. I wasn't sure if that should be handled
as part of this issue.
On Dec 11, 2007 1:21 PM, Timo Nentwig <[EMAIL PROTECTED]> wrote:
> On Tuesday 11 December 2007 14:32:12 Shai Erera wrote:
> > For (1) - I can't explain it but I've run into documents with 0.0f scores.
> > For (2) - this is simple logic - if the lowest score in the queue is 'x'
> > and you want the top docs only, then there's no point in attempting to
> > insert a document with a score lower than 'x' (it will not be added).
… protected not private, plus
the 2 tiny performance gains Nadav suggests below? Shai can you open
a Jira issue & attach a patch for these changes? Thanks!
Mike
Nadav Har'El wrote:
> On Mon, Dec 10, 2007, Shai Erera wrote about "Performance
> Improvement for Search using PriorityQueue": …
See my similar request from last March:
http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550
Peter
On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> wrote:
> On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for
> Search using PriorityQueue": …
On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for
Search using PriorityQueue":
> Hi
>
> Lucene's PQ implements two methods: put (assumes the PQ has room for the
> object) and insert (checks whether the object can be inserted etc.). The
> implementation of insert() requires the application that uses it to
> allocate a new object every time it calls insert. …
For (1) - I can't explain it but I've run into documents with 0.0f scores.
For (2) - this is simple logic - if the lowest score in the queue is 'x'
and you want the top docs only, then there's no point in attempting to insert
a document with a score lower than 'x' (it will not be added).
Maybe I did …
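The rule from point (2) above can be written down as a one-line predicate. This is an illustrative sketch with hypothetical names, not Lucene's actual collector API: with a full queue of size numHits whose minimum score is minScore, only a strictly higher score can enter.

```java
// Sketch of the "no point in attempting to insert" rule from point (2)
// above. Names are illustrative, not Lucene's API.
class QueueGate {
    /** True if a hit with this score could enter a bounded min-queue. */
    static boolean canEnterQueue(int queueSize, int numHits,
                                 float minScore, float score) {
        // While the queue has room, everything enters; once full,
        // only a score strictly above the current minimum can.
        return queueSize < numHits || score > minScore;
    }
}
```

Note the strict comparison: a hit that only ties the minimum cannot displace it, which is exactly why the collector can skip it without touching the queue.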
On Monday 10 December 2007 09:15:12 Paul Elschot wrote:
> The current TopDocCollector only allocates a ScoreDoc when the given
> score causes a new ScoreDoc to be added into the queue, but it does
> not reuse anything that overflows out of the queue.
I actually wrote my own HitCollector and now wonder about TopDocCollector:
public void collect(int doc, float score) { …
Hi
Back from the experiments lab with more results. I've used two indexes (1
and 10 million documents) and ran the 2000 queries over both. Each run was
executed 4 times and I paste here the average of the latest 3 (to eliminate
any caching that is done by the OS and to mimic systems that are already …
On 10-Dec-07, at 1:20 PM, Shai Erera wrote:
> Thanks for the info. Too bad I use Windows ...
Just allocate a bunch of memory and free it. This is Linux, but
something similar should work on Windows:
$ vmstat -S M
procs -----------memory----------
 r  b   swpd   free   buff  cache
 0  0      0 …
On Monday, 10 December 2007, Michael Busch wrote:
> Reboot your machine ;-) That's what I usually do - if there's another
> way I'd like to know as well!
On Linux (kernel 2.6.16 and later), call:
sync ; echo 3 > /proc/sys/vm/drop_caches
Regards
Daniel
--
http://www.danielnaber.de
Thanks for the info. Too bad I use Windows ...
On Dec 10, 2007 11:16 PM, robert engels <[EMAIL PROTECTED]> wrote:
> On linux, use "drop_caches", see http://linux-mm.org/Drop_Caches
> On "OS X", use "purge", which is part of the CHUD tools.
> On Windows, I think you're hosed.
On linux, use "drop_caches", see http://linux-mm.org/Drop_Caches
On "OS X", use "purge", which is part of the CHUD tools.
On Windows, I think you're hosed.
On Dec 10, 2007, at 3:06 PM, Michael Busch wrote:
> Shai Erera wrote:
> > No I haven't done that (to be honest, I don't know how to do that ...) …
I see :-). It's just that "clear up the disk cache" sounds so professional,
I assumed there's a way to do it that I don't know of ... :-)
Thanks, I'll report again with a larger index measurement.
On Dec 10, 2007 11:06 PM, Michael Busch <[EMAIL PROTECTED]> wrote:
> Shai Erera wrote:
> > No I haven't done that …
Shai Erera wrote:
> No I haven't done that (to be honest, I don't know how to do that ... :-) ).
> That's the reason I ran both tests multiple times and reported the last run.
>
Reboot your machine ;-) That's what I usually do - if there's another
way I'd like to know as well!
-Michael
No I haven't done that (to be honest, I don't know how to do that ... :-) ).
That's the reason I ran both tests multiple times and reported the last run.
On Dec 10, 2007 10:24 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
> On 10-Dec-07, at 12:11 PM, Shai Erera wrote:
> > Actually, queries on large indexes are not necessarily I/O bound. …
On 10-Dec-07, at 12:11 PM, Shai Erera wrote:
> Actually, queries on large indexes are not necessarily I/O bound. It depends
> on how much of the posting list is being read into memory at once. I'm not
> that familiar with the innermost parts of Lucene, but let's assume a posting
> element takes 4 bytes for docId and 2 more bytes per position in a document.
I think you might be underestimating the I/O cost a bit.
A modern drive does 40-70 MB per second, but that is for sequential
reads. The seek time (9 ms is good) can kill you.
Because of disk fragmentation, there is no guarantee that the posting
is sequential on the disk. Obviously the OS and dr…
Well .. I suspect this behavior is due to the nature of the index - 100K docs
duplicated 10 times. Therefore at some point it hits the same documents (and
scores). Like I said, tomorrow I'll re-run the test on a 10M unique docs
index.
I agree that 80 allocations are not much, but that's per query. The …
Actually, queries on large indexes are not necessarily I/O bound. It depends
on how much of the posting list is being read into memory at once. I'm not
that familiar with the innermost parts of Lucene, but let's assume a posting
element takes 4 bytes for docId and 2 more bytes per position in a document.
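The sizing assumption above can be made concrete with a back-of-envelope calculation. The byte counts (4 per docId, 2 per position) are the message's assumptions, not Lucene's actual on-disk format, which uses variable-byte encoding and is smaller in practice; the 5-positions-per-document average below is purely hypothetical.

```java
// Back-of-envelope posting-list size using the assumed 4 bytes per docId
// and 2 bytes per position from the message above. Lucene actually uses
// variable-byte encoding, so real sizes are smaller.
class PostingSize {
    static long bytes(long docs, long avgPositionsPerDoc) {
        return docs * (4 + 2 * avgPositionsPerDoc);
    }

    public static void main(String[] args) {
        // A term hitting 2.5M documents (a figure quoted elsewhere in the
        // thread), with a hypothetical 5 occurrences per document:
        long b = bytes(2_500_000L, 5);
        System.out.println(b / (1024 * 1024) + " MB");
    }
}
```

Even at sequential-read speeds of 40-70 MB/s (robert's numbers below), tens of megabytes per frequent term explain why such queries can still be I/O bound.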
Shai Erera wrote:
> Hi
> Well, I have results from a 1M index for now (the index contains 100K
> documents duplicated 10 times, so it's not the final test I'll run, but it
> still shows something). I ran 2000 short queries (2.4 keywords on average)
> on a 1M docs index, after 50 queries warm-up. Following are the results: …
As has been pointed out on many threads, a modern JVM really doesn't
need you to recycle objects, especially for small short-lived objects.
It is actually less efficient in many cases (since the variables need
to be reinitialized).
Using object pools (except when pooling external resources) …
On 10-Dec-07, at 11:31 AM, Shai Erera wrote:
> As you can see, the actual allocation time is really negligible and there
> isn't much difference in the avg. running times of the queries. However, the
> *current* runs performed a lot worse at the beginning, before the OS cache
> warmed up.
This s…
Hi
Well, I have results from a 1M index for now (the index contains 100K
documents duplicated 10 times, so it's not the final test I'll run, but it
still shows something). I ran 2000 short queries (2.4 keywords on average)
on a 1M docs index, after 50 queries warm-up. Following are the results: …
On Monday 10 December 2007 09:19:43 Shai Erera wrote:
> I agree w.r.t the current implementation, however in the worst case (as we
> tend to consider when talking about computer algorithms), it will allocate a
> ScoreDoc per result. With the overflow reuse, it will not allocate those
> objects, no matter what input it gets.
OK, sounds like a plan, thanks!
Yes, contrib/benchmark has EnwikiDocMaker to generate docs off the
Wikipedia XML export file.
Mike
On Dec 10, 2007, at 7:03 AM, Shai Erera wrote:
> I have access to TREC. I can try this.
> W.r.t the large indexes - I don't have access to the data, just
> scenarios …
I have a TREC index of 25M docs that I can try this
with. Shai, if you can create an issue and upload a
patch that I (and others) can experiment with, I
will try a few queries on this index with and
without your patch...
Doron
Michael McCandless <[EMAIL PROTECTED]> wrote on 10/12/2007
13:50:53:
I have access to TREC. I can try this.
W.r.t the large indexes - I don't have access to the data, just scenarios
our customers ran into in the past.
Does the benchmark package include code to crawl Wikipedia? If not, do you
have such code? I don't want to write it from scratch for this specific
task.
I don't offhand. Working on the indexing side is so much easier :)
You mentioned your experience with large indices & large result sets
-- is that something you could draw on?
There have also been discussions lately about finding real search
logs we could use for exactly this reason, though …
Do you have a dataset and queries I can test on?
On Dec 10, 2007 1:16 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> Shai Erera wrote:
> > No - I didn't try to populate an index with real data and run real queries
> > (what is "real" after all?). I know from my experience of indexes with …
Shai Erera wrote:
> No - I didn't try to populate an index with real data and run real queries
> (what is "real" after all?). I know from my experience of indexes with
> several millions of documents where there are queries with several hundred
> thousand results (one query even hit 2.5M documents). …
No - I didn't try to populate an index with real data and run real queries
(what is "real" after all?). I know from my experience of indexes with
several millions of documents where there are queries with several hundred
thousand results (one query even hit 2.5M documents). This is typical in
search …
Have you done any tests on real queries to see what impact this
improvement has in practice? Or, to measure how many ScoreDocs are
"typically" allocated?
Mike
Shai Erera wrote:
> I agree w.r.t the current implementation, however in the worst case (as we
> tend to consider when talking about computer algorithms) …
I agree w.r.t the current implementation, however in the worst case (as we
tend to consider when talking about computer algorithms), it will allocate a
ScoreDoc per result. With the overflow reuse, it will not allocate those
objects, no matter what input it gets.
Also, notice that there is a …
The current TopDocCollector only allocates a ScoreDoc when the given
score causes a new ScoreDoc to be added into the queue, but it does
not reuse anything that overflows out of the queue.
So, reusing the overflow out of the queue should reduce object
allocations, especially for indexes that tend to …
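The overflow-reuse idea Paul describes can be sketched as a collector that recycles the evicted ScoreDoc for the next hit. This is an illustrative sketch built on java.util.PriorityQueue, not Lucene's actual TopDocCollector; the ScoreDoc class and the allocations counter are added here just to make the saving visible.

```java
import java.util.PriorityQueue;

// Sketch of the overflow-reuse idea described above (illustrative, not
// Lucene's actual TopDocCollector): when the queue is full and a new hit
// beats the minimum, the evicted ScoreDoc object is recycled for the next
// hit instead of allocating a fresh one.
class ReusingCollector {
    static final class ScoreDoc {
        int doc; float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final PriorityQueue<ScoreDoc> pq =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    private final int numHits;
    private ScoreDoc spare;  // recycled overflow, if any
    int allocations = 0;     // counts actual `new ScoreDoc` calls

    ReusingCollector(int numHits) { this.numHits = numHits; }

    void collect(int doc, float score) {
        if (pq.size() == numHits && score <= pq.peek().score) {
            return;          // cannot enter the queue; nothing allocated
        }
        ScoreDoc sd = spare;
        if (sd == null) {
            sd = new ScoreDoc(doc, score);
            allocations++;
        } else {
            sd.doc = doc; sd.score = score;
            spare = null;
        }
        pq.add(sd);
        if (pq.size() > numHits) {
            spare = pq.poll();  // evicted object becomes the next spare
        }
    }

    float minScore() { return pq.peek().score; }
}
```

With this shape the collector allocates at most numHits + 1 ScoreDoc objects total, however many hits stream through; the current TopDocCollector, as Paul notes, allocates one per queue-entering hit.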
Hi
Lucene's PQ implements two methods: put (assumes the PQ has room for the
object) and insert (checks whether the object can be inserted etc.). The
implementation of insert() requires the application that uses it to allocate
a new object every time it calls insert. Specifically, it cannot reuse the …
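The contract Shai is proposing (later committed as insertWithOverflow in LUCENE-1089) can be sketched on a bounded min-queue. This is a simplified illustration over java.util.PriorityQueue, not Lucene's array-based PriorityQueue: one call either inserts, rejects, or replaces, and in the latter two cases hands an object back to the caller for reuse.

```java
import java.util.PriorityQueue;

// Simplified sketch of the insertWithOverflow contract (not Lucene's
// actual array-based PriorityQueue): the caller always gets back any
// object that did not stay in the queue, so it can be reused.
class BoundedMinQueue {
    private final PriorityQueue<Float> heap = new PriorityQueue<>();
    private final int maxSize;

    BoundedMinQueue(int maxSize) { this.maxSize = maxSize; }

    /**
     * Returns null if there was room, the argument itself if it scored
     * too low to enter, or the evicted minimum if it replaced an element.
     */
    Float insertWithOverflow(Float value) {
        if (heap.size() < maxSize) {
            heap.add(value);
            return null;
        }
        if (value <= heap.peek()) {
            return value;        // rejected; caller may reuse this object
        }
        Float evicted = heap.poll();
        heap.add(value);
        return evicted;          // evicted; caller may reuse this object
    }

    float min() { return heap.peek(); }
}
```

Compared with the existing insert(), which just reports true/false, returning the displaced object is what lets callers like TopDocCollector recycle ScoreDocs instead of allocating one per accepted hit.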