The current TopDocCollector only allocates a ScoreDoc when the given
score causes a new ScoreDoc to be added into the queue, but it does
not reuse anything that overflows out of the queue.
So, reusing the overflow out of the queue should reduce object
allocations, especially for indexes that tend t
I agree w.r.t. the current implementation; however, in the worst case (as we
tend to consider when talking about computer algorithms), it will allocate a
ScoreDoc per result. With the overflow reuse, it will not allocate those
objects, no matter what input it gets.
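A minimal sketch of the overflow-reuse idea follows. It is not the actual TopDocCollector code or the proposed patch: the class name, the Hit holder and the allocation counter are all hypothetical, and java.util.PriorityQueue stands in for Lucene's own queue. It only illustrates that once the queue is full, the entry that falls out can be overwritten and re-inserted, so the number of allocations is bounded by the queue size regardless of the input order.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical stand-in for illustration only; not Lucene's classes.
class ReusingCollectorSketch {

    /** Minimal mutable (doc, score) holder, assumed for this sketch. */
    static final class Hit {
        int doc;
        float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int maxSize;
    // Min-heap on score: the weakest retained hit sits at the head.
    private final PriorityQueue<Hit> queue;
    // Counts how many Hit objects were created (for demonstration).
    int allocations = 0;

    ReusingCollectorSketch(int maxSize) {
        this.maxSize = maxSize; // must be >= 1
        this.queue = new PriorityQueue<>(
            maxSize, Comparator.comparingDouble((Hit h) -> h.score));
    }

    void collect(int doc, float score) {
        if (queue.size() < maxSize) {
            // Queue not full yet: a new object is unavoidable.
            queue.add(new Hit(doc, score));
            allocations++;
        } else if (score > queue.peek().score) {
            // Queue full and the new hit qualifies: take the weakest
            // entry out, overwrite it, and put it back (no allocation).
            Hit recycled = queue.poll();
            recycled.doc = doc;
            recycled.score = score;
            queue.add(recycled);
        }
        // Otherwise the hit does not qualify and nothing is allocated.
    }

    int size() { return queue.size(); }

    float weakestScore() { return queue.peek().score; }

    public static void main(String[] args) {
        // Ascending scores are the worst case: every hit qualifies.
        ReusingCollectorSketch c = new ReusingCollectorSketch(10);
        for (int doc = 0; doc < 1000; doc++) {
            c.collect(doc, (float) doc);
        }
        System.out.println("allocations = " + c.allocations); // prints "allocations = 10"
    }
}
```

Without the reuse branch, the same ascending input would create one object per collected hit. Tie-breaking on doc id is deliberately left out of this sketch.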
Also, notice that there is a
BTW, HitQueue performs this comparison:
    protected final boolean lessThan(Object a, Object b) {
        ScoreDoc hitA = (ScoreDoc)a;
        ScoreDoc hitB = (ScoreDoc)b;
        if (hitA.score == hitB.score)
            return hitA.doc > hitB.doc;
        else
            return hitA.score < hitB.score;
    }
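To make the effect of that comparison concrete, here is a small self-contained sketch. The Hit class, the BEST_FIRST comparator and rankedDocs are hypothetical names introduced for illustration; only the body of lessThan mirrors the code quoted above. It assumes scores are never NaN. With equal scores, the hit with the larger doc id counts as "less", so ties end up ranked by ascending doc id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; only lessThan mirrors the quoted logic.
class HitOrderSketch {

    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Same rule as the quoted HitQueue.lessThan: lower score is "less";
    // on equal scores, the higher doc id is "less".
    static boolean lessThan(Hit a, Hit b) {
        if (a.score == b.score) {
            return a.doc > b.doc;
        }
        return a.score < b.score;
    }

    // Orders hits best-first, derived from lessThan (assumes no NaN scores).
    static final Comparator<Hit> BEST_FIRST =
        (a, b) -> lessThan(a, b) ? 1 : (lessThan(b, a) ? -1 : 0);

    // Returns the doc ids in ranked order, best hit first.
    static List<Integer> rankedDocs(Hit... hits) {
        List<Hit> sorted = new ArrayList<>(Arrays.asList(hits));
        sorted.sort(BEST_FIRST);
        List<Integer> docs = new ArrayList<>();
        for (Hit h : sorted) {
            docs.add(h.doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        // Docs 7 and 3 tie on score; doc 5 has the highest score.
        System.out.println(rankedDocs(
            new Hit(7, 1.0f), new Hit(3, 1.0f), new Hit(5, 2.0f)));
        // prints [5, 3, 7]
    }
}
```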
As you can see, t
Have you done any tests on real queries to see what impact this
improvement has in practice? Or, to measure how many ScoreDocs are
"typically" allocated?
Mike
Shai Erera wrote:
I agree w.r.t. the current implementation; however, in the worst case (as we
tend to consider when talking abou
Carrying this excellent question over to java-dev (see below).
The idea of "incrementally fixing up the FieldCache" has been
discussed before, eg most recently here:
http://www.gossamer-threads.com/lists/lucene/java-dev/53852#53852
And I think this issue from Hoss is working towards it:
No - I didn't try to populate an index with real data and run real queries
(what is "real" after all?). I know from my experience of indexes with
several million documents where there are queries with several hundred
thousand results (one query even hit 2.5M documents). This is typical in
sea
Michael McCandless wrote:
>
> I haven't really looked closely at these (I've been focusing on the
> indexing side of the house so far!), but, I do think these ideas are
> important to pursue soon (after 2.3). We do really need reopen at the
> IndexSearcher level to be as fast as it can be.
>
> I
Shai Erera wrote:
No - I didn't try to populate an index with real data and run real queries
(what is "real" after all?). I know from my experience of indexes with
several million documents where there are queries with several hundred
thousand results (one query even hit 2.5M documents
OK, excellent. I just wanted to make sure this thread is still
"alive" :) This is an important optimization to decrease cost of
opening & re-opening searchers.
Mike
Michael Busch wrote:
Michael McCandless wrote:
I haven't really looked closely at these (I've been focusing on the
indexing
Do you have a dataset and queries I can test on?
On Dec 10, 2007 1:16 PM, Michael McCandless <[EMAIL PROTECTED]>
wrote:
> Shai Erera wrote:
>
> > No - I didn't try to populate an index with real data and run real
> > queries
> > (what is "real" after all?). I know from my experience of indexes wi
I don't offhand. Working on the indexing side is so much easier :)
You mentioned your experience with large indices & large result sets
-- is that something you could draw on?
There have also been discussions lately about finding real search
logs we could use for exactly this reason, thou
I have access to TREC. I can try this.
W.r.t. the large indexes - I don't have access to the data, just scenarios
our customers ran into in the past.
Does the benchmark package include code to crawl Wikipedia? If not, do you
have such code? I don't want to write it from scratch for this specific
task.
I have a TREC index of 25M docs that I can try this
with. Shai, if you can create an issue and upload a
patch that I (and others) can experiment with, I
will try a few queries on this index with and
without your patch...
Doron
Michael McCandless <[EMAIL PROTECTED]> wrote on 10/12/2007
13:50:53:
Incorrect behavior in TrecDocMaker
--
Key: LUCENE-1086
URL: https://issues.apache.org/jira/browse/LUCENE-1086
Project: Lucene - Java
Issue Type: Bug
Components: contrib/benchmark
Reporter
[
https://issues.apache.org/jira/browse/LUCENE-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shai Erera updated LUCENE-1086:
---
Attachment: TrecDocMaker.patch
Simple patch that creates a File on docs.dir and if it is not absolut
[
https://issues.apache.org/jira/browse/LUCENE-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doron Cohen reassigned LUCENE-1086:
---
Assignee: Doron Cohen
> Incorrect behavior in TrecDocMaker
> ---
On Dec 7, 2007, at 3:01 PM, Mark Miller wrote:
Yes, and even if they did not use the stock defaults, I would bet
there would be complaints about what was done wrong at every turn.
This seems like a very difficult thing to do. How long does it take
to fully learn how to correctly utilize ea
OK, sounds like a plan, thanks!
Yes, contrib/benchmark has EnwikiDocMaker to generate docs off the
Wikipedia XML export file.
Mike
On Dec 10, 2007, at 7:03 AM, Shai Erera wrote:
I have access to TREC. I can try this.
W.r.t the large indexes - I don't have access to the data, just
scenar
[
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550078
]
Yonik Seeley commented on LUCENE-753:
-
Brad, one possible difference is the number of threads we tested with.
I t
On Monday 10 December 2007 09:19:43 Shai Erera wrote:
> I agree w.r.t. the current implementation; however, in the worst case (as we
> tend to consider when talking about computer algorithms), it will allocate a
> ScoreDoc per result. With the overflow reuse, it will not allocate those
> objects, no
Looks good! I especially like the visualizations and can see people
adding more visualization capabilities as it gets used more.
I don't know that we have ever checked in IDE settings (Eclipse
settings). In fact, I think we have svn:ignore setup in most places
for them. Aren't they user
[
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550128
]
Brian Pinkerton commented on LUCENE-753:
Whoops; I should have paid more attention to the args. The results
>>I don't know that we have ever checked in IDE settings
GWT development is much easier with the IDE and there is a fair amount of
manual setup required without the settings to run the "hosted" development
environment. Hosted development is the key productivity benefit and allows
debugging in J
[
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550130
]
Doug Cutting commented on LUCENE-753:
-
> Brad, [...]
That's Brian. And right, the difference in your tests is t
[
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550135
]
Doug Cutting commented on LUCENE-753:
-
My prior remarks were posted before I saw Brian's latest benchmarks.
Whil
On Dec 10, 2007, at 12:32 PM, mark harwood wrote:
I don't know that we have ever checked in IDE settings
GWT development is much easier with the IDE and there is a fair
amount of manual setup required without the settings to run the
"hosted" development environment. Hosted development is
I don't know that we have ever checked in IDE settings
GWT development is much easier with the IDE and there is a fair amount of manual setup required without the settings to run the "hosted" development environment. Hosted development is the key productivity benefit and allows debuggin
[
https://issues.apache.org/jira/browse/LUCENE-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550149
]
Grant Ingersoll commented on LUCENE-944:
Hmm, I was just trying to compile Luke source against the latest Luc
Explain shows incorrect docFreq number when used for documents in different
indices searched via MultiSearcher
--
Key: LUCENE-1087
URL: https://issues.apac
Hi
Well, I have results from a 1M index for now (the index contains 100K
documents duplicated 10 times, so it's not the final test I'll run, but it
still shows something). I ran 2000 short queries (2.4 keywords on average)
on a 1M docs index, after 50 queries warm-up. Following are the results:
C
On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:
+1 I have been thinking about this too. Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.
Would be nice if t
[
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550165
]
Yonik Seeley commented on LUCENE-753:
-
Weird... I'm still getting slower results from pread on WinXP.
Can someone
On 10-Dec-07, at 11:31 AM, Shai Erera wrote:
As you can see, the actual allocation time is really negligible and there
isn't much difference in the avg. running times of the queries. However, the
*current* runs performed a lot worse at the beginning, before the OS cache
warmed up.
This s
[
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550168
]
robert engels commented on LUCENE-753:
--
I sent this via email, but probably need to add to the thread...
I post
As has been pointed out on many threads, a modern JVM really doesn't
need you to recycle objects, especially for small short lived objects.
It is actually less efficient in many cases (since the variables need
to be reinitialized).
Using object pools (except when pooling external resources)
Shai Erera wrote:
Hi
Well, I have results from a 1M index for now (the index contains 100K
documents duplicated 10 times, so it's not the final test I'll run, but it
still shows something). I ran 2000 short queries (2.4 keywords on average)
on a 1M docs index, after 50 queries warm-up. Fol
Actually, queries on large indexes are not necessarily I/O bound. It depends
on how much of the posting list is being read into memory at once. I'm not
that familiar with the inner-most of Lucene, but let's assume a posting
element takes 4 bytes for docId and 2 more bytes per position in a document
Well .. I suspect this behavior is due to the nature of the index - 100K docs
duplicated 10 times. Therefore at some point it hits the same documents (and
scores). Like I said, tomorrow I'll re-run the test on a 10M unique docs
index.
I agree that 80 allocations are not much, but that's per query. The
I think you might be underestimating the IO cost a bit.
A modern drive does 40-70 MB per second, but that is sequential
reads. The seek time (9 ms is good) can kill you.
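A rough back-of-the-envelope sketch of that trade-off, using the 40-70 MB/s and 9 ms figures quoted above. The 10 MB read size and the count of 100 seeks are made-up inputs chosen purely for illustration, and the class and method names are hypothetical.

```java
// Hypothetical helper for illustration; the figures come from the
// message above, the workload numbers (10 MB, 100 seeks) are assumed.
class DiskCostSketch {

    // Time to read `megabytes` sequentially at `mbPerSecond`, in ms.
    static double sequentialReadMillis(double megabytes, double mbPerSecond) {
        return megabytes / mbPerSecond * 1000.0;
    }

    // Time spent purely on head movement for `seeks` random seeks, in ms.
    static double seekMillis(int seeks, double millisPerSeek) {
        return seeks * millisPerSeek;
    }

    public static void main(String[] args) {
        double slow = sequentialReadMillis(10, 40); // 250 ms at 40 MB/s
        double fast = sequentialReadMillis(10, 70); // roughly 143 ms at 70 MB/s
        double seeking = seekMillis(100, 9.0);      // 900 ms for 100 seeks
        System.out.printf("sequential 10 MB: %.0f-%.0f ms%n", fast, slow);
        System.out.printf("100 seeks at 9 ms: %.0f ms%n", seeking);
    }
}
```

Under these assumed inputs, a hundred seeks cost several times more than streaming 10 MB sequentially, which is the point being made about fragmented posting lists.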
Because of disk fragmentation, there is no guarantee that the posting
is sequential on the disk. Obviously the OS and dr
On 10-Dec-07, at 12:11 PM, Shai Erera wrote:
Actually, queries on large indexes are not necessarily I/O bound. It depends
on how much of the posting list is being read into memory at once. I'm not
that familiar with the inner-most of Lucene, but let's assume a posting
element takes 4 bytes
No I haven't done that (to be honest, I don't know how to do that ... :-) ).
That's the reason I ran both tests multiple times and reported the last run.
On Dec 10, 2007 10:24 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
> On 10-Dec-07, at 12:11 PM, Shai Erera wrote:
>
> > Actually, queries on large
[
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550179
]
Doug Cutting commented on LUCENE-753:
-
So it looks like pread is ~50% slower on Windows, and ~5-25% faster on oth
[
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550182
]
Michael McCandless commented on LUCENE-753:
---
{quote}
Is that enough of a difference that it might be worth
Shai Erera wrote:
> No I haven't done that (to be honest, I don't know how to do that ... :-) ).
> That's the reason I ran both tests multiple times and reported the last run.
>
Reboot your machine ;-) That's what I usually do - if there's another
way I'd like to know as well!
-Michael
[
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550187
]
robert engels commented on LUCENE-753:
--
As an aside, if the Lucene people voted on the Java bug (and or sent ema
I see :-). It's just that "clear up the disk cache" sounds so professional,
I assumed there's a way to do it that I don't know of ... :-)
Thanks, I'll report again with a larger index measurement.
On Dec 10, 2007 11:06 PM, Michael Busch <[EMAIL PROTECTED]> wrote:
> Shai Erera wrote:
> > No I have
[
https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550192
]
Grant Ingersoll commented on LUCENE-1068:
-
Hi Shai,
Thanks for the patch. Can you please add unit tests i
On Linux, use "drop_caches", see http://linux-mm.org/Drop_Caches
On "OS X", use "purge", which is part of the CHUD tools.
On Windows, I think you're hosed.
On Dec 10, 2007, at 3:06 PM, Michael Busch wrote:
Shai Erera wrote:
No I haven't done that (to be honest, I don't know how to do
that ...
Thanks for the info. Too bad I use Windows ...
On Dec 10, 2007 11:16 PM, robert engels <[EMAIL PROTECTED]> wrote:
> On linux, use "drop_caches", see http://linux-mm.org/Drop_Caches
> On "OS X", use "purge", which is part of the CHUD tools.
> On Windows, I think you're hosed.
>
> On Dec 10, 2007,
[
https://issues.apache.org/jira/browse/LUCENE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Ingersoll updated LUCENE-1017:
Priority: Minor (was: Major)
Lucene Fields: [New, Patch Available] (was: [Patch
[
https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550196
]
Shai Erera commented on LUCENE-1068:
Hi Grant,
I used Eclipse to generate the patch (right-click on
org.apache.
[
https://issues.apache.org/jira/browse/LUCENE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550198
]
Grant Ingersoll commented on LUCENE-1017:
-
Peter,
Can you share your test for measuring this? Ideally as a
On Monday, 10 December 2007, Michael Busch wrote:
> Reboot your machine ;-) That's what I usually do - if there's another
> way I'd like to know as well!
On Linux (kernel 2.6.16 and later), call:
sync ; echo 3 > /proc/sys/vm/drop_caches
Regards
Daniel
--
http://www.danielnaber.de
-
[
https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550202
]
Grant Ingersoll commented on LUCENE-1068:
-
Hmmm, maybe there is a way in Eclipse to make the path relative t
On 10-Dec-07, at 1:20 PM, Shai Erera wrote:
Thanks for the info. Too bad I use Windows ...
Just allocate a bunch of memory and free it. This is Linux, but
something similar should work on Windows:
$ vmstat -S M
procs ---memory--
r b swpd free buff cache
0 0 0
[
https://issues.apache.org/jira/browse/LUCENE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550203
]
Peter Keegan commented on LUCENE-1017:
--
Grant,
Unfortunately, my performance test bed isn't suitable for contr
[
https://issues.apache.org/jira/browse/LUCENE-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Ingersoll resolved LUCENE-1042.
-
Resolution: Fixed
Lucene Fields: [New, Patch Available] (was: [Patch Available, N
[
https://issues.apache.org/jira/browse/LUCENE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550207
]
Grant Ingersoll commented on LUCENE-1017:
-
{quote}
I still think it would be nice to have BoostingTermQuery
[
https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550209
]
Grant Ingersoll commented on LUCENE-550:
{quote}
courtesy of Olivier Chafik
{quote}
What does this mean? He
[
https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550215
]
Karl Wettin commented on LUCENE-550:
{quote}
Grant Ingersoll - 10/Dec/07 02:11 PM
> courtesy of Olivier Chafik
Wh
[
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yonik Seeley updated LUCENE-753:
Attachment: FileReadTest.java
Updated test that fixes some thread synchronization issues to ensure
[
https://issues.apache.org/jira/browse/LUCENE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550238
]
Peter Keegan commented on LUCENE-1017:
--
> What's the use case? Is there something that isn't possible with it a