Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Paul Elschot
The current TopDocCollector only allocates a ScoreDoc when the given score causes a new ScoreDoc to be added into the queue, but it does not reuse anything that overflows out of the queue. So, reusing the overflow out of the queue should reduce object allocations, especially for indexes that tend t
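
The overflow reuse described above can be sketched roughly as follows. This is a minimal illustration and not Lucene's code: `Hit` is a hypothetical stand-in for ScoreDoc, `ReusingTopHits` is a hypothetical collector, and `java.util.PriorityQueue` is assumed in place of Lucene's own PriorityQueue. Once the queue is full, the weakest entry is popped, overwritten with the new doc and score, and pushed back, so no further objects are allocated.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for ScoreDoc; fields are mutable so an instance can be reused. */
final class Hit {
    int doc;
    float score;

    Hit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

/** Hypothetical top-N collector that recycles the entry overflowing out of the queue. */
final class ReusingTopHits {
    // Weakest hit first: lower score, and on equal scores the higher doc id.
    private static final Comparator<Hit> WEAKEST_FIRST = (a, b) -> {
        if (a.score != b.score) {
            return Float.compare(a.score, b.score);
        }
        return Integer.compare(b.doc, a.doc);
    };

    private final int maxSize;
    private final PriorityQueue<Hit> queue;
    int allocations; // number of Hit objects created, kept only for illustration

    ReusingTopHits(int maxSize) {
        this.maxSize = maxSize;
        this.queue = new PriorityQueue<>(maxSize, WEAKEST_FIRST);
    }

    void collect(int doc, float score) {
        if (queue.size() < maxSize) {
            // Queue not yet full: this is the only place a Hit is allocated.
            allocations++;
            queue.add(new Hit(doc, score));
            return;
        }
        Hit weakest = queue.peek();
        boolean notCompetitive = score < weakest.score
                || (score == weakest.score && doc >= weakest.doc);
        if (notCompetitive) {
            return; // nothing enters the queue, nothing is allocated
        }
        // Reuse the overflowing entry instead of allocating a new one.
        Hit recycled = queue.poll();
        recycled.doc = doc;
        recycled.score = score;
        queue.add(recycled);
    }

    /** Returns the collected hits, best first. */
    List<Hit> bestFirst() {
        List<Hit> out = new ArrayList<>(queue);
        out.sort(WEAKEST_FIRST.reversed());
        return out;
    }
}
```

With this shape the number of allocations is bounded by the queue size regardless of how many hits are collected; whether that matters in practice is the question the rest of the thread goes on to measure.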

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
I agree w.r.t the current implementation, however in the worst case (as we tend to consider when talking about computer algorithms), it will allocate a ScoreDoc per result. With the overflow reuse, it will not allocate those objects, no matter what input it gets. Also, notice that there is a

Re: Comparable ScoreDoc

2007-12-10 Thread Shai Erera
BTW, HitQueue performs this comparison:

    protected final boolean lessThan(Object a, Object b) {
      ScoreDoc hitA = (ScoreDoc) a;
      ScoreDoc hitB = (ScoreDoc) b;
      if (hitA.score == hitB.score)
        return hitA.doc > hitB.doc;
      else
        return hitA.score < hitB.score;
    }

As you can see, t

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Michael McCandless
Have you done any tests on real queries to see what impact this improvement has in practice? Or, to measure how many ScoreDocs are "typically" allocated? Mike Shai Erera wrote: I agree w.r.t the current implementation, however in the worst case (as we tend to consider when talking abou

Fwd: Can changes on an index be visible to an open IndexSearcher without reopening it?

2007-12-10 Thread Michael McCandless
Carrying this excellent question over to java-dev (see below). The idea of "incrementally fixing up the FieldCache" has been discussed before, eg most recently here: http://www.gossamer-threads.com/lists/lucene/java-dev/53852#53852 And I think this issue from Hoss is working towards it:

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
No - I didn't try to populate an index with real data and run real queries (what is "real" after all?). I know from my experience of indexes with several million documents where there are queries with several hundred thousand results (one query even hit 2.5 M documents). This is typical in sea

Re: Fwd: Can changes on an index be visible to an open IndexSearcher without reopening it?

2007-12-10 Thread Michael Busch
Michael McCandless wrote: > > I haven't really looked closely at these (I've been focusing on the > indexing side of the house so far!), but, I do think these ideas are > important to pursue soon (after 2.3). We do really need reopen at the > IndexSearcher level to be as fast as it can be. > > I

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Michael McCandless
Shai Erera wrote: No - I didn't try to populate an index with real data and run real queries (what is "real" after all?). I know from my experience of indexes with several million documents where there are queries with several hundred thousand results (one query even hit 2.5 M documents

Re: Can changes on an index be visible to an open IndexSearcher without reopening it?

2007-12-10 Thread Michael McCandless
OK, excellent. I just wanted to make sure this thread is still "alive" :) This is an important optimization to decrease cost of opening & re-opening searchers. Mike Michael Busch wrote: Michael McCandless wrote: I haven't really looked closely at these (I've been focusing on the indexing

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
Do you have a dataset and queries I can test on? On Dec 10, 2007 1:16 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Shai Erera wrote: > > > No - I didn't try to populate an index with real data and run real > > queries > > (what is "real" after all?). I know from my experience of indexes wi

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Michael McCandless
I don't offhand. Working on the indexing side is so much easier :) You mentioned your experience with large indices & large result sets -- is that something you could draw on? There have also been discussions lately about finding real search logs we could use for exactly this reason, thou

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
I have access to TREC. I can try this. W.r.t the large indexes - I don't have access to the data, just scenarios our customers ran into in the past. Does the benchmark package include code to crawl Wikipedia? If not, do you have such code? I don't want to write it from scratch for this specific task.

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Doron Cohen
I have a TREC index of 25M docs that I can try this with. Shai, if you can create an issue and upload a patch that I (and others) can experiment with, I will try a few queries on this index with and without your patch... Doron Michael McCandless <[EMAIL PROTECTED]> wrote on 10/12/2007 13:50:53:

[jira] Created: (LUCENE-1086) Incorrect behavior in TrecDocMaker

2007-12-10 Thread Shai Erera (JIRA)
Incorrect behavior in TrecDocMaker -- Key: LUCENE-1086 URL: https://issues.apache.org/jira/browse/LUCENE-1086 Project: Lucene - Java Issue Type: Bug Components: contrib/benchmark Reporter

[jira] Updated: (LUCENE-1086) Incorrect behavior in TrecDocMaker

2007-12-10 Thread Shai Erera (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1086: --- Attachment: TrecDocMaker.patch Simple patch that creates a File on docs.dir and if it is not absolut

[jira] Assigned: (LUCENE-1086) Incorrect behavior in TrecDocMaker

2007-12-10 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen reassigned LUCENE-1086: --- Assignee: Doron Cohen > Incorrect behavior in TrecDocMaker > ---

Re: O/S Search Comparisons

2007-12-10 Thread Grant Ingersoll
On Dec 7, 2007, at 3:01 PM, Mark Miller wrote: Yes, and even if they did not use the stock defaults, I would bet there would be complaints about what was done wrong at every turn. This seems like a very difficult thing to do. How long does it take to fully learn how to correctly utilize ea

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Michael McCandless
OK, sounds like a plan, thanks! Yes, contrib/benchmark has EnwikiDocMaker to generate docs off the Wikipedia XML export file. Mike On Dec 10, 2007, at 7:03 AM, Shai Erera wrote: I have access to TREC. I can try this. W.r.t the large indexes - I don't have access to the data, just scenar

[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-10 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550078 ] Yonik Seeley commented on LUCENE-753: - Brad, one possible difference is the number of threads we tested with. I t
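
For readers unfamiliar with the technique named in the issue title, the following is a minimal sketch of an NIO positional read. It is not taken from the LUCENE-753 patch, which is not shown here; `PositionalReadSketch`, `readAt` and `demo` are hypothetical names. The point it illustrates is that `FileChannel.read(ByteBuffer, long)` reads at an explicit offset and leaves the channel's own position untouched, so concurrent readers sharing one channel do not need to serialize a seek-then-read pair.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class PositionalReadSketch {

    /** Reads up to len bytes starting at offset, without moving the channel's position. */
    static byte[] readAt(FileChannel channel, long offset, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        long pos = offset;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos); // positional read: no shared seek state
            if (n < 0) {
                break; // reached end of file before len bytes were available
            }
            pos += n;
        }
        buf.flip();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    /** Writes a small temporary file and reads a slice from the middle of it. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("pread-sketch", ".bin");
            try {
                Files.write(tmp, "hello world".getBytes(StandardCharsets.US_ASCII));
                try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                    byte[] slice = readAt(ch, 6, 5);
                    if (ch.position() != 0) {
                        throw new IllegalStateException("channel position moved");
                    }
                    return new String(slice, StandardCharsets.US_ASCII);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether this is actually faster than a synchronized seek-and-read depends on the platform and thread count, which is what the benchmarks discussed in this issue are about.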

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Paul Elschot
On Monday 10 December 2007 09:19:43 Shai Erera wrote: > I agree w.r.t the current implementation, however in the worst case (as we > tend to consider when talking about computer algorithms), it will allocate a > ScoreDoc per result. With the overflow reuse, it will not allocate those > objects, no

Re: WebLuke - include Jetty in Lucene binary distribution?

2007-12-10 Thread Grant Ingersoll
Looks good! I especially like the visualizations and can see people adding more visualization capabilities as it gets used more. I don't know that we have ever checked in IDE settings (Eclipse settings). In fact, I think we have svn:ignore setup in most places for them. Aren't they user

[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-10 Thread Brian Pinkerton (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550128 ] Brian Pinkerton commented on LUCENE-753: Whoops; I should have paid more attention to the args. The results

Re: WebLuke - include Jetty in Lucene binary distribution?

2007-12-10 Thread mark harwood
>>I don't know that we have ever checked in IDE settings GWT development is much easier with the IDE and there is a fair amount of manual setup required without the settings to run the "hosted" development environment. Hosted development is the key productivity benefit and allows debugging in J

[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-10 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550130 ] Doug Cutting commented on LUCENE-753: - > Brad, [...] That's Brian. And right, the difference in your tests is t

[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-10 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550135 ] Doug Cutting commented on LUCENE-753: - My prior remarks were posted before I saw Brian's latest benchmarks. Whil

Re: WebLuke - include Jetty in Lucene binary distribution?

2007-12-10 Thread Grant Ingersoll
On Dec 10, 2007, at 12:32 PM, mark harwood wrote: I don't know that we have ever checked in IDE settings GWT development is much easier with the IDE and there is a fair amount of manual setup required without the settings to run the "hosted" development environment. Hosted development is

Re: WebLuke - include Jetty in Lucene binary distribution?

2007-12-10 Thread Mark Miller
I don't know that we have ever checked in IDE settings GWT development is much easier with the IDE and there is a fair amount of manual setup required without the settings to run the "hosted" development environment. Hosted development is the key productivity benefit and allows debuggin

[jira] Commented: (LUCENE-944) Remove deprecated methods in BooleanQuery

2007-12-10 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550149 ] Grant Ingersoll commented on LUCENE-944: Hmm, I was just trying to compile Luke source against the latest Luc

[jira] Created: (LUCENE-1087) Explain shows incorrect docFreq number when used for documents in different indices searched via MultiSearcher

2007-12-10 Thread Yasoja Seneviratne (JIRA)
Explain shows incorrect docFreq number when used for documents in different indices searched via MultiSearcher -- Key: LUCENE-1087 URL: https://issues.apac

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
Hi Well, I have results from a 1M index for now (the index contains 100K documents duplicated 10 times, so it's not the final test I'll run, but it still shows something). I ran 2000 short queries (2.4 keywords on average) on a 1M docs index, after 50 queries warm-up. Following are the results: C

Re: O/S Search Comparisons

2007-12-10 Thread Mike Klaas
On 8-Dec-07, at 10:04 PM, Doron Cohen wrote: +1 I have been thinking about this too. Solr clearly demonstrates the benefits of this kind of approach, although even it doesn't make it seamless for users in the sense that they still need to divvy up the docs on the app side. Would be nice if t

[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-10 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550165 ] Yonik Seeley commented on LUCENE-753: - Weird... I'm still getting slower results from pread on WinXP. Can someone

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Mike Klaas
On 10-Dec-07, at 11:31 AM, Shai Erera wrote: As you can see, the actual allocation time is really negligible and there isn't much difference in the avg. running times of the queries. However, the *current* runs performed a lot worse at the beginning, before the OS cache warmed up. This s

[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-10 Thread robert engels (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550168 ] robert engels commented on LUCENE-753: -- I sent this via email, but probably need to add to the thread... I post

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread robert engels
As has been pointed out on many threads, a modern JVM really doesn't need you to recycle objects, especially for small short lived objects. It is actually less efficient in many cases (since the variables need to be reinitialized). Using object pools (except when pooling external resources)

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Michael McCandless
Shai Erera wrote: Hi Well, I have results from a 1M index for now (the index contains 100K documents duplicated 10 times, so it's not the final test I'll run, but it still shows something). I ran 2000 short queries (2.4 keywords on average) on a 1M docs index, after 50 queries warm-up. Fol

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
Actually, queries on large indexes are not necessarily I/O bound. It depends on how much of the posting list is being read into memory at once. I'm not that familiar with the inner workings of Lucene, but let's assume a posting element takes 4 bytes for docId and 2 more bytes per position in a document

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
Well .. I suspect this behavior is due to the nature of the index - 100K docs duplicated 10 times. Therefore at some point it hits the same documents (and scores). Like I said, tomorrow I'll re-run the test on a 10M unique docs index. I agree that 80 allocations are not much, but that's per query. The

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread robert engels
I think you might be underestimating the IO cost a bit. A modern drive does 40-70 MB per second, but that is sequential reads. The seek time (9 ms is good) can kill you. Because of disk fragmentation, there is no guarantee that the posting is sequential on the disk. Obviously the OS and dr

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Mike Klaas
On 10-Dec-07, at 12:11 PM, Shai Erera wrote: Actually, queries on large indexes are not necessarily I/O bound. It depends on how much of the posting list is being read into memory at once. I'm not that familiar with the inner workings of Lucene, but let's assume a posting element takes 4 bytes

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
No I haven't done that (to be honest, I don't know how to do that ... :-) ). That's the reason I ran both tests multiple times and reported the last run. On Dec 10, 2007 10:24 PM, Mike Klaas <[EMAIL PROTECTED]> wrote: > On 10-Dec-07, at 12:11 PM, Shai Erera wrote: > > > Actually, queries on large

[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-10 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550179 ] Doug Cutting commented on LUCENE-753: - So it looks like pread is ~50% slower on Windows, and ~5-25% faster on oth

[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550182 ] Michael McCandless commented on LUCENE-753: --- {quote} Is that enough of a difference that it might be worth

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Michael Busch
Shai Erera wrote: > No I haven't done that (to be honest, I don't know how to do that ... :-) ). > That's the reason I ran both tests multiple times and reported the last run. > Reboot your machine ;-) That's what I usually do - if there's another way I'd like to know as well! -Michael

[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-10 Thread robert engels (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550187 ] robert engels commented on LUCENE-753: -- As an aside, if the Lucene people voted on the Java bug (and/or sent ema

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
I see :-). It's just that "clear up the disk cache" sounds so professional, I assumed there's a way to do it that I don't know of ... :-) Thanks, I'll report again with a larger index measurement. On Dec 10, 2007 11:06 PM, Michael Busch <[EMAIL PROTECTED]> wrote: > Shai Erera wrote: > > No I have

[jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

2007-12-10 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550192 ] Grant Ingersoll commented on LUCENE-1068: - Hi Shai, Thanks for the patch. Can you please add unit tests i

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread robert engels
On linux, use "drop_caches", see http://linux-mm.org/Drop_Caches On "OS X", use "purge", which is part of the CHUD tools. On Windows, I think you're hosed. On Dec 10, 2007, at 3:06 PM, Michael Busch wrote: Shai Erera wrote: No I haven't done that (to be honest, I don't know how to do that ...

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Shai Erera
Thanks for the info. Too bad I use Windows ... On Dec 10, 2007 11:16 PM, robert engels <[EMAIL PROTECTED]> wrote: > On linux, use "drop_caches", see http://linux-mm.org/Drop_Caches > On "OS X", use "purge", which is part of the CHUD tools. > On Windows, I think you're hosed. > > On Dec 10, 2007,

[jira] Updated: (LUCENE-1017) BoostingTermQuery performance

2007-12-10 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1017: Priority: Minor (was: Major) Lucene Fields: [New, Patch Available] (was: [Patch

[jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

2007-12-10 Thread Shai Erera (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550196 ] Shai Erera commented on LUCENE-1068: Hi Grant, I used Eclipse to generate the patch (right-click on org.apache.

[jira] Commented: (LUCENE-1017) BoostingTermQuery performance

2007-12-10 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550198 ] Grant Ingersoll commented on LUCENE-1017: - Peter, Can you share your test for measuring this? Ideally as a

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Daniel Naber
On Monday, 10 December 2007, Michael Busch wrote: > Reboot your machine ;-) That's what I usually do - if there's another > way I'd like to know as well! On Linux (kernel 2.6.16 and later), call: sync ; echo 3 > /proc/sys/vm/drop_caches Regards Daniel -- http://www.danielnaber.de -

[jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

2007-12-10 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550202 ] Grant Ingersoll commented on LUCENE-1068: - Hmmm, maybe there is a way in Eclipse to make the path relative t

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Mike Klaas
On 10-Dec-07, at 1:20 PM, Shai Erera wrote: Thanks for the info. Too bad I use Windows ... Just allocate a bunch of memory and free it. This is Linux, but something similar should work on Windows: $ vmstat -S M procs ---memory-- r b swpd free buff cache 0 0 0

[jira] Commented: (LUCENE-1017) BoostingTermQuery performance

2007-12-10 Thread Peter Keegan (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550203 ] Peter Keegan commented on LUCENE-1017: -- Grant, Unfortunately, my performance test bed isn't suitable for contr

[jira] Resolved: (LUCENE-1042) discrepancy in getTermFreqVector-methods

2007-12-10 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved LUCENE-1042. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, N

[jira] Commented: (LUCENE-1017) BoostingTermQuery performance

2007-12-10 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550207 ] Grant Ingersoll commented on LUCENE-1017: - {quote} I still think it would be nice to have BoostingTermQuery

[jira] Commented: (LUCENE-550) InstantiatedIndex - faster but memory consuming index

2007-12-10 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550209 ] Grant Ingersoll commented on LUCENE-550: {quote} courtesy of Olivier Chafik {quote} What does this mean? He

[jira] Commented: (LUCENE-550) InstantiatedIndex - faster but memory consuming index

2007-12-10 Thread Karl Wettin (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550215 ] Karl Wettin commented on LUCENE-550: {quote} Grant Ingersoll - 10/Dec/07 02:11 PM > courtesy of Olivier Chafik Wh

[jira] Updated: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2007-12-10 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated LUCENE-753: Attachment: FileReadTest.java Updated test that fixes some thread synchronization issues to ensure

[jira] Commented: (LUCENE-1017) BoostingTermQuery performance

2007-12-10 Thread Peter Keegan (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550238 ] Peter Keegan commented on LUCENE-1017: -- > What's the use case? Is there something that isn't possible with it a