Re: Batch searching

2009-07-23 Thread Matthew Hall
This was at least one of the threads that was bouncing around... I'm fairly sure there were others as well. Hopefully it's worth the read to you ^^ http://www.opensubscriber.com/message/java-...@lucene.apache.org/11079539.html Phil Whelan wrote: On Wed, Jul 22, 2009 at 12:28 PM, Matthew

Batch searching

2009-07-22 Thread tsuraan
If I understand lucene correctly, when doing multiple simultaneous searches on the same IndexSearcher, they will basically all do their own index scans and collect results independently. If that's correct, is there a way to batch searches together, so only one index scan is done? What I'd like

Re: Batch searching

2009-07-22 Thread Shai Erera
It's not accurate to say that Lucene scans the index for each search. Rather, every Query reads a set of posting lists, each of which is typically read from disk. If you pass a Query[] whose members have nothing in common (for example, no terms in common), then you won't gain anything, b/c each Query will
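
As a rough illustration of the point about shared posting lists, here is a toy model in Python. It does not use Lucene's API: `CountingIndex`, `search_independently` and `search_batched` are hypothetical names invented for this sketch, each query is simplified to an OR over its terms, and one call to `read_postings` stands in for one posting-list read from disk.

```python
from typing import Dict, List, Set


class CountingIndex:
    """Toy inverted index that counts how many posting lists are read."""

    def __init__(self, postings: Dict[str, List[int]]) -> None:
        self.postings = postings
        self.reads = 0

    def read_postings(self, term: str) -> List[int]:
        # Stands in for one posting-list read from disk.
        self.reads += 1
        return self.postings.get(term, [])


def search_independently(index: CountingIndex,
                         queries: List[List[str]]) -> List[Set[int]]:
    """Run each query on its own; every term is read again per query."""
    results = []
    for terms in queries:
        docs: Set[int] = set()
        for term in terms:
            docs.update(index.read_postings(term))
        results.append(docs)
    return results


def search_batched(index: CountingIndex,
                   queries: List[List[str]]) -> List[Set[int]]:
    """Run all queries together, reading each distinct term only once."""
    cache: Dict[str, List[int]] = {}
    results = []
    for terms in queries:
        docs: Set[int] = set()
        for term in terms:
            if term not in cache:
                cache[term] = index.read_postings(term)
            docs.update(cache[term])
        results.append(docs)
    return results
```

In this model, batching saves reads only when queries share terms: for the queries `[["a"], ["b"], ["c"]]` both functions read three posting lists, while for `[["a", "b"], ["a", "c"]]` the batched version reads three instead of four.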

Re: Batch searching

2009-07-22 Thread Matthew Hall
If you did this, wouldn't you be binding the processing of the results of all queries to that of the slowest performing one within the collection? I'm guessing you are trying for some sort of performance benefit by batch processing, but I question whether or not you will actually get more

Re: Batch searching

2009-07-22 Thread tsuraan
It's not accurate to say that Lucene scans the index for each search. Rather, every Query reads a set of posting lists, each of which is typically read from disk. If you pass a Query[] whose members have nothing in common (for example, no terms in common), then you won't gain anything, b/c each Query will

Re: Batch searching

2009-07-22 Thread Shai Erera
Queries cannot be ordered sequentially. Let's say that you run 3 Queries, w/ one term each a, b and c. On disk, the posting lists of the terms can look like this: post1(a), post1(c), post2(a), post1(b), post2(c), post2(b) etc. They are not guaranteed to be consecutive. The code makes sure the
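
The interleaved layout in the example above can be sketched in a few lines of Python. This is only a toy model of the six entries listed in the message, not Lucene's on-disk format; `DISK_LAYOUT`, `offsets_by_term` and `is_consecutive` are hypothetical names made up for the illustration.

```python
from typing import Dict, List, Tuple

# Layout copied from the example: post1(a), post1(c), post2(a),
# post1(b), post2(c), post2(b). Each entry is (term, posting number).
DISK_LAYOUT: List[Tuple[str, int]] = [
    ("a", 1), ("c", 1), ("a", 2), ("b", 1), ("c", 2), ("b", 2),
]


def offsets_by_term(layout: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Map each term to the positions its postings occupy in the layout."""
    offsets: Dict[str, List[int]] = {}
    for position, (term, _posting) in enumerate(layout):
        offsets.setdefault(term, []).append(position)
    return offsets


def is_consecutive(positions: List[int]) -> bool:
    """True if the positions form one unbroken run."""
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


offsets = offsets_by_term(DISK_LAYOUT)
for term in sorted(offsets):
    print(term, offsets[term], is_consecutive(offsets[term]))
```

With this layout, none of the three terms occupies one unbroken run of positions, so reading the postings for a, then b, then c does not amount to a single front-to-back pass.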

Re: Batch searching

2009-07-22 Thread tsuraan
If you did this, wouldn't you be binding the processing of the results of all queries to that of the slowest performing one within the collection? I would imagine it would, but I haven't seen too much variance between lucene query speeds in our data. I'm guessing you are trying for some sort

Re: Batch searching

2009-07-22 Thread Matthew Hall
Out of curiosity, what is the size of your corpus? How much and how quickly do you expect it to grow? I'm just trying to make sure that we are all on the same page here ^^ I can see the benefits of doing what you are describing with a very large corpus that is expected to grow at a quick rate,

Re: Batch searching

2009-07-22 Thread tsuraan
Out of curiosity, what is the size of your corpus? How much and how quickly do you expect it to grow? in terms of lucene documents, we tend to have in the 10M-100M range. Currently we use merging to make larger indices from smaller ones, so a single index can have a lot of documents in it, but

Re: Batch searching

2009-07-22 Thread Matthew Hall
Not sure if this helps you, but some of the issues you are facing seem similar to those in the real time search threads. Basically their problem involves indexing Twitter and the blogosphere, and making Lucene work for super large data sets like that. Perhaps some of the discussion in those

Re: Batch searching

2009-07-22 Thread Phil Whelan
On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall <mh...@informatics.jax.org> wrote: Not sure if this helps you, but some of the issues you are facing seem similar to those in the real time search threads. Hi Matthew, Do you have a pointer to where I can find the real time threads? Thanks, Phil