RE: Parallelising a query...

2001-11-29 Thread Winton Davies

Whoops, must think before I write...

I think this is *good* for me.

Let's say I have a default sort property I want to have:

If I only do a one-time Index, if I sort the input documents in the 
order I want before indexing, then the results will come out in that 
sort order -- assuming I implement my own hit collector (basically 
just ignore the score...)

  Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




RE: Parallelising a query...

2001-11-29 Thread Winton Davies

Thanks ! (one thing to cross off my list of optimizations...)
 
  Cheers,
   Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




RE: Parallelising a query...

2001-11-29 Thread Doug Cutting

TermDocs are ordered by document number.  It would not be easy to change
this.

Doug

> -Original Message-
> From: Winton Davies [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, November 29, 2001 11:12 AM
> To: Lucene Users List
> Subject: Re: Parallelising a query...
> 
> 
> Hi again
> 
>   Another dumb question :) (actually I'm too busy to look at 
> the code :) )
> 
>In the index, is the datastructure of termDocs (is that the right 
> term), sorted by anything? Or is it just insertion order ? I could 
> see how one might want to sort by the Doc with the highest term 
> frequency  ? But I can also see why
> it might not help.
> 
>   e.g.   Token1 -> doc1 (2) [occurences] -> doc2 (6) -> doc3 (3)
> 
>   or is it like this ?
> 
>  Token1 -> doc2 (6) -> doc3 (3) -> doc1 (2) ?
> 
>  
>   I have an idea for an optimization I want to make, but I'm not sure 
> exactly whether it is warrants investigation.
> 
>   Winton
> 
> 
> Winton Davies
> Lead Engineer, Overture (NSDQ: OVER)
> 1820 Gateway Drive, Suite 360
> San Mateo, CA 94404
> work: (650) 403-2259
> cell: (650) 867-1598
> http://www.overture.com/
> 
> --
> To unsubscribe, e-mail:   
<mailto:[EMAIL PROTECTED]>
For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>

--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Parallelising a query...

2001-11-29 Thread Winton Davies

Hi again

  Another dumb question :) (actually I'm too busy to look at the code :) )

   In the index, is the datastructure of termDocs (is that the right 
term), sorted by anything? Or is it just insertion order ? I could 
see how one might want to sort by the Doc with the highest term 
frequency  ? But I can also see why
it might not help.

  e.g.   Token1 -> doc1 (2) [occurences] -> doc2 (6) -> doc3 (3)

  or is it like this ?

 Token1 -> doc2 (6) -> doc3 (3) -> doc1 (2) ?

 
  I have an idea for an optimization I want to make, but I'm not sure 
exactly whether it is warrants investigation.

  Winton


Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




RE: Parallelising a query...

2001-11-29 Thread Winton Davies

Hi Doug,
 
  Thank you again for your wisdom. Yep, we'll probably be running on a 
quad E420 or two.

  I think that I'll get better performance with one virtual searcher, 
spread over 4 CPU's (and < 1 GB ram each) rather than 4 monolithics 
searchers each with its own index (actually I don't think I could get 
4 on the machine --- the index is nearly a Gig each).

  Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




RE: Parallelising a query...

2001-11-29 Thread Doug Cutting

> From: Winton Davies [mailto:[EMAIL PROTECTED]]
> 
>I have 4 million documents... I could:
>  
>Split these into 4 x 1 million document indexes  and then send a 
> query to 4 Lucene processes ? At the end I would have to sort the 
> results by relevance.
> 
>Question for Doug or any other Search Engine guru -- would this 
> reduce the time to find these results by 75% ?

It could, if you have four processors and four disk drives and things work
out optimally.

If you have a single machine with multiple processors and/or a disk array,
and your CPU or i/o are not already maxed out, then multi-threading is a
good way to make searches faster.  To implement this I would write something
like MultiSearcher, but that runs each sub-search in a separate thread, a
ThreadedMultiSearcher.

If you instead have several machines that you would like to spread search
load over, then you could use RMI to send queries to these machines.  I
would first implement the single-machine version, ThreadedMultiSearcher,
then implement a RemoteSearcher class, that forwards Searcher methods via
RMI to a Searcher object on another machine.  Then to spread load across
machines, construct a ThreadedMultiSearcher and populate it with
RemoteSearcher instances pointing at the different machines.

The Searcher API was designed with this sort of thing in mind.  Note though
that HitCollector-based searching is not a good candidate for RMI, since it
does a callback for every document.  Stick to the TopDocs-based search
method.  You'll also need to forward docFreq(Term) and maxDoc(), used to
weight the query before searching, and doc(int), used to fetch hit
documents.  Probably these should be abstracted into a separate interface,
Searchable.

Doug

--
To unsubscribe, e-mail:   
For additional commands, e-mail: