RE: ParallellMultiSearcher Vs. One big Index

2005-01-18 Thread Ryan Aslett

The test system is not currently multithreaded, i.e. the queries are
executed serially.
That explains why the multi-term queries were slower on the single
index: they used only one thread, whereas the ParallelMultiSearcher
uses many.
I had plenty of CPU to spare on the multi-term, single-index test. So if
I were to make my querier multithreaded, the fastest index configuration
would ideally be one big index?

Thank you for your help!
Ryan
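
A multithreaded querier of the kind described above could be sketched roughly as follows. This is only an illustration under assumptions: the class name QueryPounder, the method names, and the use of a plain Runnable as a stand-in for one search call are all hypothetical, and no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical load driver: runs the same query task from several threads. */
final class QueryPounder {

    /**
     * Runs queriesPerThread queries on each of `threads` worker threads and
     * returns the total number of queries completed. `query` is a stand-in
     * for one search call (assumption: it is safe to call concurrently).
     */
    static long runAll(Runnable query, int threads, int queriesPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                Callable<Long> worker = () -> {
                    long done = 0;
                    for (int i = 0; i < queriesPerThread; i++) {
                        query.run();
                        done++;
                    }
                    return done;
                };
                results.add(pool.submit(worker));
            }
            long total = 0;
            for (Future<Long> f : results) {
                total += f.get(); // blocks until that worker has finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Converts a query count and elapsed time into queries per second. */
    static double queriesPerSecond(long queries, long elapsedNanos) {
        return queries / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        Runnable fakeQuery = () -> Math.sqrt(12345.0); // placeholder work, not a real search
        long start = System.nanoTime();
        long n = runAll(fakeQuery, 2, 100000);
        long elapsed = System.nanoTime() - start;
        System.out.println(n + " queries, " + queriesPerSecond(n, elapsed) + " q/s");
    }
}
```

Running the same harness with the thread count set to 1 and then to the number of processors would show whether the single index is limited by serial execution rather than by the index layout.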

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, January 18, 2005 11:32 AM
To: Lucene Users List
Subject: Re: ParallellMultiSearcher Vs. One big Index

Ryan Aslett wrote:
> What I found was that for queries with one term (First Name), the large
> index beat the multiple indexes hands down (280 queries per second vs
> 170 Q/s).
> But for queries with multiple terms (Address), the multiple indexes beat
> out the large index (26 Q/s vs 16 Q/s).
> Btw, I'm running these on a 2-processor box with 16GB of RAM.
> 
> So what I'm trying to determine is whether there is some equation out
> there that can help me find the sweet spot for splitting my indexes.

What appears to be the bottleneck, CPU or i/o?  Is your test system 
multi-threaded?  I.e., is it attempting to execute many queries in 
parallel?  If you're CPU-bound then a single index should be fastest. 
Are you using compound format?  If you're i/o-bound, the non-compound 
format may be somewhat faster, as it permits more parallel i/o.  Is the 
index data on multiple drives?  If you're i/o bound then it should be 
faster to use multiple drives.  To permit even more parallel i/o over 
multiple drives you might consider using a pool of IndexReaders.  That 
way, with, e.g., striped data, each could be simultaneously reading 
different portions of the same file.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: ParallellMultiSearcher Vs. One big Index

2005-01-18 Thread Doug Cutting

Ryan Aslett wrote:
> What I found was that for queries with one term (First Name), the large
> index beat the multiple indexes hands down (280 queries per second vs
> 170 Q/s).
> But for queries with multiple terms (Address), the multiple indexes beat
> out the large index (26 Q/s vs 16 Q/s).
> Btw, I'm running these on a 2-processor box with 16GB of RAM.
>
> So what I'm trying to determine is whether there is some equation out
> there that can help me find the sweet spot for splitting my indexes.

What appears to be the bottleneck, CPU or i/o?  Is your test system 
multi-threaded?  I.e., is it attempting to execute many queries in 
parallel?  If you're CPU-bound then a single index should be fastest. 
Are you using compound format?  If you're i/o-bound, the non-compound 
format may be somewhat faster, as it permits more parallel i/o.  Is the 
index data on multiple drives?  If you're i/o bound then it should be 
faster to use multiple drives.  To permit even more parallel i/o over 
multiple drives you might consider using a pool of IndexReaders.  That 
way, with, e.g., striped data, each could be simultaneously reading 
different portions of the same file.

Doug
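
The pool of readers suggested above could look roughly like the sketch below. It is a generic fixed-size pool, not Lucene code: the class name ReaderPool, the type parameter R (standing in for an IndexReader or a searcher built on one), and the withReader method are all hypothetical names chosen for illustration.

```java
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

/** Hypothetical fixed-size pool; R stands in for an index reader or searcher. */
final class ReaderPool<R> {
    private final BlockingQueue<R> idle;

    /** The collection must be non-empty; each element is one pre-opened reader. */
    ReaderPool(Collection<R> readers) {
        idle = new ArrayBlockingQueue<R>(readers.size(), false, readers);
    }

    /** Borrows a reader, applies the work to it, and always gives it back. */
    <T> T withReader(Function<R, T> work) {
        R reader;
        try {
            reader = idle.take(); // blocks while every reader is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a reader", e);
        }
        try {
            return work.apply(reader);
        } finally {
            idle.add(reader); // capacity equals the pool size, so this cannot overflow
        }
    }

    /** Number of readers currently not borrowed. */
    int available() {
        return idle.size();
    }
}
```

Each query thread would call withReader with its search logic, so that at most as many searches run at once as there are readers, each going through its own reader.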

ParallellMultiSearcher Vs. One big Index

2005-01-18 Thread Ryan Aslett

 
Okay, so I'm trying to find the sweet spot on how many index segments I
should have.

I have 47 million records of contact data (Name + Address). I used 7
machines to build indexes that resulted in the following spread of
individual indexes:

1503000
1500000
1497000
5604750
5379750
1437000
1458000
1446000
1422000
1425000
1425000
1404000
1413000
1404000
4893750
4689750
4519500
4497750
46919250 Total Records
(The faster machines built the bigger indexes)
I also joined all these indexes together into one large 47 million
record index, and ran my query pounder against both data sets, one using
the ParallelMultiSearcher for the multiple indexes, and one using a
normal IndexSearcher against the large index.
What I found was that for queries with one term (First Name), the large
index beat the multiple indexes hands down (280 queries per second vs
170 Q/s).
But for queries with multiple terms (Address), the multiple indexes beat
out the large index (26 Q/s vs 16 Q/s).
Btw, I'm running these on a 2-processor box with 16GB of RAM.

So what I'm trying to determine is whether there is some equation out
there that can help me find the sweet spot for splitting my indexes.
Most queries are going to be multi-term, and the big O of the
single-term search appears to be log n. (I verified with 470 million
records: the single-term search returns at 140 qps, consistent with what
I believe about search algorithms.)  The equation that I'm missing is
the big O for the union of the result sets that match particular terms.
I'm assuming (I haven't looked at the source yet) that Lucene finds all
the documents that match the first term, and all the documents that
match each subsequent term, and then finds the union between all the
sets. Is this correct?  Does anybody have any ideas on how to iron out
an equation for this?

Ryan
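
To make the cost of combining per-term result sets concrete, here is a minimal sketch. It assumes each term maps to an ascending array of document ids; that representation, the class name PostingsSketch, and both method names are assumptions for illustration, and this is not Lucene's actual implementation. Under that assumption, both the intersection (all terms required) and the union (any term) take time linear in the combined list lengths.

```java
import java.util.Arrays;

/** Hypothetical illustration of combining two sorted doc-id lists. */
final class PostingsSketch {

    /** Doc ids present in both ascending arrays; O(a.length + b.length). */
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                out[n++] = a[i];
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** Doc ids present in either ascending array; O(a.length + b.length). */
    static int[] union(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                out[n++] = a[i++];
            } else if (a[i] > b[j]) {
                out[n++] = b[j++];
            } else {
                out[n++] = a[i]; // same id in both lists, emit once
                i++;
                j++;
            }
        }
        while (i < a.length) out[n++] = a[i++];
        while (j < b.length) out[n++] = b[j++];
        return Arrays.copyOf(out, n);
    }
}
```

With this model, a multi-term query costs roughly the sum of the lengths of the lists for its terms, so common terms dominate the running time far more than the total index size does.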
