Re: Scalability of Lucene indexes

2005-02-19 Thread Bryan McCormick
Our index is currently about 40Gb. 

The advantage of binding a user is that once a search is performed then
caching within lucene and in the application is very effective if
subsequent searches go back to the same box.  Our initial searches are
usually in the sub 100milliS range while subsequent requests for deeper
pages in the search are returned instantly. 

Bryan McCormick

On Sat, 2005-02-19 at 01:24, Andy wrote:
> Hi Bryan,
> 
> How big is your index?
> 
> Also what is the advantage of binding a user to a
> server? 
> 
> Thanks.
> Andy
> 
> --- Bryan McCormick <[EMAIL PROTECTED]> wrote:
> 
> > Hi chris, 
> > 
> > I'm responsible for the webshots.com search index
> > and we've had very
> > good results with lucene. It currently indexes over
> > 100 Million
> > documents and performs 4 Million searches / day. 
> > 
> > We initially tested running multiple small copies
> > and using a
> > MultiSearcher and then merging results as compared
> > to running a very
> > large single index. We actually found that the
> > single large instance
> > performed better. To improve load handling we
> > clustered multiple
> > identical copies together, then session bind a user
> > to particular server
> > and cache the results, but each server is running a
> > single index. 
> > 
> > Bryan McCormick
> > 
> > 
> > On Fri, 2005-02-18 at 08:01, Chris D wrote: 
> > > Hi all, 
> > > 
> > > I have a question about scaling lucene across a
> > cluster, and good ways
> > > of breaking up the work.
> > > 
> > > We have a very large index and searches sometimes
> > take more time than
> > > they're allowed. What we have been doing is during
> > indexing we index
> > > into 256 seperate indexes (depending on the
> > md5sum) then distribute
> > > the indexes to the search machines. So if a
> > machine has 128 indexes it
> > > would have to do 128 searches. I gave
> > parallelMultiSearcher a try and
> > > it was significantly slower than simply iterating
> > through the indexes
> > > one at a time.
> > > 
> > > Our new plan is to somehow have only one index per
> > search machine and
> > > a larger main index stored on the master.
> > > 
> > > What I'm interested to know is whether having one
> > extremely large
> > > index for the master then splitting the index into
> > several smaller
> > > indexes (if this is possible) would be better than
> > having several
> > > smaller indexes and merging them on the search
> > machines into one
> > > index.
> > > 
> > > I would also be interested to know how others have
> > divided up search
> > > work across a cluster.
> > > 
> > > Thanks,
> > > Chris
> > > 
> > >
> >
> -
> > > To unsubscribe, e-mail:
> > [EMAIL PROTECTED]
> > > For additional commands, e-mail:
> > [EMAIL PROTECTED]
> > > 
> > 
> > 
> >
> -
> > To unsubscribe, e-mail:
> > [EMAIL PROTECTED]
> > For additional commands, e-mail:
> > [EMAIL PROTECTED]
> > 
> > 
> 
> 
> __
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection around 
> http://mail.yahoo.com
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Scalability of Lucene indexes

2005-02-19 Thread Praveen Peddi
We are doing the same exacting thing. We didn't test with so many documents. 
The most we tested till now 3 million documents with 3GB file size.
I would be interested in seeing how you maintained replicated indices that r 
in sync. The way we did was, run the indexer on each server independently. I 
the data changes, one server will know the change. That server updates 
lucene index and notifies other servers (using multicast).

Glad to know someone else is doing the similar thing and more happy to know 
that the solution works even for 100 millions documents. I was little 
worried if the index size goes higher and higher but it looks like we should 
not have to worry anymore :)

Thanks
Praveen
- Original Message - 
From: "Bryan McCormick" <[EMAIL PROTECTED]>
To: "Chris D" <[EMAIL PROTECTED]>
Cc: 
Sent: Friday, February 18, 2005 3:45 PM
Subject: Re: Scalability of Lucene indexes


Hi chris,
I'm responsible for the webshots.com search index and we've had very
good results with lucene. It currently indexes over 100 Million
documents and performs 4 Million searches / day.
We initially tested running multiple small copies and using a
MultiSearcher and then merging results as compared to running a very
large single index. We actually found that the single large instance
performed better. To improve load handling we clustered multiple
identical copies together, then session bind a user to particular server
and cache the results, but each server is running a single index.
Bryan McCormick
On Fri, 2005-02-18 at 08:01, Chris D wrote:
Hi all,
I have a question about scaling lucene across a cluster, and good ways
of breaking up the work.
We have a very large index and searches sometimes take more time than
they're allowed. What we have been doing is during indexing we index
into 256 seperate indexes (depending on the md5sum) then distribute
the indexes to the search machines. So if a machine has 128 indexes it
would have to do 128 searches. I gave parallelMultiSearcher a try and
it was significantly slower than simply iterating through the indexes
one at a time.
Our new plan is to somehow have only one index per search machine and
a larger main index stored on the master.
What I'm interested to know is whether having one extremely large
index for the master then splitting the index into several smaller
indexes (if this is possible) would be better than having several
smaller indexes and merging them on the search machines into one
index.
I would also be interested to know how others have divided up search
work across a cluster.
Thanks,
Chris
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Scalability of Lucene indexes

2005-02-19 Thread Andy
Hi Bryan,

How big is your index?

Also what is the advantage of binding a user to a
server? 

Thanks.
Andy

--- Bryan McCormick <[EMAIL PROTECTED]> wrote:

> Hi chris, 
> 
> I'm responsible for the webshots.com search index
> and we've had very
> good results with lucene. It currently indexes over
> 100 Million
> documents and performs 4 Million searches / day. 
> 
> We initially tested running multiple small copies
> and using a
> MultiSearcher and then merging results as compared
> to running a very
> large single index. We actually found that the
> single large instance
> performed better. To improve load handling we
> clustered multiple
> identical copies together, then session bind a user
> to particular server
> and cache the results, but each server is running a
> single index. 
> 
> Bryan McCormick
> 
> 
> On Fri, 2005-02-18 at 08:01, Chris D wrote: 
> > Hi all, 
> > 
> > I have a question about scaling lucene across a
> cluster, and good ways
> > of breaking up the work.
> > 
> > We have a very large index and searches sometimes
> take more time than
> > they're allowed. What we have been doing is during
> indexing we index
> > into 256 seperate indexes (depending on the
> md5sum) then distribute
> > the indexes to the search machines. So if a
> machine has 128 indexes it
> > would have to do 128 searches. I gave
> parallelMultiSearcher a try and
> > it was significantly slower than simply iterating
> through the indexes
> > one at a time.
> > 
> > Our new plan is to somehow have only one index per
> search machine and
> > a larger main index stored on the master.
> > 
> > What I'm interested to know is whether having one
> extremely large
> > index for the master then splitting the index into
> several smaller
> > indexes (if this is possible) would be better than
> having several
> > smaller indexes and merging them on the search
> machines into one
> > index.
> > 
> > I would also be interested to know how others have
> divided up search
> > work across a cluster.
> > 
> > Thanks,
> > Chris
> > 
> >
>
-
> > To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> > For additional commands, e-mail:
> [EMAIL PROTECTED]
> > 
> 
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 
> 


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Scalability of Lucene indexes

2005-02-18 Thread Bryan McCormick
Hi chris, 

I'm responsible for the webshots.com search index and we've had very
good results with lucene. It currently indexes over 100 Million
documents and performs 4 Million searches / day. 

We initially tested running multiple small copies and using a
MultiSearcher and then merging results as compared to running a very
large single index. We actually found that the single large instance
performed better. To improve load handling we clustered multiple
identical copies together, then session bind a user to particular server
and cache the results, but each server is running a single index. 

Bryan McCormick


On Fri, 2005-02-18 at 08:01, Chris D wrote: 
> Hi all, 
> 
> I have a question about scaling lucene across a cluster, and good ways
> of breaking up the work.
> 
> We have a very large index and searches sometimes take more time than
> they're allowed. What we have been doing is during indexing we index
> into 256 seperate indexes (depending on the md5sum) then distribute
> the indexes to the search machines. So if a machine has 128 indexes it
> would have to do 128 searches. I gave parallelMultiSearcher a try and
> it was significantly slower than simply iterating through the indexes
> one at a time.
> 
> Our new plan is to somehow have only one index per search machine and
> a larger main index stored on the master.
> 
> What I'm interested to know is whether having one extremely large
> index for the master then splitting the index into several smaller
> indexes (if this is possible) would be better than having several
> smaller indexes and merging them on the search machines into one
> index.
> 
> I would also be interested to know how others have divided up search
> work across a cluster.
> 
> Thanks,
> Chris
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Scalability of Lucene indexes

2005-02-18 Thread Chris D
Hi all, 

I have a question about scaling lucene across a cluster, and good ways
of breaking up the work.

We have a very large index and searches sometimes take more time than
they're allowed. What we have been doing is during indexing we index
into 256 seperate indexes (depending on the md5sum) then distribute
the indexes to the search machines. So if a machine has 128 indexes it
would have to do 128 searches. I gave parallelMultiSearcher a try and
it was significantly slower than simply iterating through the indexes
one at a time.

Our new plan is to somehow have only one index per search machine and
a larger main index stored on the master.

What I'm interested to know is whether having one extremely large
index for the master then splitting the index into several smaller
indexes (if this is possible) would be better than having several
smaller indexes and merging them on the search machines into one
index.

I would also be interested to know how others have divided up search
work across a cluster.

Thanks,
Chris

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]