I'm less concerned with fully utilizing a hadoop cluster (due to having
fewer shards than I have hadoop reduce slots) than I am with just off-loading
the whole indexing process. We may just want to re-index the whole thing to
add some index-time boosts or whatever else we conjure up to make queries
faster and better quality. We're doing a lot of work on optimization right
now.

To re-index the whole thing is a 5-10 hour process for us, so when we move
some update to production that requires full re-indexing (every week or so),
right now we're just re-building new instances of solr to handle the
re-indexing and then copying the final VMs to the production environment
(slow process). I'm leery of letting a heavy duty full re-index process
loose for 10 hours on production on a regular basis.

It doesn't sound like there are any pre-built processes for doing this now,
though. I thought I had heard of a master/slave hierarchy in 3.x that would
allow us to designate a master to do indexing and let the slaves pull
finished indexes from the master, so I thought maybe something like that
carried over into solr cloud. Erick might be right that it's not worth the
effort if there isn't some existing strategy.
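
For reference, the 3.x-style pull I have in mind goes through the
ReplicationHandler's HTTP interface, where a slave is asked to fetch the
latest index with command=fetchindex. Below is a minimal sketch of that
call, not something tested against our setup. The host names, port and
base paths are hypothetical, the helper names are made up for
illustration, and whether this can be combined with solr cloud at all is
exactly the open question.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def fetchindex_url(slave_base, master_base=None):
    """Build the URL that asks a slave to pull the current index.

    slave_base and master_base are Solr base URLs such as
    "http://host:8983/solr" (hypothetical hosts, adjust for your cores).
    """
    params = {"command": "fetchindex"}
    if master_base is not None:
        # Optional override of the master configured on the slave.
        params["masterUrl"] = master_base + "/replication"
    return slave_base + "/replication?" + urlencode(params)


def trigger_pull(slave_base, master_base=None, timeout=30):
    """Send the fetchindex request and return the HTTP status code."""
    url = fetchindex_url(slave_base, master_base)
    with urlopen(url, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical hosts: "indexer" builds the index, "query1" serves users.
    print(fetchindex_url("http://query1:8983/solr", "http://indexer:8983/solr"))
```

The idea would be to run the full re-index on the indexing box and only
call something like trigger_pull() against each query server once the
rebuild has been committed there.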

Dave


-----Original Message-----
From: Furkan KAMACI [mailto:furkankam...@gmail.com] 
Sent: Monday, May 06, 2013 7:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing off of the production servers

Hi Erick;

I think that even if you use Map/Reduce you will not parallelize your
indexing any further, because indexing parallelizes only as far as the
number of leaders you have in your SolrCloud, doesn't it?

2013/5/6 Erick Erickson <erickerick...@gmail.com>

> The only problem with using Hadoop (or whatever) is that you need to 
> be sure that documents end up on the same shard, which means that you 
> have to use the same routing mechanism that SolrCloud uses. The custom 
> doc routing may help here....
>
> My very first question, though, would be whether this is necessary.
> It might be sufficient to just throttle the rate of indexing, or just 
> do the indexing during off hours or.... Have you measured a performance
> degradation during your heavy indexing? Indexing has costs, no
> question, but it's worth asking whether the costs are heavy enough to
> be worth the bother.
>
> Best
> Erick
>
> On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI <furkankam...@gmail.com>
> wrote:
> > 1-2) Your aim in using Hadoop is probably Map/Reduce jobs. When you
> > use Map/Reduce jobs you split your workload, process it, and then
> > the reduce step combines the results. Let me explain the new SolrCloud
> > architecture. You start your SolrCloud with a numShards parameter.
> > Let's assume that you have 5 shards. Then you will have 5 leaders in
> > your SolrCloud. These leaders will be responsible for indexing your
> > data. It means that your indexing workload will be divided into 5, so
> > you have parallelized your indexing much like Map/Reduce jobs.
> >
> > Let's assume that you have added 10 new Solr nodes to your SolrCloud.
> > They will be added as replicas for each shard. Then you will have 5
> > shards, each with a leader and 2 replicas. When you send a query to
> > SolrCloud, every replica will help with searching, and if you add
> > more replicas to your SolrCloud your search performance will improve.
> >
> >
> > 2013/5/6 David Parks <davidpark...@yahoo.com>
> >
> >> I've had trouble figuring out what options exist if I want to perform
> >> all indexing off of the production servers (I'd like to keep them
> >> only for user queries).
> >>
> >>
> >>
> >> We index data in batches roughly daily. Ideally I'd index all solr
> >> cloud shards offline, then move the final index files to the solr
> >> cloud instance that needs them, flip a switch, and have it use the
> >> new index.
> >>
> >>
> >>
> >> Is this possible via either:
> >>
> >> 1. Doing the indexing in Hadoop? (This would be ideal, as we have a
> >> significant investment in a hadoop cluster already), or
> >>
> >> 2. Maintaining a separate "master" server that handles indexing, with
> >> the nodes that receive user queries updating their index from there
> >> (I seem to recall reading about this configuration in 3.x, but now
> >> we're using solr cloud)
> >>
> >>
> >>
> >> Is there some ideal solution I can use to "protect" the production
> >> solr instances from degraded performance during large index
> >> processing periods?
> >>
> >>
> >>
> >> Thanks!
> >>
> >> David
> >>
> >>
>
