Re: Indexing off of the production servers

Furkan KAMACI Mon, 06 May 2013 06:04:15 -0700

Hi Dave;

I think that when you index you can use CloudSolrServer, so that you learn
from ZooKeeper where your data will go and then send your data there. This
will speed up your indexing and give you a benefit similar to Map/Reduce.
Your data will be indexed by the shard leaders while your replicas are
responsible for querying. Also, if you are not satisfied with your query
performance you can add more replicas. If you want to improve your indexing
throughput you can define more shards in your system (beginning with Solr
4.3, shard splitting will be a new feature of Solr).
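
To illustrate the idea of sending each document to the shard that owns it,
here is a minimal Python sketch. It is not SolrJ and it is not Solr's real
routing: the hash (zlib.crc32 with a modulo) and the names shard_for and
group_by_shard are assumptions made up for illustration, while Solr's own
router uses its own hash function and the shard ranges recorded in
ZooKeeper. It only shows how one batch can be split per shard so that each
part can be sent to a single leader.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id, num_shards):
    """Map a document id to a shard index.

    Illustrative only: crc32 modulo num_shards is an assumption,
    not the hash or range lookup that Solr itself performs.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def group_by_shard(docs, num_shards):
    """Split a batch into one list of documents per shard index."""
    batches = defaultdict(list)
    for doc in docs:
        batches[shard_for(doc["id"], num_shards)].append(doc)
    return dict(batches)


# Hypothetical batch of 20 documents spread over 5 shards.
docs = [{"id": "doc-%d" % i} for i in range(20)]
batches = group_by_shard(docs, 5)
for shard, batch in sorted(batches.items()):
    print("shard %d receives %d documents" % (shard, len(batch)))
```

The same id always maps to the same shard, which is why an updated
document replaces its old version instead of landing somewhere else.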


2013/5/6 David Parks <davidpark...@yahoo.com>

> I'm less concerned with fully utilizing a hadoop cluster (due to having
> fewer shards than I have hadoop reduce slots) as I am with just off-loading
> the whole indexing process. We may just want to re-index the whole thing to
> add some index time boosts or whatever else we conjure up to make queries
> faster and better quality. We're doing a lot of work on optimization right
> now.
>
> To re-index the whole thing is a 5-10 hour process for us, so when we move
> some update to production that requires full re-indexing (every week or
> so), right now we're just re-building new instances of solr to handle the
> re-indexing and then copying the final VMs to the production environment
> (slow process). I'm leery of letting a heavy duty full re-index process
> loose for 10 hours on production on a regular basis.
>
> It doesn't sound like there are any pre-built processes for doing this now
> though. I thought I had heard of master/slave hierarchy in 3.x that would
> allow us to designate a master to do indexing and let the slaves pull
> finished indexes from the master, so I thought maybe something like that
> followed into solr cloud. Eric might be right in that it's not worth the
> effort if there isn't some existing strategy.
>
> Dave
>
>
> -----Original Message-----
> From: Furkan KAMACI [mailto:furkankam...@gmail.com]
> Sent: Monday, May 06, 2013 7:06 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing off of the production servers
>
> Hi Erick;
>
> I think that even if you use Map/Reduce you will not parallelize your
> indexing any further, because indexing is only as parallel as the number
> of leaders you have in your SolrCloud, isn't it?
>
> 2013/5/6 Erick Erickson <erickerick...@gmail.com>
>
> > The only problem with using Hadoop (or whatever) is that you need to
> > be sure that documents end up on the same shard, which means that you
> > have to use the same routing mechanism that SolrCloud uses. The custom
> > doc routing may help here....
> >
> > My very first question, though, would be whether this is necessary.
> > It might be sufficient to just throttle the rate of indexing, or just
> > do the indexing during off hours or.... Have you measured an indexing
> > degradation during your heavy indexing? Indexing has costs, no
> > question, but it's worth asking whether the costs are heavy enough to
> > be worth the bother..
> >
> > Best
> > Erick
> >
> > On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI <furkankam...@gmail.com>
> > wrote:
> > > 1-2) Your aim in using Hadoop is probably Map/Reduce jobs. With
> > > Map/Reduce you split your workload, process the pieces, and then the
> > > reduce step combines the results. Let me explain the new SolrCloud
> > > architecture. You start your SolrCloud with a numShards parameter.
> > > Let's assume that you have 5 shards. Then you will have 5 leaders in
> > > your SolrCloud. These leaders will be responsible for indexing your
> > > data. It means that your indexing workload will be divided into 5, so
> > > your indexing is parallelized much like a Map/Reduce job.
> > >
> > > Let's assume that you have added 10 new Solr nodes to your SolrCloud.
> > > They will be added as replicas of each shard. Then you will have 5
> > > shards, 5 leaders, and 2 replicas for every shard. When you send a
> > > query to SolrCloud every replica will help with searching, and if you
> > > add more replicas to your SolrCloud your search performance will
> > > improve.
> > >
> > >
> > > 2013/5/6 David Parks <davidpark...@yahoo.com>
> > >
> > >> I've had trouble figuring out what options exist if I want to
> > >> perform all indexing off of the production servers (I'd like to keep
> > >> them only for user queries).
> > >>
> > >>
> > >>
> > >> We index data in batches roughly daily, ideally I'd index all solr
> > >> cloud shards offline, then move the final index files to the solr
> > >> cloud instance that needs it and flip a switch and have it use the
> > >> new index.
> > >>
> > >>
> > >>
> > >> Is this possible via either:
> > >>
> > >> 1.       Doing the indexing in Hadoop?? (this would be ideal as we
> > >> have a significant investment in a hadoop cluster already), or
> > >>
> > >> 2.       Maintaining a separate "master" server that handles indexing
> > >> and the nodes that receive user queries update their index from there
> > >> (I seem to recall reading about this configuration in 3.x, but now
> > >> we're using solr cloud)
> > >>
> > >>
> > >>
> > >> Is there some ideal solution I can use to "protect" the production
> > >> solr instances from degraded performance during large index
> > >> processing periods?
> > >>
> > >>
> > >>
> > >> Thanks!
> > >>
> > >> David
> > >>
> > >>
> >
>
>
