I'm less concerned with fully utilizing the Hadoop cluster (we have fewer shards than Hadoop reduce slots) than I am with simply off-loading the whole indexing process. We may want to re-index everything to add index-time boosts, or whatever else we come up with to make queries faster and better quality. We're doing a lot of optimization work right now.
A full re-index is a 5-10 hour process for us, so when we move an update to production that requires a full re-index (every week or so), we currently build new Solr instances to handle the re-indexing and then copy the final VMs to the production environment, which is slow. I'm leery of letting a heavy-duty full re-index loose on production for 10 hours on a regular basis. It doesn't sound like there are any pre-built processes for doing this now, though. I thought I had heard of a master/slave hierarchy in 3.x that would let us designate a master to do the indexing and have the slaves pull finished indexes from it, so I thought maybe something like that carried over into SolrCloud. Erick might be right that it's not worth the effort if there isn't an existing strategy.

Dave

-----Original Message-----
From: Furkan KAMACI [mailto:furkankam...@gmail.com]
Sent: Monday, May 06, 2013 7:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing off of the production servers

Hi Erick;

I think that even if you use Map/Reduce you will not parallelize your indexing, because indexing will only parallelize as far as the number of leaders you have in your SolrCloud, isn't it?

2013/5/6 Erick Erickson <erickerick...@gmail.com>

> The only problem with using Hadoop (or whatever) is that you need to
> be sure that documents end up on the same shard, which means that you
> have to use the same routing mechanism that SolrCloud uses. The custom
> doc routing may help here....
>
> My very first question, though, would be whether this is necessary.
> It might be sufficient to just throttle the rate of indexing, or just
> do the indexing during off hours or.... Have you measured an indexing
> degradation during your heavy indexing? Indexing has costs, no
> question, but it's worth asking whether the costs are heavy enough to
> be worth the bother.
>
> Best
> Erick
>
> On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> > 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you
> > use Map/Reduce jobs you split your workload, process it, and then
> > the reduce step takes it into account. Let me explain the new
> > SolrCloud architecture. You start your SolrCloud with a numShards
> > parameter. Let's assume that you have 5 shards. Then you will have
> > 5 leaders in your SolrCloud. These leaders will be responsible for
> > indexing your data. It means that your indexing workload will be
> > divided into 5, so you have parallelized your data much like
> > Map/Reduce jobs.
> >
> > Let's assume that you have added 10 new Solr nodes into your
> > SolrCloud. They will be added as replicas for each shard. Then you
> > will have 5 shards, 5 leaders of them, and every shard has 2
> > replicas. When you send a query into a SolrCloud every replica will
> > help you with searching, and if you add more replicas to your
> > SolrCloud your search performance will improve.
> >
> > 2013/5/6 David Parks <davidpark...@yahoo.com>
> >
> >> I've had trouble figuring out what options exist if I want to
> >> perform all indexing off of the production servers (I'd like to
> >> keep them only for user queries).
> >>
> >> We index data in batches roughly daily, ideally I'd index all solr
> >> cloud shards offline, then move the final index files to the solr
> >> cloud instance that needs it and flip a switch and have it use the
> >> new index.
> >>
> >> Is this possible via either:
> >>
> >> 1. Doing the indexing in Hadoop?? (this would be ideal as we have
> >>    a significant investment in a hadoop cluster already), or
> >>
> >> 2. Maintaining a separate "master" server that handles indexing and
> >>    the nodes that receive user queries update their index from
> >>    there (I seem to recall reading about this configuration in 3.x,
> >>    but now we're using solr cloud)
> >>
> >> Is there some ideal solution I can use to "protect" the production
> >> solr instances from degraded performance during large index
> >> processing periods?
> >>
> >> Thanks!
> >>
> >> David
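
A footnote on Erick's routing point in the thread above: if the shards are built offline (in Hadoop or anywhere else), each document has to land in the same shard SolrCloud itself would pick for it. The Python sketch below only illustrates the general idea of hash-range routing, under stated assumptions. The helper names `shard_ranges`, `shard_for`, and `partition` are made up for this example, the even split of a 32-bit hash space is an assumption, and `zlib.crc32` is a stand-in hash, not the one SolrCloud uses. A real job would have to reuse Solr's own hashing and the shard ranges the cluster actually uses.

```python
import zlib
from collections import defaultdict

HASH_SPACE = 2 ** 32  # size of a 32-bit hash space (assumed for this sketch)


def shard_ranges(num_shards):
    """Split the hash space into num_shards contiguous [start, end) ranges."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE if i == num_shards - 1 else start + step
        ranges.append((start, end))
    return ranges


def shard_for(doc_id, ranges):
    """Return the index of the shard whose range contains the hash of doc_id."""
    # Placeholder hash: crc32 is NOT SolrCloud's hash function.
    h = zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF
    for index, (start, end) in enumerate(ranges):
        if start <= h < end:
            return index
    raise ValueError("hash value fell outside every shard range")


def partition(doc_ids, num_shards):
    """Group document ids by the shard index they hash to."""
    ranges = shard_ranges(num_shards)
    buckets = defaultdict(list)
    for doc_id in doc_ids:
        buckets[shard_for(doc_id, ranges)].append(doc_id)
    return dict(buckets)


if __name__ == "__main__":
    ids = ["doc-%d" % i for i in range(20)]
    for shard, members in sorted(partition(ids, 5).items()):
        print(shard, members)
```

In a Map/Reduce job the shard index would be a natural partition key, so that each reducer builds the index for exactly one shard. That is also why having more reduce slots than shards does not buy extra parallelism with this layout.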