Re: SolrCloud - distributed architecture considerations
On 10/14/2012 11:16 AM, Erick Erickson wrote:
> No, that's not what I'm thinking at all. There would be _no_ replication configured. You'd just have two completely independent installations, one in each of your separate locations. The only communication path would be that somehow the original documents would need to get to both locations for indexing.

When I was on 1.4.1, I had replication set up. Because replicating between 3.2.0 and 1.4.1 was not possible due to the javabin update, I changed over to this exact model, even though the servers are right next to each other in the racks. Now I am on 3.5.0 and testing an update to branch_4x.

At this time I have no plans to change my distributed setup to SolrCloud. One day I might go to two separate single-stranded SolrCloud setups in order to simplify my indexing code. Our query volume is not high enough to require more than one online server chain. The only reason I have two chains is for high availability.

You might wonder why I would not take advantage of SolrCloud's automated replication. I have simply found too much value in having two independently updated copies of my distributed index. I wrote my indexing code such that it can actually update/reindex any arbitrary number of completely independent index chains.

When we want to make changes to our config/schema, I have a dev server where I can do almost all of the testing required, but that server is not big enough to hold the entire index. Because I have independent production indexes, I can make the proposed changes on the B chain, reindex, and test against the full index in a staging environment without affecting the actual production site. When it is time to roll changes into production, Solr's built-in enable/disable lets me switch the load balancer back and forth between the two indexes with one click. If it works, I can then update chain A the same way.

Thanks,
Shawn
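The idea of indexing code that updates any number of independent chains can be sketched roughly as follows. This is only an illustration, not Shawn's actual code: the chain names, the URLs, and the `send` callback (which would wrap whatever HTTP client posts documents to Solr) are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical chain names and update URLs; none of these come from the thread.
CHAINS: Dict[str, str] = {
    "A": "http://chain-a.example.com:8983/solr/update",
    "B": "http://chain-b.example.com:8983/solr/update",
}


def index_to_all_chains(
    docs: Iterable[dict],
    chains: Dict[str, str],
    send: Callable[[str, List[dict]], None],
) -> Dict[str, bool]:
    """Send the same batch to every chain and report success per chain."""
    batch = list(docs)  # materialise once so every chain sees the same documents
    results: Dict[str, bool] = {}
    for name, url in chains.items():
        try:
            send(url, batch)
            results[name] = True
        except Exception:
            # Chains are independent: a failure on one must not block the others.
            results[name] = False
    return results
```

Because no chain knows about any other, a failed chain can be reindexed later on its own, which is what makes the "change B, test, then change A" workflow possible.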
Re: SolrCloud - distributed architecture considerations
No, that's not what I'm thinking at all. There would be _no_ replication configured. You'd just have two completely independent installations, one in each of your separate locations. The only communication path would be that somehow the original documents would need to get to both locations for indexing. As far as each separate location is concerned, any others don't exist.

I suppose you could configure old-style replication, but then the remote data center would lag and you'd lose the NRT goodness etc. But that wouldn't be my first choice...

As for your question about roles, it's hard to see how that fits. What are you envisioning would be different about the roles? SolrCloud is based upon the raw documents getting to each node and being indexed there. There's no notion of indexing the doc on one node and then moving the _results_ of the indexing process to the other replicas unless you eschew SolrCloud and just configure it as you would a 3.x system.

Currently there are "leader" and "replica" roles, but all a leader does during indexing is route incoming documents appropriately (ones that belong in another shard are sent to the leader of that shard, and ones that belong in its shard to its replicas).

Best
Erick

On Sun, Oct 14, 2012 at 12:19 PM, AlexeyK wrote:
> In other words, I would have to apply a mixture of modes: SolrCloud for each
> location + old-style replication for mirroring.
> BTW, I've seen a notion of 'role' in node cloud state. Is it in use or is
> there for future extensions? Having 'indexer' and 'searcher' roles backed by
> the infrastructure would help in certain scenarios I think.
>
> Alexey
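The routing decision described above can be sketched as a toy model. This is not Solr's implementation: the CRC32-modulo hash, the function names, and the node labels are assumptions chosen only to make the two cases concrete.

```python
import zlib
from typing import Dict, List, Tuple


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard (CRC32 is a stand-in hash for this sketch)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def route(
    doc_id: str,
    receiving_shard: int,
    num_shards: int,
    replicas: Dict[int, List[str]],
) -> Tuple[str, List[str]]:
    """Return what the leader of `receiving_shard` does with an incoming document."""
    owner = shard_for(doc_id, num_shards)
    if owner == receiving_shard:
        # The document belongs here: pass the raw document to this shard's replicas.
        return ("forward_to_replicas", replicas[owner])
    # The document belongs elsewhere: hand it to the leader of the owning shard.
    return ("forward_to_leader", [f"leader-of-shard-{owner}"])
```

In both cases it is the raw document that travels between nodes, never the result of indexing it, which is why every hop across a slow link is paid on each update.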
Re: SolrCloud - distributed architecture considerations
In other words, I would have to apply a mixture of modes: SolrCloud for each location + old-style replication for mirroring.

BTW, I've seen a notion of 'role' in node cloud state. Is it in use, or is it there for future extensions? Having 'indexer' and 'searcher' roles backed by the infrastructure would help in certain scenarios, I think.

Alexey

--
View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-distributed-architecture-considerations-tp4013594p4013616.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud - distributed architecture considerations
First, remember that SolrCloud is relatively new; operational issues like this will doubtless accrue "folk wisdom" as we all gain experience...

But my current thinking is that the remote installations are essentially completely separate installations with no knowledge of each other. Your indexing process needs to be smart enough to send _one_ update request to each cluster. Or perhaps each cluster has its own indexing process. The point is that the nodes in the remote installations don't know about each other, so they don't incur the communications lag.

Because I'm pretty sure you're exactly right: the supposition is that the pipe connecting the remote installations is slow/expensive/whatever, and the chatter that'll go back and forth if they are aware of each other will be too expensive to really be practical.

YMMV, of course. I can imagine low-frequency updates where at least the indexing process wouldn't care, but then you'd still have the querying crossing the expensive pipe...

As you can tell, I don't have a tried-and-tested answer...

Best
Erick

On Sun, Oct 14, 2012 at 5:41 AM, AlexeyK wrote:
> Hi,
> As far as I understand, SolrCloud eliminates the master-slave specifics, and
> automates both update and search seamlessly.
> What should I take into account configuring SolrCloud for a large customer
> with multiple physical locations?
> I mean, for older Solr I would define master 'close to the data' with batch
> replication to the search server (slave). I would have several such slaves
> for different geographical locations as well.
> How can I ensure (if at all) that search queries do not cross geographical
> boundaries? As far as I understand, SolrCloud routes to any arbitrary active
> replica.
> How can I control the indexing process so that the update request is routed
> to the closest server? If SolrCloud accidentally elects some remote replica
> as a current leader, the indexing process will deteriorate due to networking
> issues; moreover, the update requests will be also bounced back across the
> network as a part of the online replication process.
>
> Do I miss something fundamental in my assumptions/understanding of SolrCloud
> features?
>
> Thanks a lot,
>
> Alexey
SolrCloud - distributed architecture considerations
Hi,

As far as I understand, SolrCloud eliminates the master-slave specifics, and automates both update and search seamlessly. What should I take into account when configuring SolrCloud for a large customer with multiple physical locations?

I mean, for older Solr I would define a master 'close to the data' with batch replication to the search server (slave). I would have several such slaves for different geographical locations as well.

How can I ensure (if at all) that search queries do not cross geographical boundaries? As far as I understand, SolrCloud routes to any arbitrary active replica.

How can I control the indexing process so that the update request is routed to the closest server? If SolrCloud accidentally elects some remote replica as the current leader, the indexing process will deteriorate due to networking issues; moreover, the update requests will also be bounced back across the network as part of the online replication process.

Am I missing something fundamental in my assumptions/understanding of SolrCloud features?

Thanks a lot,

Alexey

--
View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-distributed-architecture-considerations-tp4013594.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: distributed architecture
Hi,

thanks all, this has been very instructive. It looks like in the short term using a combination of replication and sharding, based on Upayavira's setup, might be the safest thing to do, while in the longer term following the ZooKeeper integration and Solandra development might provide a more dynamic environment and perhaps easier setup.

Please keep the good suggestions coming if you feel like it.

thanks again,
Luca

On Dec 1, 2010, at 4:17 AM, Peter Karich wrote:
> Hi,
>
> also take a look at solandra:
>
> https://github.com/tjake/Lucandra/tree/solandra
>
> I don't have it in prod yet but regarding administration overhead it
> looks very promising.
> And you'll get some other neat features like (soft) real time, for free.
> So its same like A) + C) + X) - Y) ;-)
>
> Regards,
> Peter.
>
>> Hi,
>> I'd like to know if anybody has suggestions/opinions on what is
>> currently the best architecture for a distributed search system using Solr.
>> The use case is that of a system composed
>> of N indexes, each hosted on a separate machine, each index containing
>> unique content.
>>
>> Options that I know of are:
>>
>> A) Using Solr distributed search
>> B) Using Solr + Zookeeper integration
>> C) Using replication, i.e. each node replicates all the others
>>
>> It seems like options A) and B) would suffer from a fault-tolerance
>> standpoint: if any of the nodes goes down, the search won't -at this time-
>> return partial results, but instead report an exception.
>> Option C) would provide fault tolerance, at least for any search initiated
>> at a node that is available, but would incur into a large replication
>> overhead.
>>
>> Did I get any of the above wrong, or does somebody have some insight on what
>> is the best system architecture for this use case ?
>>
>> thanks in advance,
>> Luca
>
> --
> http://jetwick.com twitter search prototype
Re: distributed architecture
Hi,

also take a look at Solandra:

https://github.com/tjake/Lucandra/tree/solandra

I don't have it in prod yet, but regarding administration overhead it looks very promising. And you'll get some other neat features, like (soft) real time, for free. So it's the same as A) + C) + X) - Y) ;-)

Regards,
Peter.

> Hi,
> I'd like to know if anybody has suggestions/opinions on what is currently
> the best architecture for a distributed search system using Solr. The use
> case is that of a system composed of N indexes, each hosted on a separate
> machine, each index containing unique content.
>
> Options that I know of are:
>
> A) Using Solr distributed search
> B) Using Solr + Zookeeper integration
> C) Using replication, i.e. each node replicates all the others
>
> It seems like options A) and B) would suffer from a fault-tolerance
> standpoint: if any of the nodes goes down, the search won't -at this time-
> return partial results, but instead report an exception.
> Option C) would provide fault tolerance, at least for any search initiated
> at a node that is available, but would incur into a large replication
> overhead.
>
> Did I get any of the above wrong, or does somebody have some insight on what
> is the best system architecture for this use case ?
>
> thanks in advance,
> Luca

--
http://jetwick.com twitter search prototype
Re: distributed architecture
On Tue, 30 Nov 2010 23:11 -0800, "Dennis Gearon" wrote:
> Wow, would you put a diagram somewhere up on the Solr site?
> Or, here, and I will put it somewhere there.

I'll see what I can do to make a diagram.

> And, what is a VIP?

Virtual IP. It is what a load balancer uses. You assign a 'virtual IP' to your load balancer, and it is responsible for forwarding traffic to that IP to one of the hosts in that particular pool.

Upayavira
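As a rough illustration of what a VIP does, here is a toy model of one address fronting a pool of hosts. The class name, the addresses, and the round-robin policy are assumptions for the example; a real load balancer may use other balancing policies and will normally health-check the pool as well.

```python
from itertools import cycle
from typing import List


class VirtualIP:
    """One client-facing address that hands each request to a host in its pool."""

    def __init__(self, address: str, pool: List[str]) -> None:
        if not pool:
            raise ValueError("a VIP needs at least one host in its pool")
        self.address = address
        self._next_host = cycle(pool)  # round-robin is an assumption of this sketch

    def forward(self) -> str:
        """Pick the pool member that receives the next request sent to the VIP."""
        return next(self._next_host)
```

Clients only ever see `address`; which pool member answers is the load balancer's business, so hosts can be added to or removed from the pool without clients changing anything.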
RE: distributed architecture
Okay, I'll see what I can do. Also for what it is worth, if anyone is in London tomorrow, I'm giving a presentation which covers this topic at the (free) Online Information 2010 exhibition at Kensington Olympia, at 3:20pm. Anyone interested is welcome to come along. I believe we're hoping to video it, so if successful, I expect it'll get put online somewhere. Upayavira On Wed, 01 Dec 2010 03:44 +, "Jayant Das" wrote: > > Hi, A diagram will be very much appreciated. > Thanks, > Jayant > > > From: u...@odoko.co.uk > > To: solr-user@lucene.apache.org > > Subject: Re: distributed architecture > > Date: Wed, 1 Dec 2010 00:39:40 + > > > > I cannot say how mature the code for B) is, but it is not yet included > > in a release. > > > > If you want the ability to distribute content across multiple nodes (due > > to volume) and want resilience, then use both. > > > > I've had one setup where we have two master servers, each with four > > cores. Then we have two pairs of slaves. Each pair mirrors the masters, > > so we have two hosts covering each of our cores. > > > > Then comes the complicated bit to explain... > > > > Each of these four slave hosts had a core that was configured with a > > hardwired "shards" request parameter, which pointed to each of our > > shards. Actually, it pointed to VIPs on a load balancer. Those two VIPs > > then balanced across each of our pair of hosts. > > > > Then, put all four of these servers behind another VIP, and we had a > > single address we could push requests to, for sharded, and resilient > > search. > > > > Now if that doesn't make any sense, let me know and I'll have another go > > at explaining it (or even attempt a diagram). > > > > Upayavira > > > > On Tue, 30 Nov 2010 13:27 -0800, "Cinquini, Luca (3880)" > > wrote: > > > Hi, > > > I'd like to know if anybody has suggestions/opinions on what is currently > > > the best architecture for a distributed search system using Solr. 
The use > > > case is that of a system composed > > > of N indexes, each hosted on a separate machine, each index containing > > > unique content. > > > > > > Options that I know of are: > > > > > > A) Using Solr distributed search > > > B) Using Solr + Zookeeper integration > > > C) Using replication, i.e. each node replicates all the others > > > > > > It seems like options A) and B) would suffer from a fault-tolerance > > > standpoint: if any of the nodes goes down, the search won't -at this > > > time- return partial results, but instead report an exception. > > > Option C) would provide fault tolerance, at least for any search > > > initiated at a node that is available, but would incur into a large > > > replication overhead. > > > > > > Did I get any of the above wrong, or does somebody have some insight on > > > what is the best system architecture for this use case ? > > > > > > thanks in advance, > > > Luca >
Re: distributed architecture
Wow, would you put a diagram somewhere up on the Solr site? Or, here, and I will put it somewhere there. And, what is a VIP? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Upayavira To: "solr-user@lucene.apache.org" Sent: Tue, November 30, 2010 4:39:40 PM Subject: Re: distributed architecture I cannot say how mature the code for B) is, but it is not yet included in a release. If you want the ability to distribute content across multiple nodes (due to volume) and want resilience, then use both. I've had one setup where we have two master servers, each with four cores. Then we have two pairs of slaves. Each pair mirrors the masters, so we have two hosts covering each of our cores. Then comes the complicated bit to explain... Each of these four slave hosts had a core that was configured with a hardwired "shards" request parameter, which pointed to each of our shards. Actually, it pointed to VIPs on a load balancer. Those two VIPs then balanced across each of our pair of hosts. Then, put all four of these servers behind another VIP, and we had a single address we could push requests to, for sharded, and resilient search. Now if that doesn't make any sense, let me know and I'll have another go at explaining it (or even attempt a diagram). Upayavira On Tue, 30 Nov 2010 13:27 -0800, "Cinquini, Luca (3880)" wrote: > Hi, > I'd like to know if anybody has suggestions/opinions on what is currently >the best architecture for a distributed search system using Solr. The use case >is that of a system composed > of N indexes, each hosted on a separate machine, each index containing > unique content. 
> > Options that I know of are: > > A) Using Solr distributed search > B) Using Solr + Zookeeper integration > C) Using replication, i.e. each node replicates all the others > > It seems like options A) and B) would suffer from a fault-tolerance > standpoint: if any of the nodes goes down, the search won't -at this > time- return partial results, but instead report an exception. > Option C) would provide fault tolerance, at least for any search > initiated at a node that is available, but would incur into a large > replication overhead. > > Did I get any of the above wrong, or does somebody have some insight on > what is the best system architecture for this use case ? > > thanks in advance, > Luca
RE: distributed architecture
Hi, A diagram will be very much appreciated. Thanks, Jayant > From: u...@odoko.co.uk > To: solr-user@lucene.apache.org > Subject: Re: distributed architecture > Date: Wed, 1 Dec 2010 00:39:40 + > > I cannot say how mature the code for B) is, but it is not yet included > in a release. > > If you want the ability to distribute content across multiple nodes (due > to volume) and want resilience, then use both. > > I've had one setup where we have two master servers, each with four > cores. Then we have two pairs of slaves. Each pair mirrors the masters, > so we have two hosts covering each of our cores. > > Then comes the complicated bit to explain... > > Each of these four slave hosts had a core that was configured with a > hardwired "shards" request parameter, which pointed to each of our > shards. Actually, it pointed to VIPs on a load balancer. Those two VIPs > then balanced across each of our pair of hosts. > > Then, put all four of these servers behind another VIP, and we had a > single address we could push requests to, for sharded, and resilient > search. > > Now if that doesn't make any sense, let me know and I'll have another go > at explaining it (or even attempt a diagram). > > Upayavira > > On Tue, 30 Nov 2010 13:27 -0800, "Cinquini, Luca (3880)" > wrote: > > Hi, > > I'd like to know if anybody has suggestions/opinions on what is currently > > the best architecture for a distributed search system using Solr. The use > > case is that of a system composed > > of N indexes, each hosted on a separate machine, each index containing > > unique content. > > > > Options that I know of are: > > > > A) Using Solr distributed search > > B) Using Solr + Zookeeper integration > > C) Using replication, i.e. each node replicates all the others > > > > It seems like options A) and B) would suffer from a fault-tolerance > > standpoint: if any of the nodes goes down, the search won't -at this > > time- return partial results, but instead report an exception. 
> > Option C) would provide fault tolerance, at least for any search > > initiated at a node that is available, but would incur into a large > > replication overhead. > > > > Did I get any of the above wrong, or does somebody have some insight on > > what is the best system architecture for this use case ? > > > > thanks in advance, > > Luca
Re: distributed architecture
On 11/30/2010 2:27 PM, Cinquini, Luca (3880) wrote:
> Hi,
> I'd like to know if anybody has suggestions/opinions on what is currently
> the best architecture for a distributed search system using Solr. The use
> case is that of a system composed of N indexes, each hosted on a separate
> machine, each index containing unique content.
>
> Options that I know of are:
>
> A) Using Solr distributed search
> B) Using Solr + Zookeeper integration
> C) Using replication, i.e. each node replicates all the others
>
> It seems like options A) and B) would suffer from a fault-tolerance
> standpoint: if any of the nodes goes down, the search won't -at this time-
> return partial results, but instead report an exception.
> Option C) would provide fault tolerance, at least for any search initiated
> at a node that is available, but would incur into a large replication
> overhead.

Exactly what will work best for you is highly dependent on your specific requirements. Answers to the following questions will influence the choice. If you choose distributed, they will also affect how you distribute the data among the different machines:

* How many documents?
* How much disk space will the index take up?
* Do you need uniform IDF across the entire corpus?
* How often do you need to insert new content?
* How often do you need to delete old content?
* What query volume do you have to support?

For my index, I use distributed+replicated, for redundancy. Statistics:

* Low query volume.
* 49 million documents
* Total index size: 87GB (using 1024, not 1000)
* Adding about 40-5 new documents every day
* Inserts done every two minutes.
* Deletes done every ten minutes.
* Six shards with static content. Each one is 14GB and 8+ million documents.
* One shard with content less than a week old. It is usually less than 1GB in size.

The entire system consists of 18 virtual machines on four physical hosts. There is a master VM and a slave VM for each shard. The static shard VMs have 9GB of RAM; the recent content shard has 3GB of RAM. Two VMs with 1.5GB of RAM are Solr instances that do not have indexes; they are used as search brokers. Two VMs are load balancers, running heartbeat and HAProxy. The load balancers present the two search brokers as a single IP address. Each of the VMs in a pair is running on a different physical host.

By low query volume, I mean that the average queries per second is less than 1, but this is over a long time period, so it includes nights and weekends when there is very little traffic. Even during heavy times, I would estimate that the queries per second is still single digit. Over the last 11 days and 21 hours, the website made 320,000 queries. In the same time period, there were 340,000 load balancer health-check queries, at a rate of two every five seconds.

Shawn
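The figures in this message can be cross-checked with a little arithmetic. The numbers below are the ones quoted above; the variable names are just labels for this sketch.

```python
# Figures quoted in the message above.
STATIC_SHARDS = 6
RECENT_SHARDS = 1
VMS_PER_SHARD = 2        # one master VM and one slave VM per shard
BROKER_VMS = 2           # index-less Solr instances used as search brokers
LOAD_BALANCER_VMS = 2    # heartbeat + HAProxy

total_vms = (
    (STATIC_SHARDS + RECENT_SHARDS) * VMS_PER_SHARD
    + BROKER_VMS
    + LOAD_BALANCER_VMS
)

# Six static shards at 14GB each account for most of the 87GB total.
static_index_gb = STATIC_SHARDS * 14

# 11 days and 21 hours of website traffic, 320,000 queries.
window_seconds = (11 * 24 + 21) * 3600
avg_qps = 320_000 / window_seconds

print(total_vms, static_index_gb, window_seconds, round(avg_qps, 2))
```

This gives 18 VMs, matching the stated total, 84GB for the static shards, and an average of roughly 0.31 website queries per second, in line with "less than 1".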
Re: distributed architecture
I cannot say how mature the code for B) is, but it is not yet included in a release. If you want the ability to distribute content across multiple nodes (due to volume) and want resilience, then use both. I've had one setup where we have two master servers, each with four cores. Then we have two pairs of slaves. Each pair mirrors the masters, so we have two hosts covering each of our cores. Then comes the complicated bit to explain... Each of these four slave hosts had a core that was configured with a hardwired "shards" request parameter, which pointed to each of our shards. Actually, it pointed to VIPs on a load balancer. Those two VIPs then balanced across each of our pair of hosts. Then, put all four of these servers behind another VIP, and we had a single address we could push requests to, for sharded, and resilient search. Now if that doesn't make any sense, let me know and I'll have another go at explaining it (or even attempt a diagram). Upayavira On Tue, 30 Nov 2010 13:27 -0800, "Cinquini, Luca (3880)" wrote: > Hi, > I'd like to know if anybody has suggestions/opinions on what is > currently the best architecture for a distributed search system using Solr. > The use case is that of a system composed > of N indexes, each hosted on a separate machine, each index containing > unique content. > > Options that I know of are: > > A) Using Solr distributed search > B) Using Solr + Zookeeper integration > C) Using replication, i.e. each node replicates all the others > > It seems like options A) and B) would suffer from a fault-tolerance > standpoint: if any of the nodes goes down, the search won't -at this > time- return partial results, but instead report an exception. > Option C) would provide fault tolerance, at least for any search > initiated at a node that is available, but would incur into a large > replication overhead. > > Did I get any of the above wrong, or does somebody have some insight on > what is the best system architecture for this use case ? 
> > thanks in advance, > Luca
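To make the layering in the setup above a little more concrete, here is a sketch of the kind of request it amounts to: a query sent to the front VIP, carrying a "shards" parameter whose entries are themselves VIPs. All host names and the `/solr` paths are hypothetical, and in the setup described the "shards" value was hardwired in the core's configuration rather than sent by the client; the URL form is used here only because it is easier to read.

```python
from typing import List
from urllib.parse import urlencode

# Hypothetical addresses; the thread does not give the real ones.
FRONT_VIP = "search-vip.example.com:8983"
SHARD_VIPS = [
    "shards-vip-1.example.com:8983/solr",
    "shards-vip-2.example.com:8983/solr",
]


def sharded_query_url(front_vip: str, shard_vips: List[str], query: str) -> str:
    """Build a distributed-search URL: one entry address, several shard addresses."""
    params = {
        "q": query,
        # Solr's "shards" parameter is a comma-separated list of host:port/path
        # entries without a scheme; here each entry is a load-balanced VIP.
        "shards": ",".join(shard_vips),
    }
    return f"http://{front_vip}/solr/select?{urlencode(params)}"
```

Resilience comes from the VIPs at both levels: the front VIP survives the loss of any one broker host, and each shard VIP survives the loss of one host in its pair.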
distributed architecture
Hi,

I'd like to know if anybody has suggestions/opinions on what is currently the best architecture for a distributed search system using Solr. The use case is that of a system composed of N indexes, each hosted on a separate machine, each index containing unique content.

Options that I know of are:

A) Using Solr distributed search
B) Using Solr + Zookeeper integration
C) Using replication, i.e. each node replicates all the others

It seems like options A) and B) would suffer from a fault-tolerance standpoint: if any of the nodes goes down, the search won't -at this time- return partial results, but will instead report an exception. Option C) would provide fault tolerance, at least for any search initiated at a node that is available, but would incur a large replication overhead.

Did I get any of the above wrong, or does somebody have some insight on what is the best system architecture for this use case?

thanks in advance,
Luca
Re: Distributed Architecture
4 big indices if those servers can handle them. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: "Woytowitz, Matthew" > To: solr-user@lucene.apache.org > Sent: Tuesday, May 26, 2009 1:01:48 PM > Subject: Distributed Architecture > > I have 4 servers each with 8 cores and 32 gigs of ram with 2TB of SAN > space for each server. We want to distribute the index across the 4 > servers using shards. > > > > What would be better: > > > > 4 big indexes. One on each server. 4 total shards, one per server. > > > > X number of smaller indexes, X on each server. 4X total shards, X per > server. > > How should I divide this up? Separate Java process with separate > solrconfig dirs or one big 64-bit JVM with separate cores. > > > > Anyone done anything like this? > > > > Thanks, > > > > Matt Woytowitz
Distributed Architecture
I have 4 servers each with 8 cores and 32 gigs of RAM, with 2TB of SAN space for each server. We want to distribute the index across the 4 servers using shards.

What would be better:

* 4 big indexes, one on each server: 4 total shards, one per server.
* X number of smaller indexes, X on each server: 4X total shards, X per server.

How should I divide this up? Separate Java processes with separate solrconfig dirs, or one big 64-bit JVM with separate cores?

Anyone done anything like this?

Thanks,

Matt Woytowitz