Re: SolrCloud - distributed architecture considerations

2012-10-14 Thread Shawn Heisey

On 10/14/2012 11:16 AM, Erick Erickson wrote:

No, that's not what I'm thinking at all. There would be _no_
replication configured. You'd just have two completely independent
installations, one in each of your separate locations. The only
communication path would be that somehow the original documents
would need to get to both locations for indexing.


When I was on 1.4.1, I had replication set up.  Because replicating 
between 3.2.0 and 1.4.1 was not possible due to the javabin update, I 
changed over to this exact model, even though the servers are right next 
to each other in the racks.  Now I am on 3.5.0 and testing an update to 
branch_4x.  At this time I have no plans to change my distributed setup 
to SolrCloud.  One day I might go to two separate single-stranded 
SolrCloud setups in order to simplify my indexing code.  Our query 
volume is not high enough to require more than one online server chain.  
The only reason I have two chains is for high availability.


You might wonder why I would not take advantage of SolrCloud's automated 
replication.  I have simply found too much value in having two 
independently updated copies of my distributed index.  I wrote my 
indexing code such that it can actually update/reindex any arbitrary 
number of completely independent index chains.


When we want to make changes to our config/schema, I have a dev server 
where I can do almost all of the testing required, but that server is 
not big enough to hold the entire index.  Because I have independent 
production indexes, I can make the proposed changes on the B chain, 
reindex, and test against the full index in a staging environment 
without affecting the actual production site.  When it is time to roll 
changes into production, Solr's built-in enable/disable lets me switch 
the load balancer back and forth between the two indexes with one 
click.  If it works, I can then update chain A the same way.


Thanks,
Shawn



Re: SolrCloud - distributed architecture considerations

2012-10-14 Thread Erick Erickson
No, that's not what I'm thinking at all. There would be _no_
replication configured. You'd just have two completely independent
installations, one in each of your separate locations. The only
communication path would be that somehow the original documents
would need to get to both locations for indexing.

As far as each separate locations is concerned, any others
don't exist.

I suppose you could configure old-style replication, but then
the remote data center would lag and you'd lose the NRT
goodness etc. But that wouldn't be my first choice...

As for your question about roles, it's hard to see how that fits. What
are you envisioning would be different about the roles? SolrCloud is
based upon the raw documents getting to each node and being
indexed there. There's no notion of indexing the doc on one node and
them moving the _results_ of the indexing process to the other
replicas unless you eschew SolrCloud and just configure it as you
would a 3.x system.

Currently there are "leader" and "replica" roles, but all a leader does
during indexing is route incoming documents appropriately (ones that
belong in another shard are sent to the leader of that shard, and
ones that belong in its shard to its replicas).

Best
Erick

On Sun, Oct 14, 2012 at 12:19 PM, AlexeyK  wrote:
> In other words, I would have to apply a mixture of modes: SolrCloud for each
> location + old-style replication for mirroring.
> BTW, I've seen a notion of 'role' in node cloud state. Is it in use or is
> there for future extensions? Having 'indexer' and 'searcher' roles backed by
> the infrastructure would help in certain scenarios I think.
>
> Alexey
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrCloud-distributed-architecture-considerations-tp4013594p4013616.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud - distributed architecture considerations

2012-10-14 Thread AlexeyK
In other words, I would have to apply a mixture of modes: SolrCloud for each
location + old-style replication for mirroring. 
BTW, I've seen a notion of 'role' in node cloud state. Is it in use or is
there for future extensions? Having 'indexer' and 'searcher' roles backed by
the infrastructure would help in certain scenarios I think.

Alexey




--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-distributed-architecture-considerations-tp4013594p4013616.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud - distributed architecture considerations

2012-10-14 Thread Erick Erickson
First, remember that SolrCloud is relatively new, operational issues
like this will doubtless accrue "folk wisdom" as we all gain experience...

But my current thinking is that the remote installations are essentially
completely separate installations with no knowledge of each other. Your
indexing process needs to be smart enough to send _one_ update
request to each cluster. Or perhaps each cluster has its own indexing
process. The point is that the nodes in the remote installations don't
know about each other, so they don't incur the communications lag.

Because I'm pretty sure you're exactly right. The supposition is that
the pipe connecting the remote installations is slow/expensive/whatever
and the chatter that'll go back and forth if they are aware of each other
will be too expensive to really be practical. YMMV, of course I can
imagine low-frequency updates where at least the indexing process
wouldn't care, but then you'd still have the querying crossing the
expensive pipe...

As you can tell, I don't have a tried-and-tested answer...

Best
Erick



On Sun, Oct 14, 2012 at 5:41 AM, AlexeyK  wrote:
> Hi,
> As far as I understand, SolrCloud eliminates the master-slave specifics, and
> automates both update and search seamlessly.
> What should I take into account configuring SolrCloud for a large customer
> with multiple physical locations?
> I mean, for older Solr I would define master 'close to the data' with batch
> replication to the search server (slave). I would have several such slaves
> for different geographical locations as well.
> How can I ensure (if at all) that search queries do not cross geographical
> boundaries? As far as I understand, SolrCloud routes to any arbitrary active
> replica.
> How can I control the indexing process so that the update request is routed
> to the closest server? If SolrCloud accidentally elects some remote replica
> as a current leader, the indexing process will deteriorate due to networking
> issues; moreover, the update requests will be also bounced back across the
> network as a part of the online replication process.
>
> Do I miss something fundamental in my assumptions/understanding of SolrCloud
> features?
>
> Thanks a lot,
>
> Alexey
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrCloud-distributed-architecture-considerations-tp4013594.html
> Sent from the Solr - User mailing list archive at Nabble.com.


SolrCloud - distributed architecture considerations

2012-10-14 Thread AlexeyK
Hi,
As far as I understand, SolrCloud eliminates the master-slave specifics, and
automates both update and search seamlessly.
What should I take into account configuring SolrCloud for a large customer
with multiple physical locations?
I mean, for older Solr I would define master 'close to the data' with batch
replication to the search server (slave). I would have several such slaves
for different geographical locations as well.
How can I ensure (if at all) that search queries do not cross geographical
boundaries? As far as I understand, SolrCloud routes to any arbitrary active
replica.
How can I control the indexing process so that the update request is routed
to the closest server? If SolrCloud accidentally elects some remote replica
as a current leader, the indexing process will deteriorate due to networking
issues; moreover, the update requests will be also bounced back across the
network as a part of the online replication process.

Do I miss something fundamental in my assumptions/understanding of SolrCloud
features?

Thanks a lot,

Alexey



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-distributed-architecture-considerations-tp4013594.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: distributed architecture

2010-12-01 Thread Cinquini, Luca (3880)
Hi,
thanks all, this has been very instructive. It looks like in the short 
term using a combination of replication and sharding, based on Upayavira's 
setup, might be the safest thing to do, while in the longer term following the 
zookeeper integration and solandra development might provide a more dynamic 
environment and perhaps easier setup.
Please keep coming the good suggestions if you feel like.
thanks again,
Luca

On Dec 1, 2010, at 4:17 AM, Peter Karich wrote:

>  Hi,
> 
> also take a look at solandra:
> 
> https://github.com/tjake/Lucandra/tree/solandra
> 
> I don't have it in prod yet but regarding administration overhead it 
> looks very promising.
> And you'll get some other neat features like (soft) real time, for free. 
> So its same like A) +  C) + X) - Y) ;-)
> 
> Regards,
> Peter.
> 
> 
>> Hi,
>>  I'd like to know if anybody has suggestions/opinions on what is 
>> currently the best architecture for a distributed search system using Solr. 
>> The use case is that of a system composed
>> of N indexes, each hosted on a separate machine, each index containing 
>> unique content.
>> 
>> Options that I know of are:
>> 
>> A) Using Solr distributed search
>> B) Using Solr + Zookeeper integration
>> C) Using replication, i.e. each node replicates all the others
>> 
>> It seems like options A) and B) would suffer from a fault-tolerance 
>> standpoint: if any of the nodes goes down, the search won't -at this time- 
>> return partial results, but instead report an exception.
>> Option C) would provide fault tolerance, at least for any search initiated 
>> at a node that is available, but would incur into a large replication 
>> overhead.
>> 
>> Did I get any of the above wrong, or does somebody have some insight on what 
>> is the best system architecture for this use case ?
>> 
>> thanks in advance,
>> Luca
> 
> 
> -- 
> http://jetwick.com twitter search prototype
> 



Re: distributed architecture

2010-12-01 Thread Peter Karich

 Hi,

also take a look at solandra:

https://github.com/tjake/Lucandra/tree/solandra

I don't have it in prod yet but regarding administration overhead it 
looks very promising.
And you'll get some other neat features like (soft) real time, for free. 
So its same like A) +  C) + X) - Y) ;-)


Regards,
Peter.



Hi,
I'd like to know if anybody has suggestions/opinions on what is 
currently the best architecture for a distributed search system using Solr. The 
use case is that of a system composed
of N indexes, each hosted on a separate machine, each index containing unique 
content.

Options that I know of are:

A) Using Solr distributed search
B) Using Solr + Zookeeper integration
C) Using replication, i.e. each node replicates all the others

It seems like options A) and B) would suffer from a fault-tolerance standpoint: 
if any of the nodes goes down, the search won't -at this time- return partial 
results, but instead report an exception.
Option C) would provide fault tolerance, at least for any search initiated at a 
node that is available, but would incur into a large replication overhead.

Did I get any of the above wrong, or does somebody have some insight on what is 
the best system architecture for this use case ?

thanks in advance,
Luca



--
http://jetwick.com twitter search prototype



Re: distributed architecture

2010-12-01 Thread Upayavira
On Tue, 30 Nov 2010 23:11 -0800, "Dennis Gearon" 
wrote:
> Wow, would you put a diagram somewhere up on the Solr site?

> Or, here, and I will put it somewhere there.

I'll see what I can do to make a diagram.

> And, what is a VIP?

Virtual IP. It is what a load balancer uses. You assign a 'virtual IP'
to your load balancer, and it is responsible for forwarding traffic to
that IP to one of the hosts in that particular pool.

Upayavira


RE: distributed architecture

2010-12-01 Thread Upayavira
Okay, I'll see what I can do. 

Also for what it is worth, if anyone is in London tomorrow, I'm giving a
presentation which covers this topic at the (free) Online Information
2010 exhibition at Kensington Olympia, at 3:20pm. Anyone interested is
welcome to come along. I believe we're hoping to video it, so if
successful, I expect it'll get put online somewhere.

Upayavira

On Wed, 01 Dec 2010 03:44 +, "Jayant Das" 
wrote:
> 
> Hi, A diagram will be very much appreciated.
> Thanks,
> Jayant
>  
> > From: u...@odoko.co.uk
> > To: solr-user@lucene.apache.org
> > Subject: Re: distributed architecture
> > Date: Wed, 1 Dec 2010 00:39:40 +
> > 
> > I cannot say how mature the code for B) is, but it is not yet included
> > in a release.
> > 
> > If you want the ability to distribute content across multiple nodes (due
> > to volume) and want resilience, then use both.
> > 
> > I've had one setup where we have two master servers, each with four
> > cores. Then we have two pairs of slaves. Each pair mirrors the masters,
> > so we have two hosts covering each of our cores.
> > 
> > Then comes the complicated bit to explain...
> > 
> > Each of these four slave hosts had a core that was configured with a
> > hardwired "shards" request parameter, which pointed to each of our
> > shards. Actually, it pointed to VIPs on a load balancer. Those two VIPs
> > then balanced across each of our pair of hosts.
> > 
> > Then, put all four of these servers behind another VIP, and we had a
> > single address we could push requests to, for sharded, and resilient
> > search.
> > 
> > Now if that doesn't make any sense, let me know and I'll have another go
> > at explaining it (or even attempt a diagram).
> > 
> > Upayavira
> > 
> > On Tue, 30 Nov 2010 13:27 -0800, "Cinquini, Luca (3880)"
> >  wrote:
> > > Hi,
> > > I'd like to know if anybody has suggestions/opinions on what is currently 
> > > the best architecture for a distributed search system using Solr. The use 
> > > case is that of a system composed
> > > of N indexes, each hosted on a separate machine, each index containing
> > > unique content.
> > > 
> > > Options that I know of are:
> > > 
> > > A) Using Solr distributed search
> > > B) Using Solr + Zookeeper integration
> > > C) Using replication, i.e. each node replicates all the others
> > > 
> > > It seems like options A) and B) would suffer from a fault-tolerance
> > > standpoint: if any of the nodes goes down, the search won't -at this
> > > time- return partial results, but instead report an exception.
> > > Option C) would provide fault tolerance, at least for any search
> > > initiated at a node that is available, but would incur into a large
> > > replication overhead.
> > > 
> > > Did I get any of the above wrong, or does somebody have some insight on
> > > what is the best system architecture for this use case ?
> > > 
> > > thanks in advance,
> > > Luca
> 


Re: distributed architecture

2010-11-30 Thread Dennis Gearon
Wow, would you put a diagram somewhere up on the Solr site?

Or, here, and I will put it somewhere there.

And, what is a VIP?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Upayavira 
To: "solr-user@lucene.apache.org" 
Sent: Tue, November 30, 2010 4:39:40 PM
Subject: Re: distributed architecture

I cannot say how mature the code for B) is, but it is not yet included
in a release.

If you want the ability to distribute content across multiple nodes (due
to volume) and want resilience, then use both.

I've had one setup where we have two master servers, each with four
cores. Then we have two pairs of slaves. Each pair mirrors the masters,
so we have two hosts covering each of our cores.

Then comes the complicated bit to explain...

Each of these four slave hosts had a core that was configured with a
hardwired "shards" request parameter, which pointed to each of our
shards. Actually, it pointed to VIPs on a load balancer. Those two VIPs
then balanced across each of our pair of hosts.

Then, put all four of these servers behind another VIP, and we had a
single address we could push requests to, for sharded, and resilient
search.

Now if that doesn't make any sense, let me know and I'll have another go
at explaining it (or even attempt a diagram).

Upayavira

On Tue, 30 Nov 2010 13:27 -0800, "Cinquini, Luca (3880)"
 wrote:
> Hi,
> I'd like to know if anybody has suggestions/opinions on what is currently 
>the best architecture for a distributed search system using Solr. The use case 
>is that of a system composed
> of N indexes, each hosted on a separate machine, each index containing
> unique content.
> 
> Options that I know of are:
> 
> A) Using Solr distributed search
> B) Using Solr + Zookeeper integration
> C) Using replication, i.e. each node replicates all the others
> 
> It seems like options A) and B) would suffer from a fault-tolerance
> standpoint: if any of the nodes goes down, the search won't -at this
> time- return partial results, but instead report an exception.
> Option C) would provide fault tolerance, at least for any search
> initiated at a node that is available, but would incur into a large
> replication overhead.
> 
> Did I get any of the above wrong, or does somebody have some insight on
> what is the best system architecture for this use case ?
> 
> thanks in advance,
> Luca



RE: distributed architecture

2010-11-30 Thread Jayant Das

Hi, A diagram will be very much appreciated.
Thanks,
Jayant
 
> From: u...@odoko.co.uk
> To: solr-user@lucene.apache.org
> Subject: Re: distributed architecture
> Date: Wed, 1 Dec 2010 00:39:40 +
> 
> I cannot say how mature the code for B) is, but it is not yet included
> in a release.
> 
> If you want the ability to distribute content across multiple nodes (due
> to volume) and want resilience, then use both.
> 
> I've had one setup where we have two master servers, each with four
> cores. Then we have two pairs of slaves. Each pair mirrors the masters,
> so we have two hosts covering each of our cores.
> 
> Then comes the complicated bit to explain...
> 
> Each of these four slave hosts had a core that was configured with a
> hardwired "shards" request parameter, which pointed to each of our
> shards. Actually, it pointed to VIPs on a load balancer. Those two VIPs
> then balanced across each of our pair of hosts.
> 
> Then, put all four of these servers behind another VIP, and we had a
> single address we could push requests to, for sharded, and resilient
> search.
> 
> Now if that doesn't make any sense, let me know and I'll have another go
> at explaining it (or even attempt a diagram).
> 
> Upayavira
> 
> On Tue, 30 Nov 2010 13:27 -0800, "Cinquini, Luca (3880)"
>  wrote:
> > Hi,
> > I'd like to know if anybody has suggestions/opinions on what is currently 
> > the best architecture for a distributed search system using Solr. The use 
> > case is that of a system composed
> > of N indexes, each hosted on a separate machine, each index containing
> > unique content.
> > 
> > Options that I know of are:
> > 
> > A) Using Solr distributed search
> > B) Using Solr + Zookeeper integration
> > C) Using replication, i.e. each node replicates all the others
> > 
> > It seems like options A) and B) would suffer from a fault-tolerance
> > standpoint: if any of the nodes goes down, the search won't -at this
> > time- return partial results, but instead report an exception.
> > Option C) would provide fault tolerance, at least for any search
> > initiated at a node that is available, but would incur into a large
> > replication overhead.
> > 
> > Did I get any of the above wrong, or does somebody have some insight on
> > what is the best system architecture for this use case ?
> > 
> > thanks in advance,
> > Luca
  

Re: distributed architecture

2010-11-30 Thread Shawn Heisey

On 11/30/2010 2:27 PM, Cinquini, Luca (3880) wrote:

Hi,
I'd like to know if anybody has suggestions/opinions on what is 
currently the best architecture for a distributed search system using Solr. The 
use case is that of a system composed
of N indexes, each hosted on a separate machine, each index containing unique 
content.

Options that I know of are:

A) Using Solr distributed search
B) Using Solr + Zookeeper integration
C) Using replication, i.e. each node replicates all the others

It seems like options A) and B) would suffer from a fault-tolerance standpoint: 
if any of the nodes goes down, the search won't -at this time- return partial 
results, but instead report an exception.
Option C) would provide fault tolerance, at least for any search initiated at a 
node that is available, but would incur into a large replication overhead.


Exactly what will work best for you is highly dependent on your specific 
requirements.  Answers to the following questions will influence the 
choice.  If you choose distributed, they will also affect how you 
distribute the data among the different machines:


* How many documents?
* How much disk space will the index take up?
* Do you need uniform IDF across the entire corpus?
* How often do you need to insert new content?
* How often do you need to delete old content?
* What query volume do you have to support?

For my index, I use distributed+replicated, for redundancy.  Statistics:

* Low query volume.
* 49 million documents
* Total index size: 87GB (using 1024, not 1000)
* Adding about 40-5 new documents every day
* Inserts done every two minutes.
* Deletes done every ten minutes.
* Six shards with static content.  Each one is 14GB and 8+ million 
documents.
* One shard with content less than a week old.  It is usually less than 
1GB in size.


The entire system consists of 18 virtual machines on four physical 
hosts.  There is a master VMs and a slave VM for each shard.  The static 
shard VMs have 9GB of RAM, the recent content shard has 3GB of RAM.  Two 
VMs with 1.5GB of RAM are Solr instances that do not have indexes, they 
are used as search brokers.  Two VMs are load balancers, running 
heartbeat and HAProxy.  The load balancers present the two search 
brokers as a single IP address.  Each of the VMs in a pair is running on 
a different physical host.


By low query volume, I mean that the average queries per second is less 
than 1, but this is over a long time period, so it includes nights and 
weekends when there is very little traffic.  Even during heavy times, I 
would estimate that the queries per second is still single digit.  Over 
the last 11 days and 21 hours, the website made 320,000 queries.  In the 
same time period, there were 340,000 load balancer health-check queries, 
at a rate of two every five seconds.


Shawn



Re: distributed architecture

2010-11-30 Thread Upayavira
I cannot say how mature the code for B) is, but it is not yet included
in a release.

If you want the ability to distribute content across multiple nodes (due
to volume) and want resilience, then use both.

I've had one setup where we have two master servers, each with four
cores. Then we have two pairs of slaves. Each pair mirrors the masters,
so we have two hosts covering each of our cores.

Then comes the complicated bit to explain...

Each of these four slave hosts had a core that was configured with a
hardwired "shards" request parameter, which pointed to each of our
shards. Actually, it pointed to VIPs on a load balancer. Those two VIPs
then balanced across each of our pair of hosts.

Then, put all four of these servers behind another VIP, and we had a
single address we could push requests to, for sharded, and resilient
search.

Now if that doesn't make any sense, let me know and I'll have another go
at explaining it (or even attempt a diagram).

Upayavira

On Tue, 30 Nov 2010 13:27 -0800, "Cinquini, Luca (3880)"
 wrote:
> Hi,
>   I'd like to know if anybody has suggestions/opinions on what is 
> currently the best architecture for a distributed search system using Solr. 
> The use case is that of a system composed
> of N indexes, each hosted on a separate machine, each index containing
> unique content.
> 
> Options that I know of are:
> 
> A) Using Solr distributed search
> B) Using Solr + Zookeeper integration
> C) Using replication, i.e. each node replicates all the others
> 
> It seems like options A) and B) would suffer from a fault-tolerance
> standpoint: if any of the nodes goes down, the search won't -at this
> time- return partial results, but instead report an exception.
> Option C) would provide fault tolerance, at least for any search
> initiated at a node that is available, but would incur into a large
> replication overhead.
> 
> Did I get any of the above wrong, or does somebody have some insight on
> what is the best system architecture for this use case ?
> 
> thanks in advance,
> Luca


distributed architecture

2010-11-30 Thread Cinquini, Luca (3880)
Hi,
I'd like to know if anybody has suggestions/opinions on what is 
currently the best architecture for a distributed search system using Solr. The 
use case is that of a system composed
of N indexes, each hosted on a separate machine, each index containing unique 
content.

Options that I know of are:

A) Using Solr distributed search
B) Using Solr + Zookeeper integration
C) Using replication, i.e. each node replicates all the others

It seems like options A) and B) would suffer from a fault-tolerance standpoint: 
if any of the nodes goes down, the search won't -at this time- return partial 
results, but instead report an exception.
Option C) would provide fault tolerance, at least for any search initiated at a 
node that is available, but would incur into a large replication overhead.

Did I get any of the above wrong, or does somebody have some insight on what is 
the best system architecture for this use case ?

thanks in advance,
Luca

Re: Distributed Architecture

2009-05-26 Thread Otis Gospodnetic

4 big indices if those servers can handle them.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: "Woytowitz, Matthew" 
> To: solr-user@lucene.apache.org
> Sent: Tuesday, May 26, 2009 1:01:48 PM
> Subject: Distributed Architecture
> 
> I have 4 servers each with 8 cores and 32 gigs of ram with 2TB of SAN
> space for each server.  We want to distribute the index across the 4
> servers using shards. 
> 
> 
> 
> What would be better:
> 
> 
> 
> 4 big indexes.  One on each server.  4 total shards, one per server.
> 
> 
> 
> X number of smaller indexes, X on each server.  4X total shards, X per
> server.
> 
> How should I divide this up?  Separate Java process with separate
> solrconfig dirs or one big 64-bit JVM with separate cores.
> 
> 
> 
> Anyone done anything like this?
> 
> 
> 
> Thanks,
> 
> 
> 
> Matt Woytowitz



Distributed Architecture

2009-05-26 Thread Woytowitz, Matthew
I have 4 servers each with 8 cores and 32 gigs of ram with 2TB of SAN
space for each server.  We want to distribute the index across the 4
servers using shards. 

 

What would be better:

 

4 big indexes.  One on each server.  4 total shards, one per server.

 

X number of smaller indexes, X on each server.  4X total shards, X per
server.

How should I divide this up?  Separate Java process with separate
solrconfig dirs or one big 64-bit JVM with separate cores.

 

Anyone done anything like this?

 

Thanks,

 

Matt Woytowitz