Re: Ok to share ZK nodes with Hadoop nodes?

2010-03-08 Thread Ted Dunning
I have used 5 and 3 in different clusters.  Moderate amounts of sharing are
reasonable, but sharing with less intensive applications is definitely
better.  Sharing with the job tracker, for instance, is likely fine since it
doesn't abuse the disk so much.  The namenode is similar, but not quite as
nice.  Sharing with task-only nodes is better than sharing with data nodes.

If your Hadoop cluster is 10 machines, this is probably pretty serious
overhead.  If it is 200 machines, it is much less so.

If you are running in EC2, then spawning 3 extra small instances is not a
big deal.

For the record, we share our production ZK machines with other tasks, but
not with map-reduce related tasks and not with our production search
engines.

On Mon, Mar 8, 2010 at 11:21 AM, Patrick Hunt  wrote:

> Best practice for "on-line production serving" is 5 dedicated hosts with
> "shared nothing", physically distributed throughout the data center (5 hosts
> in a rack might not be the best idea for super reliability). There's a lot of
> leeway though; many people run with 3 and accept a SPOF on the switch, for example.
>


Re: Managing multi-site clusters with Zookeeper

2010-03-08 Thread Patrick Hunt
That's controlled by the "tickTime"/"syncLimit"/"initLimit"/etc. - see more
about this in the admin guide: http://bit.ly/c726DC


You'll want to increase from the defaults since those are typically for
high-performance interconnects (i.e. within a colo). You are correct though;
much will depend on your environment, and some tuning will be involved.
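
For illustration, a zoo.cfg along the following lines stretches those windows
for slow inter-site links; the values and host names are made-up starting
points for a lab, not recommendations from this thread:

    # zoo.cfg - illustrative values only; tune against your measured link latency
    tickTime=2000      # ms per tick
    initLimit=20       # followers get 20 ticks (40s) to connect and sync with the leader
    syncLimit=10       # followers may fall up to 10 ticks (20s) behind the leader
    dataDir=/var/zookeeper
    clientPort=2181
    server.1=zk1.site-a.example.com:2888:3888
    server.2=zk2.site-b.example.com:2888:3888
    server.3=zk3.site-c.example.com:2888:3888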


Patrick

Martin Waite wrote:

Hi Patrick,

Thanks for your input.

I am planning on having 3 zk servers per data centre, with perhaps only 2 in
the tie-breaker site.

The traffic between zk and the applications will be lots of local reads -
"who is the primary database?".  Changes to the config will be rare (server
rebuilds, etc. - i.e. planned changes) or caused by server / network / site
failure.
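
As a rough sketch of what that local read looks like from the Java client
(the znode path, connect string, and handling here are made up for
illustration, not taken from Martin's setup):

    import org.apache.zookeeper.ZooKeeper;

    public class PrimaryLookup {
        public static void main(String[] args) throws Exception {
            // Connect to a ZooKeeper server in the local site; a real client
            // would wait for the SyncConnected event before issuing requests.
            ZooKeeper zk = new ZooKeeper("zk1.site-a.example.com:2181", 30000,
                    event -> { /* session events */ });

            // Read which database is primary from a hypothetical znode and
            // leave a watch so we hear when it changes. The read is served
            // by the local server, so there is no cross-site round trip.
            byte[] data = zk.getData("/primary-db",
                    event -> System.out.println("primary changed: " + event),
                    null);
            System.out.println("current primary: " + new String(data, "UTF-8"));
            zk.close();
        }
    }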

The interesting thing in my mind is how zookeeper will cope with inter-site
link failure - how quickly the remote sites will notice, and how quickly
normality can be resumed when the link reappears.

I need to get this running in the lab and start pulling out wires.

regards,
Martin

On 8 March 2010 17:39, Patrick Hunt  wrote:


IMO latency is the primary issue you will face, but also keep in mind
reliability w/in a colo.

Say you have 3 colos (obv can't be 2). If you only have 3 servers, one in
each colo, you will be reliable, but clients w/in each colo will have to
connect to a remote colo if the local server fails. You will want to prioritize
the local colo given that reads can be serviced entirely locally that way. If
you have 7 servers (2-2-3) that would be better - if a local server fails you
have a redundant one; if both fail then you go remote.

You want to keep your writes as few as possible and as small as possible.
Why? Say you have 100ms latency btw colos; let's go through a scenario for a
client in a colo where the local servers are not the leader (the zk cluster
leader).

read:
1) client reads a znode from local server
2) local server (usually < 1ms if "in colo" comm) responds in 1ms

write:
1) client writes a znode to local server A
2) A proposes change to the ZK Leader (L) in remote colo
3) L gets the proposal in 100ms
4) L proposes the change to all followers
5) all followers (not exactly, but hopefully) get the proposal in 100ms
6) followers ack the change
7) L gets the acks in 100ms
8) L commits the change (message to all followers)
9) A gets the commit in 100ms
10) A responds to client (< 1ms)

write latency: 100 + 100 + 100 + 100 = 400ms

Obviously keeping these writes small is also critical.
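
A rough way to see that cost in a lab is to time a few small writes from a
client attached to a follower in a non-leader colo. Everything below (host
name, probe path, session timeout) is a made-up sketch, not part of Patrick's
message:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class WriteLatencyProbe {
        public static void main(String[] args) throws Exception {
            // Connect to a local follower; a real probe would wait for
            // SyncConnected before starting the measurements.
            ZooKeeper zk = new ZooKeeper("zk1.site-a.example.com:2181", 30000, event -> { });

            // Create a scratch znode (assumes it does not already exist).
            String path = zk.create("/latency-probe", new byte[0],
                    Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Each setData is forwarded to the leader and quorum-acked, so the
            // times below include the cross-colo round trips described above.
            for (int i = 0; i < 10; i++) {
                long start = System.nanoTime();
                zk.setData(path, Integer.toString(i).getBytes(), -1); // -1 = any version
                System.out.printf("write %d: %.1f ms%n", i, (System.nanoTime() - start) / 1e6);
            }
            zk.close();
        }
    }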

Patrick


Martin Waite wrote:


Hi Ted,

If the links do not work for us for zk, then they are unlikely to work
with
any other solution - such as trying to stretch Pacemaker or Red Hat
Cluster
with their multicast protocols across the links.

If the links are not good enough, we might have to spend some more money
to
fix this.

regards,
Martin

On 8 March 2010 02:14, Ted Dunning  wrote:

If you can stand the latency for updates then zk should work well for you.
It is unlikely that you will be able to do better than zk does and still
maintain correctness.

Do note that you can probably bias clients to use a local server. That
should make things more efficient.
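
One way to get that bias, sketched here with hypothetical host names, is to
put only the local colo's servers in each colo's client connect string - the
stock Java client picks a server from the list at random, so merely listing
local hosts first is not enough:

    import org.apache.zookeeper.ZooKeeper;

    public class LocalColoClient {
        public static void main(String[] args) throws Exception {
            // Clients in colo A are configured with colo-A servers only, so
            // their reads stay in the site while those servers are up.
            ZooKeeper zk = new ZooKeeper(
                    "zk1.colo-a.example.com:2181,zk2.colo-a.example.com:2181",
                    30000, event -> { /* session events */ });
            // ... use zk ...
            zk.close();
        }
    }

The trade-off is that such clients cannot fail over to a remote colo on their
own if every local server goes down.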

Sent from my iPhone


On Mar 7, 2010, at 3:00 PM, Mahadev Konar  wrote:

The inter-site links are a nuisance.  We have two data-centres with 100Mb
links which I hope would be good enough for most uses, but we need a 3rd
site - and currently that only has 2Mb links to the other sites.  This
might be a problem.






Re: Ok to share ZK nodes with Hadoop nodes?

2010-03-08 Thread David Rosenstrauch

On 03/08/2010 02:21 PM, Patrick Hunt wrote:

See the troubleshooting page, some apropos detail there (esp relative to
virtual env).

http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

ZK servers are sensitive to IO (disk/network) latency. As long as you
don't have very latency-sensitive requirements it should be fine. If the
machine were to swap, for example, or the JVM were to go into a long-term
GC pause (virtualization in particular kills JVM GC), that would be bad.

Best practice for "on-line production serving" is 5 dedicated hosts with
"shared nothing", physically distributed throughout the data center (5
hosts in a rack might not be the best idea for super reliability).
There's a lot of leeway though; many people run with 3 and accept a SPOF on
the switch, for example.

Patrick


Thanks much for the advice, Patrick.  (And Mahadev.)

DR


Re: Managing multi-site clusters with Zookeeper

2010-03-08 Thread Mahadev Konar
Hi Martin,
  The results would be really nice information to have on the ZooKeeper wiki.
It would be very helpful for others considering the same kind of deployment.
So, do send out your results on the list.


Thanks
mahadev


On 3/8/10 11:18 AM, "Martin Waite"  wrote:

> Hi Patrick,
> 
> Thanks for your input.
> 
> I am planning on having 3 zk servers per data centre, with perhaps only 2 in
> the tie-breaker site.
> 
> The traffic between zk and the applications will be lots of local reads -
> "who is the primary database?".  Changes to the config will be rare (server
> rebuilds, etc. - i.e. planned changes) or caused by server / network / site
> failure.
> 
> The interesting thing in my mind is how zookeeper will cope with inter-site
> link failure - how quickly the remote sites will notice, and how quickly
> normality can be resumed when the link reappears.
> 
> I need to get this running in the lab and start pulling out wires.
> 
> regards,
> Martin
> 
> On 8 March 2010 17:39, Patrick Hunt  wrote:
> 
>> IMO latency is the primary issue you will face, but also keep in mind
>> reliability w/in a colo.
>> 
>> Say you have 3 colos (obv can't be 2). If you only have 3 servers, one in
>> each colo, you will be reliable, but clients w/in each colo will have to
>> connect to a remote colo if the local server fails. You will want to prioritize
>> the local colo given that reads can be serviced entirely locally that way. If
>> you have 7 servers (2-2-3) that would be better - if a local server fails you
>> have a redundant one; if both fail then you go remote.
>> 
>> You want to keep your writes as few as possible and as small as possible.
>> Why? Say you have 100ms latency btw colos; let's go through a scenario for a
>> client in a colo where the local servers are not the leader (the zk cluster
>> leader).
>> 
>> read:
>> 1) client reads a znode from local server
>> 2) local server (usually < 1ms if "in colo" comm) responds in 1ms
>> 
>> write:
>> 1) client writes a znode to local server A
>> 2) A proposes change to the ZK Leader (L) in remote colo
>> 3) L gets the proposal in 100ms
>> 4) L proposes the change to all followers
>> 5) all followers (not exactly, but hopefully) get the proposal in 100ms
>> 6) followers ack the change
>> 7) L gets the acks in 100ms
>> 8) L commits the change (message to all followers)
>> 9) A gets the commit in 100ms
>> 10) A responds to client (< 1ms)
>> 
>> write latency: 100 + 100 + 100 + 100 = 400ms
>> 
>> Obviously keeping these writes small is also critical.
>> 
>> Patrick
>> 
>> 
>> Martin Waite wrote:
>> 
>>> Hi Ted,
>>> 
>>> If the links do not work for us for zk, then they are unlikely to work
>>> with
>>> any other solution - such as trying to stretch Pacemaker or Red Hat
>>> Cluster
>>> with their multicast protocols across the links.
>>> 
>>> If the links are not good enough, we might have to spend some more money
>>> to
>>> fix this.
>>> 
>>> regards,
>>> Martin
>>> 
>>> On 8 March 2010 02:14, Ted Dunning  wrote:
>>> 
>>>> If you can stand the latency for updates then zk should work well for
>>>> you. It is unlikely that you will be able to do better than zk does and
>>>> still maintain correctness.
>>>>
>>>> Do note that you can probably bias clients to use a local server. That
>>>> should make things more efficient.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Mar 7, 2010, at 3:00 PM, Mahadev Konar  wrote:
>>>>
>>>>> The inter-site links are a nuisance.  We have two data-centres with
>>>>> 100Mb links which I hope would be good enough for most uses, but we need
>>>>> a 3rd site - and currently that only has 2Mb links to the other sites.
>>>>> This might be a problem.


Re: Ok to share ZK nodes with Hadoop nodes?

2010-03-08 Thread Patrick Hunt
See the troubleshooting page, some apropos detail there (esp relative to 
virtual env).


http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

ZK servers are sensitive to IO (disk/network) latency. As long as you 
don't have very latency-sensitive requirements it should be fine. If the 
machine were to swap, for example, or the JVM were to go into a long-term 
GC pause (virtualization in particular kills JVM GC), that would be bad.


Best practice for "on-line production serving" is 5 dedicated hosts with 
"shared nothing", physically distributed throughout the data center (5 
hosts in a rack might not be the best idea for super reliability). 
There's a lot of leeway though; many people run with 3 and accept a SPOF on 
the switch, for example.


Patrick

David Rosenstrauch wrote:
I'm contemplating an upcoming zookeeper rollout and was wondering what 
the zookeeper brain trust here thought about a network deployment question:


Is it generally considered bad practice to just deploy zookeeper on our 
existing hdfs/MR nodes?  Or is it better to run zookeeper instances on 
their own dedicated nodes?


On the one hand, we're not going to be making heavy-duty use of 
zookeeper, so it might be sufficient for zookeeper nodes to share box 
resources with HDFS & MR.  On the other hand, though, I don't want 
zookeeper to become unavailable if the nodes are running a resource 
intensive job that's hogging CPU or network.



What's generally considered best practice for Zookeeper?

Thanks,

DR


Re: Managing multi-site clusters with Zookeeper

2010-03-08 Thread Martin Waite
Hi Patrick,

Thanks for your input.

I am planning on having 3 zk servers per data centre, with perhaps only 2 in
the tie-breaker site.

The traffic between zk and the applications will be lots of local reads -
"who is the primary database?".  Changes to the config will be rare (server
rebuilds, etc. - i.e. planned changes) or caused by server / network / site
failure.

The interesting thing in my mind is how zookeeper will cope with inter-site
link failure - how quickly the remote sites will notice, and how quickly
normality can be resumed when the link reappears.

I need to get this running in the lab and start pulling out wires.

regards,
Martin

On 8 March 2010 17:39, Patrick Hunt  wrote:

> IMO latency is the primary issue you will face, but also keep in mind
> reliability w/in a colo.
>
> Say you have 3 colos (obv can't be 2). If you only have 3 servers, one in
> each colo, you will be reliable, but clients w/in each colo will have to
> connect to a remote colo if the local server fails. You will want to prioritize
> the local colo given that reads can be serviced entirely locally that way. If
> you have 7 servers (2-2-3) that would be better - if a local server fails you
> have a redundant one; if both fail then you go remote.
>
> You want to keep your writes as few as possible and as small as possible.
> Why? Say you have 100ms latency btw colos; let's go through a scenario for a
> client in a colo where the local servers are not the leader (the zk cluster
> leader).
>
> read:
> 1) client reads a znode from local server
> 2) local server (usually < 1ms if "in colo" comm) responds in 1ms
>
> write:
> 1) client writes a znode to local server A
> 2) A proposes change to the ZK Leader (L) in remote colo
> 3) L gets the proposal in 100ms
> 4) L proposes the change to all followers
> 5) all followers (not exactly, but hopefully) get the proposal in 100ms
> 6) followers ack the change
> 7) L gets the acks in 100ms
> 8) L commits the change (message to all followers)
> 9) A gets the commit in 100ms
> 10) A responds to client (< 1ms)
>
> write latency: 100 + 100 + 100 + 100 = 400ms
>
> Obviously keeping these writes small is also critical.
>
> Patrick
>
>
> Martin Waite wrote:
>
>> Hi Ted,
>>
>> If the links do not work for us for zk, then they are unlikely to work
>> with
>> any other solution - such as trying to stretch Pacemaker or Red Hat
>> Cluster
>> with their multicast protocols across the links.
>>
>> If the links are not good enough, we might have to spend some more money
>> to
>> fix this.
>>
>> regards,
>> Martin
>>
>> On 8 March 2010 02:14, Ted Dunning  wrote:
>>
>>> If you can stand the latency for updates then zk should work well for
>>> you. It is unlikely that you will be able to do better than zk does and
>>> still maintain correctness.
>>>
>>> Do note that you can probably bias clients to use a local server. That
>>> should make things more efficient.
>>>
>>> Sent from my iPhone
>>>
>>>
>>> On Mar 7, 2010, at 3:00 PM, Mahadev Konar  wrote:
>>>
>>>> The inter-site links are a nuisance.  We have two data-centres with
>>>> 100Mb links which I hope would be good enough for most uses, but we need
>>>> a 3rd site - and currently that only has 2Mb links to the other sites.
>>>> This might be a problem.


Re: Ok to share ZK nodes with Hadoop nodes?

2010-03-08 Thread Mahadev Konar
Hi David,
  Sharing the cluster with HDFS and MapReduce might cause significant
problems. MapReduce is very IO intensive and this might cause a lot of
unnecessary hiccups in your cluster. I would suggest at least providing
something like the following, if you really want to share the nodes.

- at least a considerable amount of memory, say 400-500MB (depending on
your usage), for the Java heap
- one dedicated disk, not used by MR or the DataNodes, so that ZooKeeper
performance is a little more predictable for you (see the config sketch below).
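
For illustration only - the paths, heap size, and JVMFLAGS mechanism below are
assumptions, not something specified in this thread - that separation might
look like:

    # zoo.cfg: keep the transaction log off the disks used by MR/HDFS
    dataDir=/var/zookeeper/snapshots      # snapshots (can share a disk)
    dataLogDir=/disk2/zookeeper/txnlog    # dedicated spindle for the write-ahead log

    # JVM heap for the ZooKeeper server process, e.g. via the startup environment:
    #   export JVMFLAGS="-Xms512m -Xmx512m"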

Thanks
mahadev


On 3/8/10 10:58 AM, "David Rosenstrauch"  wrote:

> I'm contemplating an upcoming zookeeper rollout and was wondering what
> the zookeeper brain trust here thought about a network deployment question:
> 
> Is it generally considered bad practice to just deploy zookeeper on our
> existing hdfs/MR nodes?  Or is it better to run zookeeper instances on
> their own dedicated nodes?
> 
> On the one hand, we're not going to be making heavy-duty use of
> zookeeper, so it might be sufficient for zookeeper nodes to share box
> resources with HDFS & MR.  On the other hand, though, I don't want
> zookeeper to become unavailable if the nodes are running a resource
> intensive job that's hogging CPU or network.
> 
> 
> What's generally considered best practice for Zookeeper?
> 
> Thanks,
> 
> DR



Ok to share ZK nodes with Hadoop nodes?

2010-03-08 Thread David Rosenstrauch
I'm contemplating an upcoming zookeeper rollout and was wondering what 
the zookeeper brain trust here thought about a network deployment question:


Is it generally considered bad practice to just deploy zookeeper on our 
existing hdfs/MR nodes?  Or is it better to run zookeeper instances on 
their own dedicated nodes?


On the one hand, we're not going to be making heavy-duty use of 
zookeeper, so it might be sufficient for zookeeper nodes to share box 
resources with HDFS & MR.  On the other hand, though, I don't want 
zookeeper to become unavailable if the nodes are running a resource 
intensive job that's hogging CPU or network.



What's generally considered best practice for Zookeeper?

Thanks,

DR


Re: Managing multi-site clusters with Zookeeper

2010-03-08 Thread Patrick Hunt
IMO latency is the primary issue you will face, but also keep in mind 
reliability w/in a colo.


Say you have 3 colos (obv can't be 2). If you only have 3 servers, one 
in each colo, you will be reliable, but clients w/in each colo will have 
to connect to a remote colo if the local server fails. You will want to 
prioritize the local colo given that reads can be serviced entirely 
locally that way. If you have 7 servers (2-2-3) that would be better - if 
a local server fails you have a redundant one; if both fail then you go remote.


You want to keep your writes as few as possible and as small as 
possible. Why? Say you have 100ms latency btw colos; let's go through a 
scenario for a client in a colo where the local servers are not the 
leader (the zk cluster leader).


read:
1) client reads a znode from local server
2) local server (usually < 1ms if "in colo" comm) responds in 1ms

write:
1) client writes a znode to local server A
2) A proposes change to the ZK Leader (L) in remote colo
3) L gets the proposal in 100ms
4) L proposes the change to all followers
5) all followers (not exactly, but hopefully) get the proposal in 100ms
6) followers ack the change
7) L gets the acks in 100ms
8) L commits the change (message to all followers)
9) A gets the commit in 100ms
10) A responds to client (< 1ms)

write latency: 100 + 100 + 100 + 100 = 400ms

Obviously keeping these writes small is also critical.

Patrick

Martin Waite wrote:

Hi Ted,

If the links do not work for us for zk, then they are unlikely to work with
any other solution - such as trying to stretch Pacemaker or Red Hat Cluster
with their multicast protocols across the links.

If the links are not good enough, we might have to spend some more money to
fix this.

regards,
Martin

On 8 March 2010 02:14, Ted Dunning  wrote:


If you can stand the latency for updates then zk should work well for you.
It is unlikely that you will be able to do better than zk does and still
maintain correctness.

Do note that you can probably bias clients to use a local server. That
should make things more efficient.

Sent from my iPhone


On Mar 7, 2010, at 3:00 PM, Mahadev Konar  wrote:

The inter-site links are a nuisance.  We have two data-centres with 100Mb
links which I hope would be good enough for most uses, but we need a 3rd
site - and currently that only has 2Mb links to the other sites.  This
might be a problem.





Re: Managing multi-site clusters with Zookeeper

2010-03-08 Thread Martin Waite
Hi Ted,

If the links do not work for us for zk, then they are unlikely to work with
any other solution - such as trying to stretch Pacemaker or Red Hat Cluster
with their multicast protocols across the links.

If the links are not good enough, we might have to spend some more money to
fix this.

regards,
Martin

On 8 March 2010 02:14, Ted Dunning  wrote:

> If you can stand the latency for updates then zk should work well for you.
> It is unlikely that you will be able to do better than zk does and still
> maintain correctness.
>
> Do note that you can probably bias clients to use a local server. That
> should make things more efficient.
>
> Sent from my iPhone
>
>
> On Mar 7, 2010, at 3:00 PM, Mahadev Konar  wrote:
>
>>> The inter-site links are a nuisance.  We have two data-centres with 100Mb
>>> links which I hope would be good enough for most uses, but we need a 3rd
>>> site - and currently that only has 2Mb links to the other sites.  This
>>> might be a problem.


Re: Managing multi-site clusters with Zookeeper

2010-03-08 Thread Martin Waite
Hi Mahadev,

Thanks again for your insight.

I will no doubt be in touch to let you know how this works out.

regards,
Martin

On 7 March 2010 23:00, Mahadev Konar  wrote:

> Martin,
>  A 2Mb link might certainly be a problem. We can refer to these nodes as
> ZooKeeper servers; "znodes" refers to the data elements in the ZooKeeper data
> tree.
>
> The ZooKeeper ensemble has minimal background traffic, which is basically
> health checks between the members of the ensemble. We call one of the members
> the Leader, which leads the ensemble, and the others Followers. The Leader
> does periodic health checks to see if the Followers are doing fine. This is
> on the order of << 1KB/sec.
>
> There is some traffic when leader election happens within the ensemble.
> This might be on the order of 1-2KB/sec.
>
> As you mentioned, reads happen locally. So a good enough link between the
> ensemble members is important so that the Followers can stay up to date
> with the Leader. But again, looking at your config, it looks like it's
> mostly read-only traffic.
>
> One more thing you should be aware of:
> Let's say an ephemeral node was created and the client died; then the
> clients connected to the slow ZooKeeper server (with 2Mb/s links) would lag
> behind the other clients connected to the other servers in noticing that the
> node is gone.
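
To make that scenario concrete, here is a small sketch (hypothetical paths,
hosts, and timeouts) of one client holding an ephemeral znode while another
watches it; a watcher connected to a server behind the slow link simply sees
the NodeDeleted event later:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class EphemeralWatchSketch {
        public static void main(String[] args) throws Exception {
            // Client A registers itself with an ephemeral znode (a real
            // client would wait for SyncConnected before the create).
            ZooKeeper a = new ZooKeeper("zk1.site-a.example.com:2181", 30000, e -> { });
            a.create("/db1-alive", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // Client B, perhaps connected to a server behind the 2Mb link,
            // watches that znode; the NodeDeleted event arrives only once B's
            // server has caught up with the leader.
            ZooKeeper b = new ZooKeeper("zk1.site-c.example.com:2181", 30000, e -> { });
            b.exists("/db1-alive", event -> System.out.println("saw: " + event.getType()));

            a.close();           // session ends; the ensemble removes the ephemeral znode
            Thread.sleep(5000);  // give B time to receive the notification
            b.close();
        }
    }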
>
> In my opinion you should do some testing, since 2Mb/sec seems a little
> dodgy.
>
> Thanks
> mahadev
>
> On 3/7/10 2:09 PM, "Martin Waite"  wrote:
>
> > Hi Mahadev,
> >
> > The inter-site links are a nuisance.  We have two data-centres with 100Mb
> > links which I hope would be good enough for most uses, but we need a 3rd
> > site - and currently that only has 2Mb links to the other sites.  This
> > might be a problem.
> >
> > The ensemble would have a lot of read traffic from applications asking
> > which database to connect to for each transaction - which presumably would
> > be mostly handled by local zookeeper servers (do we call these "nodes" as
> > opposed to znodes?).  The write traffic would be mostly changes to
> > configuration (a rare event), and changes in the health of database
> > servers - also hopefully rare.  I suppose the main concern is how much
> > ambient zookeeper system chatter will cross the links.  Are there any
> > measurements of how much traffic is used by zookeeper in maintaining the
> > ensemble?
> >
> > Another question that occurs is whether I can link sites A, B, and C in a
> > ring - so that if any one site drops out, the remaining 2 continue to talk.
> > I suppose that if the zookeeper servers are all in direct contact with
> > each other, this issue does not exist.
> >
> > regards,
> > Martin
> >
> > On 7 March 2010 21:43, Mahadev Konar  wrote:
> >
> >> Hi Martin,
> >>  As Ted rightly mentions, ZooKeeper is usually run within a colo
> >> because of the low latency requirements of applications that it supports.
> >>
> >> It's definitely reasonable to use it in multi data center environments,
> >> but you should realize the implications of it. The high latency/low
> >> throughput means that you should make minimal use of such a ZooKeeper
> >> ensemble.
> >>
> >> Also, there are things like the tickTime, the syncLimit and others (setup
> >> parameters for ZooKeeper in config) which you will need to tune a little
> >> to get ZooKeeper running without many hiccups in this environment.
> >>
> >> Thanks
> >> mahadev
> >>
> >>
> >> On 3/6/10 10:29 AM, "Ted Dunning"  wrote:
> >>
> >>> What you describe is relatively reasonable, even though Zookeeper is
> >>> not normally distributed across multiple data centers with all members
> >>> getting full votes.  If you account for the limited throughput that this
> >>> will impose on your applications that use ZK, then I think that this can
> >>> work well.  Probably, you would have local ZK clusters for higher
> >>> transaction rate applications.
> >>>
> >>> You should also consider very carefully whether having multiple data
> >>> centers increases or decreases your overall reliability.  Unless you
> >>> design very carefully, this will normally substantially degrade
> >>> reliability.  Making sure that it increases reliability is a really big
> >>> task that involves a lot of surprising (it was to me) considerations and
> >>> considerable hardware and time investments.
> >>>
> >>> Good luck!
> >>>
> >>> On Sat, Mar 6, 2010 at 1:50 AM, Martin Waite  wrote:
> >>>
> >>>> Is this a viable approach, or am I taking Zookeeper out of its
> >>>> application domain and just asking for trouble ?
> 
> >>>
> >>>
> >>
> >>
>
>