Re: problem in deploying zookeeper ensemble

2010-09-23 Thread Henry Robinson
ocalhost:2181):clientcnxn$sendthr...@1000] - Opening
> socket connection to server localhost/0:0:0:0:0:0:0:1:2181
> 2010-09-23 12:53:28,248 - WARN  [main-SendThread(localhost:
> 2181):clientcnxn$sendthr...@1120] - Session 0x0 for server null,
> unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078)
> 2010-09-23 12:53:29,866 - INFO  [main-SendThread(localhost:
> 2181):clientcnxn$sendthr...@1000] - Opening socket connection to
> server localhost/127.0.0.1:2181
> 2010-09-23 12:53:29,868 - WARN  [main-SendThread(localhost:
> 2181):clientcnxn$sendthr...@1120] - Session 0x0 for server null,
> unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078)
>
> What I am able to find here is that the zookeeper client is trying to
> connect to the master at port 2181,
> but according to my configuration all servers are listening at...
>
> server.1  clientPort=2184
> server.2  clientPort=2185
> server.3  clientPort=2186
> .
> .
> .
> This means that it is using the default port, 2181.
>
> Can anybody tell me what exactly the problem is?
> Is the client process unable to find our configuration?
>
>
> Thanks .
> Sanjiv Singh ( iLabs)
> Impetus Infotech (India).
> Mob :+091-9990-447-339
>
>
>
>


-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: Zookeeper CLI

2010-09-07 Thread Henry Robinson
Are you running ZK on the standard ports (in particular, client port 2181)?
Can you telnet into those servers on that port and issue an ruok command?
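
For anyone who would rather script that check than use telnet, here is a
minimal sketch. It assumes the server answers the four-letter command ruok
with imok and then closes the connection; four_letter_word is just an
illustrative helper name, not part of any ZooKeeper client library.

```python
import socket


def four_letter_word(host, port=2181, command=b"ruok", timeout=5.0):
    """Send a four-letter command and return the server's reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        while True:
            # The server is assumed to close the socket once it has replied,
            # so an empty read marks the end of the response.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", "replace")
```

Calling four_letter_word("your-zk-host") should come back with "imok" from a
healthy server; a "Connection refused" error here points at the port or the
address rather than at the CLI.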

Henry

On 7 September 2010 16:48, Avinash Lakshman wrote:

> I have a 5 server ZK cluster. I want to connect to it using the CLI from
> some remote machine. Is there any particular set up that I need to connect?
> I am running it as zkCli.sh -server  where the server name is
> one of the servers in the ZK cluster. Is this correct? I can get a string
> of
> Connect Exceptions as shown below:
>
> 2010-09-07 16:46:07,955 - WARN
>  [main-SendThread(msgzkapp001.ash2.facebook.com:2181
> ):clientcnxn$sendthr...@1120] - Session 0x0 for server null, unexpected
> error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:672)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078)
> 2010-09-07 16:46:09,141 - INFO
>  [main-SendThread(msgzkapp001.ash2.facebook.com:2181
> ):clientcnxn$sendthr...@1000] - Opening socket connection to server
> msgzkapp001.ash2.facebook.com/10.138.31.220:2181
> 2010-09-07 16:46:09,220 - WARN
>  [main-SendThread(msgzkapp001.ash2.facebook.com:2181
> ):clientcnxn$sendthr...@1120] - Session 0x0 for server null, unexpected
> error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:672)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078)
>
>
> Cheers
> Avinash
>





Re: Zookeeper shell

2010-08-31 Thread Henry Robinson
I'm not sure if reverse search is something you get with jline... if not,
that would be a great patch :)

On 31 August 2010 16:00, Michi Mutsuzaki  wrote:

> Great! Just what I was looking for.
>
> Thanks!
> --Michi
>
> On 8/31/10 3:30 PM, "Patrick Hunt"  wrote:
>
> Depending on your classpath setup:
>
> java org.apache.zookeeper.ZooKeeperMain -server 127.0.0.1:2181
>
> if jline jar is in your classpath (included in the zk release
> distribution) you'll get history, auto-complete and such.
>
> Patrick
>
> On 08/31/2010 03:08 PM, Michi Mutsuzaki wrote:
> > Hello,
> >
> > I'm looking for a good zookeeper shell. So far I've only used cli_mt (c
> > client), but it's not very user friendly. Are there any alternatives? In
> > particular, I'm looking for:
> >
> > - command history with reverse search
> > - auto-complete znode path
> >
> > Thanks!
> > --Michi
> >
>
>




Re: What roles do "even" nodes play in the ensamble

2010-08-25 Thread Henry Robinson
Todd -

No, this is not the case. There are no 'backup' or 'failover' nodes in
ZooKeeper. All servers that can vote are working as part of the cluster
until they fail. You need a majority of your voting servers alive.

If you have three servers, a majority is of size two. The number of nodes
that can fail before a majority is no longer alive is one.
If you have four servers, a majority is of size three. The number of nodes
that can fail before a majority is no longer alive is one.
If you have five servers, a majority is of size three. The number of nodes
that can fail before a majority is no longer alive is two.

This is why four servers is worse than three for availability. In both
cases, two servers have to fail before the cluster is no longer available.
However if failures are independently distributed, this is more likely to
happen in a cluster of four nodes than a cluster of three (think of it as
'more things available to go wrong').

If you have four servers and one dies, the 'majority' that still needs to be
alive is still three - it doesn't drop down to two. The majority is of all
voting servers, alive or dead.
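
The arithmetic above can be checked with a small sketch. The function names
are only illustrative; the one assumption is that a majority is more than
half of all configured voting servers.

```python
def majority(voters):
    """Smallest number of voting servers that forms a majority."""
    return voters // 2 + 1


def tolerated_failures(voters):
    """How many voters can fail while a majority is still alive."""
    return voters - majority(voters)


for n in (3, 4, 5):
    print("%d servers: majority of %d, survives %d failure(s)"
          % (n, majority(n), tolerated_failures(n)))
```

Three and four servers both survive only one failure; it takes a fifth
server to survive two.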

Hope this helps -

Henry

On 25 August 2010 21:01, Todd Nine  wrote:

> Thanks Dave.  I've been using Cassandra, so I'm trying to get my head
> around the configuration/operational differences with ZK.  You state
> that using 4 would actually decrease my reliability.  Can you explain
> that further?  I was under the impression that a 4th node would act as a
> non voting read only node until one of the other 3 fails.  I thought
> that this extra node would give me some breathing room by allowing any
> node to fail and still have 3 voting nodes.  Is this not the case?
>
> Thanks,
>
> Todd
>
>
>
>
> On Wed, 2010-08-25 at 21:13 -0600, Ted Dunning wrote:
>
> > Just use 3 nodes.  Life will be better.
> >
> >
> >
> > You can configure the fourth node in the event of one of the first
> > three failing and bring it on line.  Then you can re-configure and
> > restart each of the others one at a time.  This gives you flexibility
> > because you have 4 nodes, but doesn't decrease your reliability the
> > way that using a four node cluster would.  If you need to do
> > maintenance on one node, just configure that node out as if it had
> > failed.
> >
> >
> > On Wed, Aug 25, 2010 at 4:26 PM, Dave Wright 
> > wrote:
> >
> > You can certainly serve more reads with a 4th node, but I'm
> > not sure
> > what you mean by "it won't have a voting role". It still
> > participates
> > in voting for leaders as do all non-observers regardless of
> > whether it
> > is an even or odd number. With zookeeper there is no voting on
> > each
> > transaction, only leader changes.
> >
> > -Dave Wright
> >
> >
> >
> > On Wed, Aug 25, 2010 at 6:22 PM, Todd Nine
> >  wrote:
> > > Do I get any read performance increase (similar to an
> > observer) since
> > > the node will not have a voting role?
> > >
> > >
> >
> >
> >
> >
>





Re: What roles do "even" nodes play in the ensamble

2010-08-25 Thread Henry Robinson
Dave is correct - if you have N nodes you need (N/2) + 1 votes, rounding N/2
down (i.e. a majority), in the standard case to get a vote to pass.

Adding a fourth voting node to a three node cluster will cause the size of a
majority to jump from 2 to 3. The number of nodes that need to fail before
you can no longer get a majority is 2 in both cases - so you don't gain any
reliability by adding a new voting node to an odd-numbered cluster.

The new node will always act as a voter unless you explicitly configure it
as an observer.

Henry

On 25 August 2010 15:11, Dave Wright  wrote:

> I'm not an expert on voting, so there may be a better answer, but from my
> understanding all 4 nodes participate in the voting and you need a majority
> of 3 to elect a leader.
>
> -Dave
>
> On Wed, Aug 25, 2010 at 6:09 PM, Todd Nine 
> wrote:
>
> >  Thanks for that Dave.  If I do not configure it as an observer just a
> > normal member, what will the last even node to join do?
> >
> >
> > 1. Will it participate as a voter on startup?  (I'm assuming not, just
> read
> > only)
> >
> > 2. If one of the voter nodes 1 through 3 dies, does it become a voter?
> >
> >
> >todd
> > SENIOR SOFTWARE ENGINEER
> >
> > todd nine| spidertracks ltd |  117a the square
> > po box 5203 | palmerston north 4441 | new zealand
> > P: +64 6 353 3395 | M: +64 210 255 8576
> > E: t...@spidertracks.co.nz W: www.spidertracks.com
> >
> >
> >
> >
> >
> >   On Wed, 2010-08-25 at 17:57 -0400, Dave Wright wrote:
> >
> > >
> > > 1. When the 4th ZK node joins the cluster, does it take on the observer
> > > role since a quorum cannot be reached with the new node?  Can I still
> > > connect my clients to it and create/remove nodes and receive events?
> >
> > No, it joins as a normal member unless you've configured it as an
> > observer. Note that with 4 nodes you now need 3 running to get a
> > majority, which is why even numbers aren't recommended.
> >
> > >
> > >
> > > 2. In the event 1 of the 3 voting nodes fails, will this 4th node
> become
> > > a voting member of the ensemble?
> >
> > If configured as an observer it remains an observer.
> >
> > >
> > > 3. When a new node comes online, it may have a different ip than the
> > > previous node.  Do I need to update all node configurations and perform
> > > a rolling restart, or will simply connecting the new node to the
> > > existing ensemble make all nodes aware it is running?
> >
> > Unfortunately ZK doesn't have any kind of dynamic configuration like
> > that currently. You need to update all the config files and restart
> > the ensemble.
> >
> > -Dave Wright
> >
> >
>





Re: Adding observers

2010-07-21 Thread Henry Robinson
Hi Avinash -

> (1) Is it possible to increase the number of observers in the cluster
> dynamically?
>

Not really - you can use the same rolling-restart technique that is often
mentioned on this list for adding servers to the ensemble, but you can't add
them and expect that ZK will auto-find them.


> (2) How many observers can I add given that I will seldom write into the
> cluster but will have a lot of reads coming into the system? Can I run a
> cluster with say 100 observers?
>
>
This will be possible, but there is some overhead in communicating with
observers as well as normal voting followers. However, given that writes are
rare, perhaps this kind of overhead would be acceptable?

Henry


> Any insight would be very helpful.
>
> Thanks
> A
>





Re: Errors with Python bindings

2010-07-14 Thread Henry Robinson
Hi Rich -

No, there's not a very easy way to verify the Python bindings version afaik
- would be a useful feature to have though.

My first suggestion is to move to the bindings shipped with 3.3.1 - we fixed
a lot of problems with the Python bindings which improved their stability a
lot. Could you try that and then let us know if you continue to see
problems?

cheers,
Henry

On 14 July 2010 13:14, Rich Schumacher  wrote:

> I'm running a Tornado webserver and using ZooKeeper to store some metadata
> and occasionally the ZooKeeper connection will error out irrevocably.  Any
> subsequent calls to ZooKeeper from this process will result in a
> SystemError.
>
> Here is the relevant portion of the Python traceback:
>  ...
>   File "/usr/lib/pymodules/python2.5/zuul/storage/zoo.py", line 69, in call
>     return getattr(zookeeper, name)(self.handle, *args)
> SystemError: NULL result without error in PyObject_Call
>
> I found this in the ZooKeeper server logs:
>
> 2010-07-13 06:52:46,488 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioservercnxn$fact...@251] - Accepted socket
> connection from /10.2.128.233:54779
> 2010-07-13 06:52:46,489 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@742] - Client attempting to renew
> session 0x429b865a6270003 at /10.2.128.233:54779
> 2010-07-13 06:52:46,489 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:lear...@95] - Revalidating client: 299973596915630083
> 2010-07-13 06:52:46,793 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@1424] - Invalid session
> 0x429b865a6270003 for client /10.2.128.233:54779, probably expired
> 2010-07-13 06:52:46,794 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@1286] - Closed socket connection for
> client /10.2.128.233:54779 which had sessionid 0x429b865a6270003
>
>
> The ZooKeeper ensemble is healthy; each node responds as expected to the
> four letter word commands and a simple restart of the Tornado processes
> "fixes" this.
>
> My question is, if this really is due to session expiration why is a
> SessionExpiredException not raised?  Another question, is there an easy way
> to determine the version of the ZooKeeper Python bindings I'm using?  I
> built the 3.3.0 bindings but I just want to be able to verify that.
>
> Thanks for the help,
>
> Rich






Re: zkPython version compatibility

2010-06-11 Thread Henry Robinson
On 11 June 2010 13:44, Lei Zhang  wrote:

> We've been using zkpython with python2.4 for a couple of weeks, banged our
> stability test suite on it in 8-node cluster setting. So far so good.
>
> However, I wouldn't say zkpython 3.3.1 is "much" improved. The SIGABRT,
> segfault, hang issues we used to run into with 3.2.1 now show up as
> exit(1).
>

Sorry to hear that - can you open JIRAs with reproducible test cases? We'll
be glad to try and fix the problems you're having.

Henry




Re: zkPython version compatibility

2010-06-11 Thread Henry Robinson
Hi Daniel -

It's more of the latter, if I recall correctly.

Digging turned up this:
https://issues.apache.org/jira/browse/ZOOKEEPER-579 where one of the
tests fails in Python < 2.6 (but fails due to a testing
framework incompatibility I think!), and the README was, perhaps
conservatively, updated to say that only 2.6 is tested.

I haven't personally tested a recent zkpython against <2.6, but I am hopeful
that 2.5 will work correctly. I'd encourage you to try it! The zkpython in
3.3.1 is much improved over earlier versions, so it's worth the upgrade.

cheers,
Henry

On 11 June 2010 12:07, Daniel Thumim  wrote:

> Hello,
>
> While upgrading to the 3.3.1 release, I noticed that zkPython
> now has a stated dependency on python 2.6.  I have been using
> it with python 2.5 until now and expected to continue that for
> at least a few more months.  Is the 2.6 dependency for real,
> or is it just that the maintainer isn't testing older versions
> any more and thus is unsure?
>
> Thanks,
>
> -- |)aniel Thumim
>
>




Re: Dynamic adding/removing ZK servers on client

2010-05-03 Thread Henry Robinson
On 3 May 2010 16:40, Dave Wright  wrote:

> > Should this be a znode in the privileged namespace?
> >
>
> I think having a znode for the current cluster members is part of the
> ZOOKEEPER-107 proposal, with the idea being that you could get/set the
> membership just by writing to that node. On the client side, you could
> watch that znode and update your server list when it changes.
>


This is tricky: what happens if the server your client is connected to is
decommissioned by a view change, and you are unable to locate another server
to connect to because other view changes committed while you are
reconnecting have removed all the servers you knew about? We'd need to make
sure that watches on this znode were fired before a view change, but it's
hard to know how to avoid having to wait for a session timeout before a
client that might just be migrating servers reappears in order to make sure
it sees the view change.

Even then, the problem of 'locating' the cluster still exists in the case
that there are no clients connected to tell anyone about it.

Henry




Re: Question on maintaining leader/membership status in zookeeper

2010-04-30 Thread Henry Robinson
Hi Lei -

The 'user cluster' (by which I think you mean the set of clients of
ZooKeeper?) plays no part in leader election. If a majority of ZooKeeper
server nodes can talk to each other, a new leader can be elected. Clients of
the minority server partition will be disconnected - if they too cannot
reach the majority partition then they will not be able to reconnect.

Hope this helps,
Henry

On 30 April 2010 12:45, Lei Gao  wrote:

> Hi Ted,
>
> I 100% agree with what you said. But my question is more about what if my
> zookeeper service cluster is partitioned from a majority of nodes in my USER
> CLUSTER.  In this case, the majority nodes in one network partition can’t
> select a new leader because zookeeper is out of reach.
>
> Another example will be that if there is an asymmetric network failure
> where a majority of nodes from the USER CLUSTER can’t reach the leader while
> the zookeeper still can. How does zookeeper handle such situation?
>
> Thanks,
>
> Lei
>
> On 4/30/10 12:24 PM, "Ted Dunning"  wrote:
>
> There are a variety of situations that can trigger a new leader election
> and a few that can cause the cluster to be unable to elect a new leader.
>  Isolation of just the leader is one of the situations that will cause a new
> leader election.  Isolation of nodes into groups smaller than the quorum
> will result in the cluster freezing.
>
> On Fri, Apr 30, 2010 at 11:56 AM, Lei Gao  wrote:
> Hi,
>
> I have a general question on how zookeeper can maintain its view of the
> user cluster (that zookeeper manages) that is consistent with the nodes in
> the user cluster. In other words, when zookeeper considers the current
> leader is unavailable, does it really guarantee that a majority of nodes in
> the user cluster can’t reach the current leader? The same question applies
> to the membership service as well. Because the zookeeper can be partitioned
> from a majority of the nodes in the user cluster. How does the zookeeper
> handle situations like this?
>
> Thanks,
>
> Lei
>
>
>




Re: zkCli.sh missing from zookeeper package in Cloudera CDH contrib repo?

2010-04-27 Thread Henry Robinson
Hi David -

As far as I can tell this was not a deliberate omission - earlier versions
of the package definitely had zkCli bundled.  Apologies for the
oversight. Can you fix it by copying zkCli.sh from a 3.2.* tarball from the
Apache site?

However, these packages are rather old, and we haven't updated them in a
while. In fact, I believe the contrib repo is no longer available. We at
Cloudera will be releasing first-class packages for ZooKeeper with the next
beta release of our distribution, CDH3 Beta 2. For more details see
http://www.cloudera.com/blog/2010/03/cdh3-beta1-now-available/.

cheers,
Henry

On 26 April 2010 14:05, David Rosenstrauch  wrote:

> I installed the zookeeper package (hbase-0.20-zookeeper) from the Cloudera
> CDH contrib repo.  But neither it nor the hbase-0.20 package that it's
> dependent on seem to supply the zkCli command line utility.
>
> Anyone know if this is intentional and/or how to fix?
>
> Thanks,
>
> DR
>





ZooKeeper gets three Google Summer of Code students

2010-04-26 Thread Henry Robinson
Hi -

Just wanted to announce to the community that we are lucky to have three
talented students working on Google's Summer of Code projects directly
related to ZooKeeper.

Andrei Savu will be working with Patrick Hunt on a Web-based Administrative
Interface, extending and improving Patrick's Django-based front end.
Abmar Barros will be working with Flavio Junqueira on improving ZooKeeper's
failure detector module - making the code cleaner and making it easier to try
out new implementations, as well as implementing a few failure detection
algorithms himself!
Finally, Sergey Doroshenko will be working with me on a Read-Only Mode for
ZooKeeper, which will help bolster ZK's availability in certain
circumstances when a network partition is detected, as well as potentially
optimising the read-path.

(The full list of 450 GSoC students is here:
http://socghop.appspot.com/gsoc/program/list_projects/google/gsoc2010)

Congratulations to all three - we look forward to seeing what you produce
over the summer. Thanks to everyone who applied, suggested projects and
offered to mentor students; this program will have a big effect on
ZooKeeper's visibility and community, as well as hopefully producing some
great code!

cheers,
Henry



Re: python client structure

2010-04-21 Thread Henry Robinson
hildren call itself.
> > >
> > >Numeric arguments map to constants. See ``DATA`` in
> ``help(zookeeper)``
> > >for more information.
> > >
> > >Args:
> > >  rc Return code.
> > >  event Event that caused the watch (often called ``type``
> elsewhere).
> > >  stats Connection state.
> > >  path Znode that triggered this watch.
> > >
> > >Does not provide a return value.
> > >"""
> > >logger.debug('Child watch: (%d, %d, %d, %s)' % (rc, event, state,
> path))
> > >zookeeper.aget_children(self.zh, self.zparent, self.watcher,
> > > self.handler)
> > >
> > >  def run(self):
> > >while True:
> > >  pass
> > >
> > >
> > > def main():
> > >  # Allow Ctrl-C
> > >  signal.signal(signal.SIGINT, signal.SIG_DFL)
> > >
> > >  parser = OptionParser()
> > >  parser.add_option('-v', '--verbose',
> > >dest='verbose',
> > >default=True,
> > >action='store_true',
> > >help='Verbose logging. (default: %default)')
> > >  parser.add_option('--servers',
> > >dest='servers',
> > >default='localhost:2181',
> > >help='Comma-separated list of host:port pairs. (default: %default)')
> > >  global options
> > >  global args
> > >  (options, args) = parser.parse_args()
> > >
> > >  if options.verbose:
> > >logger.setLevel(logging.DEBUG)
> > >  else:
> > >logger.setLevel(logging.INFO)
> > >  formatter = logging.Formatter("%(asctime)s %(filename)s:%(lineno)d -
> > > %(message)s")
> > >  stream_handler = logging.StreamHandler()
> > >  stream_handler.setFormatter(formatter)
> > >  logger.addHandler(stream_handler)
> > >
> > >  zktest = ZKTest()
> > >  zktest.daemon = True
> > >  zktest.start()
> > >
> > >
> > > if __name__ == '__main__':
> > >  main()
> > >
> > >
> > > Thanks!
> > > Travis
> > >
> >
> >
> >
>





Re: python client structure

2010-04-21 Thread Henry Robinson
>help='Verbose logging. (default: %default)')
>  parser.add_option('--servers',
>dest='servers',
>default='localhost:2181',
>    help='Comma-separated list of host:port pairs. (default: %default)')
>  global options
>  global args
>  (options, args) = parser.parse_args()
>
>  if options.verbose:
>logger.setLevel(logging.DEBUG)
>  else:
>logger.setLevel(logging.INFO)
>  formatter = logging.Formatter("%(asctime)s %(filename)s:%(lineno)d -
> %(message)s")
>  stream_handler = logging.StreamHandler()
>  stream_handler.setFormatter(formatter)
>  logger.addHandler(stream_handler)
>
>  zktest = ZKTest()
>  zktest.daemon = True
>  zktest.start()
>
>
> if __name__ == '__main__':
>  main()
>
>
> Thanks!
> Travis
>





Re: Would this work?

2010-04-20 Thread Henry Robinson
Hi Avinash -

It's definitely possible to have an in-process ZK server - I've done it -
but it's not always easy. Are you passing a configuration file to
QuorumPeerMain.main? Are there any errors when you run that method? I think,
from recollection, that QPM.main should block in the standalone case, so are
you constructing the ZooKeeper object in a different thread? Are you giving
the server enough time to come up?

The error you have means that the server is not coming up for clients on
port 2181 at 10.18.39.211. Is this the right address?

cheers,
Henry

On 20 April 2010 13:25, Avinash Lakshman  wrote:

> Hi All
>
> This may sound weird but I want to know if there is something inherent that
> would preclude this from working. I want to have a thrift based service
> which exposes some API to read/write to certain znodes. I want ZK to run
> within the same process. So I will start the ZK process from within my main
> using QuorumPeerMain.main(). Now the implementation of my API would
> instantiate a ZooKeeper object and try reading/writing from specific znodes
> as the case may be. I tried running this and as soon as I instantiate my
> ZooKeeper object I get some really weird exceptions. What is wrong in this
> approach? Here is a snapshot of the stack trace:
>
> 2010-04-20 13:14:31,551 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:zookeeper.version=3.1.1-755636, built on 03/18/2009 16:52 GMT
> 2010-04-20 13:14:31,552 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:host.name=a.b.c.com
> 2010-04-20 13:14:31,552 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:java.version=1.7.0-ea
> 2010-04-20 13:14:31,552 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:java.vendor=Sun Microsystems Inc.
> 2010-04-20 13:14:31,553 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:java.home=/usr/local/jdk1.7-drop/jre
> 2010-04-20 13:14:31,553 - INFO  [pool-1-thread-1:environm...@97] - Client
>
> environment:java.class.path=config/:lib/zookeeper-3.1.1.jar:lib/log4j-1.2.15.jar:lib/antlr-2.7.7.jar:li
>
> b/antlr-3.0.1.jar:lib/atlas.jar:lib/commons-cli-1.1.jar:lib/DiscoveryService.jar:lib/fb303.jar:lib/if-java.jar:lib/jline-0.9.94.jar:lib/stringtemplate-3.0.jar:lib/thrift.jar:lib
> /atlasimpl.jar:lib/slf4j-api-1.5.8.jar:lib/slf4j-log4j12-1.5.8.jar
> 2010-04-20 13:14:31,553 - INFO  [pool-1-thread-1:environm...@97] - Client
>
> environment:java.library.path=/usr/local/jdk1.7-drop/jre/lib/amd64/server:/usr/local/jdk1.7-drop/jre/li
>
> b/amd64:/usr/local/jdk1.7-drop/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib
> 2010-04-20 13:14:31,554 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:java.io.tmpdir=/tmp
> 2010-04-20 13:14:31,554 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:java.compiler=
> 2010-04-20 13:14:31,554 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:os.name=Linux
> 2010-04-20 13:14:31,555 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:os.arch=amd64
> 2010-04-20 13:14:31,555 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:os.version=2.6.12-1.1398_FC4smp
> 2010-04-20 13:14:31,555 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:user.name=root
> 2010-04-20 13:14:31,555 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:user.home=/root
> 2010-04-20 13:14:31,556 - INFO  [pool-1-thread-1:environm...@97] - Client
> environment:user.dir=/var/myservice
> 2010-04-20 13:14:31,557 - INFO  [pool-1-thread-1:zookee...@341] -
> Initiating
> client connection, host=a.b.c.com sessionTimeout=1
> watcher=a.b.c.mycl...@716c9867
> 2010-04-20 13:14:31,558 - INFO  [pool-1-thread-1:clientc...@91] -
> zookeeper.disableAutoWatchReset is false
> 2010-04-20 13:14:31,566 - INFO
>  [pool-1-thread-1-SendThread:clientcnxn$sendthr...@800] - Attempting
> connection to server a.b.c.com/10.18.39.211:2181
> 2010-04-20 13:14:31,567 - WARN
>  [pool-1-thread-1-SendThread:clientcnxn$sendthr...@898] - Exception
> closing
> session 0x0 to sun.nio.ch.selectionkeyi...@7b2884e0
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:864)
> 2010-04-20 13:14:31,568 - WARN
>  [pool-1-thread-1-SendThread:clientcnxn$sendthr...@932] - Ignoring
> exception
> during shutdown input
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:656)
>     at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:378)
>     at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:930)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:901)
>





Re: the error

2010-03-31 Thread Henry Robinson
>> Using two machines running ZK will actually decrease your reliability
>> compared to using a single machine.  Consider using one machine or three.
>>
>
> ?
>
> Not meaning to pull the thread off-topic, but I don't understand why this
> should be the case.  Can you elaborate?
>
>
With majority-based quorums, a 2 node ensemble will fail if either machine
fails. This is the same as a 1 node ensemble - if one machine fails, the
ensemble fails.

But with the 2 nodes, you have (roughly) double the probability of
experiencing a single failure, with the same failure profile.
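
To put rough numbers on that, here is a sketch that assumes each machine
fails independently with the same probability p (a simplification; real
failures are often correlated). p_unavailable is an illustrative name, and
the ensemble counts as unavailable once fewer than a majority of its
machines are alive.

```python
from math import comb


def p_unavailable(n, p):
    """Probability that an n-machine ensemble has lost its majority."""
    majority = n // 2 + 1
    max_ok_failures = n - majority  # failures the ensemble can absorb
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(max_ok_failures + 1, n + 1))


for n in (1, 2, 3, 4):
    print(n, p_unavailable(n, 0.01))
```

With p = 0.01 the two-machine figure works out to 2p - p^2, just under twice
the single-machine figure, and the same calculation shows four machines
doing worse than three.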

cheers,
Henry


> Thanks,
>
> DR
>





Re: How to ensure trasaction create-and-update

2010-03-29 Thread Henry Robinson
On 29 March 2010 19:10, Ted Dunning  wrote:

> This is not a good thing.  ZK gains lots of its power and reliability by
> not
> trying to do atomic updates to multiple znodes at once.
>
>
Ted -

Could you say a bit about how you feel ZK would sacrifice power and
reliability through multi-node updates? My view is that it wouldn't: since
all operations are executed serially, there's no concurrency to be lost by
allowing multi-updates, and there doesn't need to be a 'start / end'
transactional style interface (which I do believe would be very bad).

I could see ZK implement a Sinfonia-style batch operation API which makes
all-or-none updates. The reason I can see that it doesn't already allow this
is the avowed intent of the original ZK team to keep the API as simple as it
can reasonably be, and to not introduce complexity without need.
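
To make the all-or-none idea concrete, here is a toy sketch over an
in-memory dict. It is not ZooKeeper's API: apply_batch, BatchError and the
(kind, path, data) tuples are all hypothetical names for illustration. It
leans on the same assumption as above, that operations are applied
serially, so staging on a copy and swapping it in is enough.

```python
class BatchError(Exception):
    """Raised when any operation in a batch cannot be applied."""


def apply_batch(store, ops):
    """Apply every op to `store` (a dict of path -> data), or none of them."""
    staged = dict(store)  # work on a copy so a failure leaves `store` untouched
    for kind, path, data in ops:
        if kind == "create":
            if path in staged:
                raise BatchError("node exists: " + path)
            staged[path] = data
        elif kind == "set":
            if path not in staged:
                raise BatchError("no such node: " + path)
            staged[path] = data
        else:
            raise BatchError("unknown op: " + kind)
    # Every op validated, so publish the staged state in one step.
    store.clear()
    store.update(staged)
```

A create-and-update pair would be passed as one list of two tuples; if the
update fails, the create never becomes visible.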

cheers,
Henry



> Can you say more about the update that you want to do?  It is common for
> updates like this to be such that you can order the updates and do without a
> truly atomic transaction.  For instance if one file is a list of other
> files
> (say for a queue) and you need to create a file and add a reference in the
> list of files, you can generally be safe creating the new file first and
> then doing an atomic update on the list of files secondly.  If your process
> fails between the two operations, then you may generate a small number of
> garbage files (this number can be substantially decreased by careful use of
> try/finally) which might require a cleanup process to run occasionally to
> find unreferenced and old files.
>
> On Mon, Mar 29, 2010 at 6:54 PM, zd.wbh  wrote:
>
> >   We'd like to store some metadata in ZooKeeper in our upcoming project.
> > Here is a special but common case: we need to create a new znode and, at
> > the same time, update another znode's data. These operations (a create and
> > an update) need to be done atomically. We don't want to see a successful
> > creation and a failed update. Is there a convenient way to ensure this?
> > Can you give me some tips?
> >
> >    I've looked into the source code, and there is a tedious way to do it:
> > extend the ZooKeeper instruction set, add a "createAndUpdate" interface
> > and a txn request, and let DataTree ensure the integrity. Will this work,
> > and is it the only way?
> >
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: deleting a node - command line tool

2010-03-26 Thread Henry Robinson
Making delete optionally recursive would be a nice patch to have
(hint, hint ;)).

I'm not sure if the cli allows for a one-shot command (i.e.
bin/zkCli.sh -server localhost:2181 -exec delete /katta) - but if it
doesn't, it shouldn't be hard to add.
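
For anyone tempted by the hint, a rough sketch of the recursion in Python. It
runs against a tiny in-memory stand-in (FakeZk, a hypothetical class written
for this example, not a real client) whose delete refuses to remove a node
that still has children, which is the behaviour Nick describes for the cli.
The point is only the order of operations: children first, then the node.

```python
class FakeZk:
    """Hypothetical in-memory stand-in for a client; not a real ZooKeeper API."""

    def __init__(self, paths):
        self.paths = set(paths)

    def get_children(self, path):
        # Direct children only: names under path with no further '/' in them.
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.paths
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

    def delete(self, path):
        # Mirror the non-recursive delete: only leaf nodes may be removed.
        if self.get_children(path):
            raise RuntimeError("node has children: " + path)
        self.paths.remove(path)


def delete_recursive(zk, path):
    """Delete path and everything below it, deepest nodes first."""
    for child in zk.get_children(path):
        delete_recursive(zk, path.rstrip("/") + "/" + child)
    zk.delete(path)


zk = FakeZk(["/katta", "/katta/a", "/katta/a/b", "/katta/c", "/other"])
delete_recursive(zk, "/katta")
```

Note that this is not atomic: if another client creates a child while the
recursion is running, the final delete of the parent can still fail and would
need a retry.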

Henry

On 26 March 2010 10:06, Nick Dimiduk  wrote:
> The delete command provided by bin/zkCli.sh will delete a leaf node but is
> not recursive. I don't have a copy on my desk, but I believe there's code in
> the O'Reilly Hadoop book for recursive node delete.
>
> -Nick
>
> On Fri, Mar 26, 2010 at 9:42 AM, Karthik K  wrote:
>
>> Hi -
>>  I am looking to delete a node (say, /katta) from a running zk ensemble
>> altogether and curious if there is any command-line tool that is available
>> that can do a delete.
>>
>> --
>>   Karthik.
>>
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: Modify ZooKeeper Java client to hold weak references to Watcher objects

2010-03-18 Thread Henry Robinson
Yes - the watchers aren't simply relay objects, they typically actually
process the callback.

Scaling out the watchers in a single client is a laudable aim, but I think
this proposal would impact some common use cases.

Henry

On 18 March 2010 15:47, Ted Dunning  wrote:

> This kind of sounds strange to me.
>
> My typical idiom is to create a watcher but not retain any references to it
> outside the client.  It sounds to me like your change will cause my
> watchers to be collected and deactivated when GC happens.
>
> On Thu, Mar 18, 2010 at 3:32 AM, Dominic Williams <
> thedwilli...@googlemail.com> wrote:
>
> >
> > The current ZooKeeper client holds strong references to Watcher objects.
> > I want to change the client so it only holds weak references. Feedback
> > please.
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: persistent storage and node recovery

2010-03-15 Thread Henry Robinson
The advantages of a DHT often include:

1. bounded size routes
2. load balancing
3. dynamic membership

at the cost of only making very weak consistency guarantees. Typically a DHT
is used for very read heavy workloads - such as CDNs - where the p2p
approach is very scalable. But it's extremely hard to make consistent
updates, because generally to do so you need to make sure a majority of the
replicas of a given item are updated at the same time. ZooKeeper won't scale
as far as a DHT (talking about billions of entries), but it does ensure that
all clients see a linearizable, consistent history on all updates. There is
a fundamental tension between synchronicity of updates and scale.

Henry

On 15 March 2010 18:17, Maxime Caron  wrote:

> I now understand that ZK is NOT a distributed hash table.
> I only wondered if it were possible to build one with the same level of
> consistency by using an ordered update log like ZK does.
> If it is possible, I think it would be a cool solution to a lot of problems
> out there, not necessarily the same ones ZK tries to solve.
> Something along the lines of Wuala:
> http://www.youtube.com/watch?v=3xKZ4KGkQY8
>
> On 15 March 2010 21:28, Ted Dunning  wrote:
>
> > I don't think that you have considered the impact of ordered updates
> > here.
> >
> > On Mon, Mar 15, 2010 at 6:19 PM, Maxime Caron wrote:
> >
> > > So this is all about the "operation log": if a node is in the minority
> > > but has a more recent committed value, this node has a veto over the
> > > other nodes.
> > >
> >
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: persistent storage and node recovery

2010-03-15 Thread Henry Robinson
Hi Maxime -

I'm not very familiar with Scalaris, but I can answer for the ZooKeeper side
of things.

ZooKeeper servers log each operation to a persistent store before they vote
on the outcome of that operation. So if a vote passes, we know that a
majority of servers has written that operation to disk. Then, if a node
fails and restarts, it can read all the committed operations from disk. As
long as a majority of nodes is still working, at least one of them will have
seen all the committed operations.

If we didn't do this, the loss of a majority of servers (even if they
restarted) could mean that updates are lost. But ZooKeeper is meant to be
durable - once a write is made, it will persist for the lifetime of the
system if it is not overwritten later. So in order to properly tolerate
crash failures and not lose any updates, you have to make sure a majority of
servers write to disk.
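
The reason "a majority wrote it to disk" is enough can be checked by brute
force: any two majorities of the same ensemble share at least one server, so
whichever majority is up after a failure includes someone who logged the
operation. A small Python sketch (the function names are made up for this
example):

```python
from itertools import combinations


def majorities(n):
    """All subsets of servers 0..n-1 that contain more than half of them."""
    need = n // 2 + 1
    return [set(c)
            for size in range(need, n + 1)
            for c in combinations(range(n), size)]


def every_pair_intersects(n):
    """True if any two majorities of an n-server ensemble share a server."""
    quorums = majorities(n)
    return all(a & b for a in quorums for b in quorums)


# Any majority that acknowledged a write overlaps any majority that is
# alive later, so at least one live server has the write in its log.
three_ok = every_pair_intersects(3)
five_ok = every_pair_intersects(5)
```

This only covers the counting argument; it says nothing about how the
surviving servers agree on which log is the most complete.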

There is no possibility of more replicas being in the system than are
allowed - you start off with a fixed number, and never go above it.

Hope this helps - let me know if you have any further questions!

Henry

-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679

On 15 March 2010 16:47, Maxime Caron  wrote:

> Hi everybody,
>
> From what I understand, ZooKeeper's consistency model works the same way as
> Scalaris's, which is to keep the majority of the replicas for an item up.
>
> In Scalaris, if a single failed node crashes and recovers, it simply starts
> like a fresh new node and all data is lost.
>
> This is the case because it may otherwise get some inconsistencies as
> another node already took over.
>
> For a short timeframe there might be more replicas in the system than
> allowed, which destroys the proper functioning of our majority based
> algorithms.
>
> So my question is: how does ZooKeeper use the persistent storage during node
> recovery, and how are its majority based algorithms different so that
> consistency is preserved?
>
>
> Thanks a lots
>
> Maxime Caron
>


Re: Zookeeper unit tester?

2010-03-09 Thread Henry Robinson
Not to my knowledge, although such a thing would be nice to have. We are
very busy putting together the 3.3.0 release for the next few days, and
after that will be thinking about directions for 3.4.0 - testability will
definitely come up.

If this is something you're keen to have, please do create a JIRA (and even
better, consider contributing ;)).

cheers,
Henry

On 9 March 2010 14:23, David Rosenstrauch  wrote:

> Just wondering if there was a mock/fake version of
> org.apache.zookeeper.Zookeeper that could be used for unit testing? What I'm
> envisioning would be a single instance Zookeeper that operates completely in
> memory, with no network or disk I/O.
>
> This would make it possible to pass one of the memory-only FakeZookeeper's
> into unit tests, while using a real Zookeeper in production code.
>
> Any such "animal"?  :-)
>
> Thanks,
>
> DR
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: zookeeper utils

2010-03-02 Thread Henry Robinson
Just to illustrate one of the primitives you're looking for: an
AtomicInteger equivalent would be fairly easy to construct, with nearly
identical semantics to the Java version.

Let's say a given znode has four bytes of data that represent an integer
value. Get operations or set operations are easy, as ZK will make sure that
all operations are atomic, so they happen in a linearizable order. A
get-and-set operation can be performed by reading the znode via getData, and
performing a conditional update using the version number of the znode that
was returned as part of the getData operation. ZooKeeper's setData operation
takes an optional version number (set it to -1 to ignore it) which tells the
operation to succeed only if the znode's version hasn't changed since. Other
operations can use this procedure as a base.

This is exactly how Java's getAndSet is implemented - neither implementation
is wait-free, but they are still lock-free: some process will always make
progress.
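
To make the procedure concrete, here is a sketch of the read-then-conditional-
write loop in Python. It runs against an in-memory stand-in (FakeZnode and
BadVersion are hypothetical names for this example, not client classes) that
mimics the one rule that matters: a set with a version other than -1 succeeds
only if that version is still current.

```python
import struct


class BadVersion(Exception):
    """Raised by the stand-in when a conditional set loses the race."""


class FakeZnode:
    """Hypothetical in-memory znode holding four bytes and a version."""

    def __init__(self, value=0):
        self._data = struct.pack(">i", value)
        self._version = 0

    def get_data(self):
        return self._data, self._version

    def set_data(self, data, version):
        # version == -1 means "ignore the version", as described above.
        if version != -1 and version != self._version:
            raise BadVersion()
        self._data = data
        self._version += 1


def get_and_set(znode, new_value):
    """Store new_value and return the value it replaced."""
    while True:
        data, version = znode.get_data()
        try:
            znode.set_data(struct.pack(">i", new_value), version)
        except BadVersion:
            continue  # someone else updated first: re-read and retry
        return struct.unpack(">i", data)[0]


def increment_and_get(znode):
    """Add one to the stored integer and return the new value."""
    while True:
        data, version = znode.get_data()
        value = struct.unpack(">i", data)[0] + 1
        try:
            znode.set_data(struct.pack(">i", value), version)
        except BadVersion:
            continue
        return value


node = FakeZnode(5)
previous = get_and_set(node, 9)
current = increment_and_get(node)
```

A retry only happens when another writer got in first, which is why some
process always makes progress even though a given caller may loop.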

Hope this helps - let me know if you'd like more detail on exactly how to
build this.

Henry

-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679

On 2 March 2010 20:18, David Rosenstrauch  wrote:

> On 03/02/2010 05:52 PM, Ted Dunning wrote:
>
>> What other examples are you looking for?
>>
>> On Tue, Mar 2, 2010 at 1:04 PM, David Rosenstrauch wrote:
>>
>>  Is there a library of higher-level zookeeper utilities that people have
>>> contributed, beyond the barrier and queue examples provided in the docs?
>>>
>>
> Well, first off, I'm trying to get familiar with Zookeeper's capabilities,
> and I figured that'd be a good place to start.
>
> Aside from that, though, we're going to need something like AtomicInteger
> for an app I'm about to start working on, so I was looking to see if there
> was already some code out there that got me all or part of the way there.
>
> DR
>


Re: Usage of myId

2010-03-01 Thread Henry Robinson
If you have two servers with the same myid, two servers will identify
themselves as the 'same' machine X in a ZooKeeper ensemble. This id is used
to map onto a hostname / port pair where messages for a given server are
sent. Assuming a consistent quorum specification across all machines,
messages for server X will only go to one machine and the other will think
itself partitioned from the network.

Servers need ids to distinguish themselves from other servers in order to
break symmetry and successfully elect a leader.

Henry

On 27 February 2010 23:06, Qian Ye  wrote:

> myid is used to identify your service instance, with its help, it is
> possible to start more than one Zookeeper service on one computer. If the
> configuration of myid is wrong, the service can not be started properly.
>
> On Sun, Feb 28, 2010 at 11:39 AM, Avinash Lakshman <
> avinash.laksh...@gmail.com> wrote:
>
> > Why is this important? What breaks down if I have 2 servers with the same
> > myId?
> >
> > Cheers
> > A
> >
>
>
>
> --
> With Regards!
>
> Ye, Qian
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: is there a good pattern for leases ?

2010-02-24 Thread Henry Robinson
A cautionary note with this problem - who says when 2 minutes is up? Clocks
will go forward at different rates and with different offsets. You cannot
rely on two machines having the same perception of what 2 minutes means. In
general, in distributed systems, it's a good design principle to minimise
any dependence on a common notion of real time.

That said the best way is to pick some machine, like Mahadev says, to retire
old locks by polling every N seconds, where N is the slop you can afford.
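
A sketch of that polling approach in Python, under the assumption that the
reaper times each lock from the moment it first sees it, using only its own
clock. The function names and the first_seen dict are invented for this
example; listing and deleting the lock znodes is left out.

```python
def record_locks(first_seen, current_locks, now):
    """Note when the reaper first saw each lock, by the reaper's own clock."""
    for path in current_locks:
        first_seen.setdefault(path, now)
    # Forget locks that have already gone away on their own.
    for path in list(first_seen):
        if path not in current_locks:
            del first_seen[path]


def expired_locks(first_seen, now, ttl):
    """Locks the reaper has been watching for at least ttl seconds."""
    return sorted(p for p, seen in first_seen.items() if now - seen >= ttl)


# One reaper polls every N seconds; 'now' would come from its own
# monotonic clock (for example time.monotonic()), never from the clients.
first_seen = {}
record_locks(first_seen, {"/locks/a"}, now=0.0)
record_locks(first_seen, {"/locks/a", "/locks/b"}, now=100.0)
due_at_120 = expired_locks(first_seen, now=120.0, ttl=120.0)
due_at_220 = expired_locks(first_seen, now=220.0, ttl=120.0)
```

Because a lock is only noticed at the next poll, it can live up to N seconds
longer than the nominal two minutes, which is the slop mentioned above.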

What problem are you actually trying to solve?

cheers,
Henry

On 24 February 2010 03:40, Martin Waite  wrote:

> Hi,
>
> Is there a good model for implementing leases in Zookeeper ?
>
> What I want to achieve is for a client to create a lock, and for that lock
> to disappear two minutes later - regardless of whether the client is still
> connected to zk.   Like ephemeral nodes - but with a time delay.
>
> regards,
> Martin
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: Q about ZK internal: how commit is being remembered

2010-01-27 Thread Henry Robinson
Hi -

Note that a machine that has the highest received zxid will necessarily have
seen the most recent transaction that was logged by a quorum of followers
(the FIFO property of TCP again ensures that all previous messages will have
been seen). This is the property that ZAB needs to preserve. The idea is to
avoid missing a commit that went to a node that has since failed.

I was therefore slightly imprecise in my previous mail - it's possible for
only partially-proposed proposals to be committed if the leader that is
elected next has seen them. Only when another proposal is committed instead
must the original proposal be discarded.
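
Qian Ye's five-server scenario can be sketched in a few lines of Python. This
is a simplification for illustration, not the FastLeaderElection code: votes
are reduced to a last-logged zxid per server, and breaking ties by the larger
server id is an assumption made for the example.

```python
def elect(last_zxid_by_server):
    """Pick the server with the highest logged zxid.

    last_zxid_by_server maps server id -> highest zxid in that server's log.
    Ties go to the larger server id (an assumption for this sketch).
    """
    return max(last_zxid_by_server,
               key=lambda sid: (last_zxid_by_server[sid], sid))


# Proposal A (zxid 7) reached only server 2 before every server restarted.
logs = {1: 6, 2: 7, 3: 6, 4: 6, 5: 6}
leader = elect(logs)
# The new leader has A in its log, so A is carried forward and committed,
# even though the client that issued it saw a timeout.
committed_zxid = logs[leader]
```

From the client's side a timeout therefore means "outcome unknown", not
"definitely failed".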

I highly recommend Ben Reed's and Flavio Junqueira's LADIS paper on the
subject, for those with portal.acm.org access:
http://portal.acm.org/citation.cfm?id=1529978

Henry

On 27 January 2010 21:52, Qian Ye  wrote:

> Hi Henry:
>
> According to your explanation, "*ZAB makes the guarantee that a proposal
> which has been logged by a quorum of followers will eventually be
> committed*". However, the source code of ZooKeeper (the
> FastLeaderElection.java file) shows that, in the election, the candidates
> only provide their zxid in the votes, and the one with the max zxid wins the
> election. I mean, it seems that no check has been made to make sure that the
> latest proposal has been logged by a quorum of servers.
>
> In this situation, ZooKeeper would deliver a proposal which is known to the
> client as a failed one. Imagine this scenario: a ZooKeeper cluster with
> 5 servers, where the leader receives only 1 ack for proposal A; after a
> timeout, the client is told that the proposal failed. At this time, all
> servers restart due to a power failure. The server that has proposal A in
> its log would become the leader, even though the client was told that
> proposal A failed.
>
> Do I misunderstand this?
>
>
> On Wed, Jan 27, 2010 at 10:37 AM, Henry Robinson wrote:
>
> > Qing -
> >
> > That part of the documentation is slightly confusing. The elected leader
> > must have the highest zxid that has been written to disk by a quorum of
> > followers. ZAB makes the guarantee that a proposal which has been logged
> > by a quorum of followers will eventually be committed. Conversely, any
> > proposals that *don't* get logged by a quorum before the leader sending
> > them dies will not be committed. One of the ZAB papers covers both these
> > situations - making sure proposals are committed or skipped at the right
> > moments.
> >
> > So you get the neat property that leader election can be live in exactly
> > the case where the ZK cluster is live. If a quorum of peers aren't
> > available to elect the leader, the resulting cluster won't be live anyhow,
> > so it's ok for leader election to fail.
> >
> > FLP impossibility isn't actually strictly relevant for ZAB, because FLP
> > requires that message reordering is possible (see all the stuff in that
> > paper about non-deterministically drawing messages from a potentially
> > deliverable set). TCP FIFO channels don't reorder, so provide the extra
> > signalling that ZAB requires.
> >
> > cheers,
> > Henry
> >
> > 2010/1/26 Qing Yan 
> >
> > > Hi,
> > >
> > > I have question about how zookeeper *remembers* a commit operation.
> > >
> > > According to
> > >
> > > http://hadoop.apache.org/zookeeper/docs/r3.2.2/zookeeperInternals.html#sc_summary
> > >
> > > 
> > >
> > >
> > > The leader will issue a COMMIT to all followers as soon as a quorum of
> > > followers have ACKed a message. Since messages are ACKed in order,
> > > COMMITs will be sent by the leader as received by the followers in order.
> > >
> > > COMMITs are processed in order. Followers deliver a proposal's message
> > > when that proposal is committed.
> > >
> > >
> > > My question is: will the leader wait for the COMMIT to be processed by a
> > > quorum of followers before considering the COMMIT a success? From the
> > > documentation it seems that the leader handles COMMIT asynchronously and
> > > doesn't expect confirmation from followers. In the extreme case, what
> > > happens if the leader issues a COMMIT to all followers and crashes
> > > immediately, before the COMMIT message can go out on the network? How
> > > does the system remember that the COMMIT ever happened?
> > >
> > > Actually this is related to the leader election process

Re: Q about ZK internal: how commit is being remembered

2010-01-26 Thread Henry Robinson
Qing -

That part of the documentation is slightly confusing. The elected leader
must have the highest zxid that has been written to disk by a quorum of
followers. ZAB makes the guarantee that a proposal which has been logged by
a quorum of followers will eventually be committed. Conversely, any
proposals that *don't* get logged by a quorum before the leader sending them
dies will not be committed. One of the ZAB papers covers both these
situations - making sure proposals are committed or skipped at the right
moments.

So you get the neat property that leader election can be live in exactly the
case where the ZK cluster is live. If a quorum of peers aren't available to
elect the leader, the resulting cluster won't be live anyhow, so it's ok for
leader election to fail.

FLP impossibility isn't actually strictly relevant for ZAB, because FLP
requires that message reordering is possible (see all the stuff in that
paper about non-deterministically drawing messages from a potentially
deliverable set). TCP FIFO channels don't reorder, so provide the extra
signalling that ZAB requires.

cheers,
Henry

2010/1/26 Qing Yan 

> Hi,
>
> I have question about how zookeeper *remembers* a commit operation.
>
> According to
>
> http://hadoop.apache.org/zookeeper/docs/r3.2.2/zookeeperInternals.html#sc_summary
>
> 
>
>
> The leader will issue a COMMIT to all followers as soon as a quorum of
> followers have ACKed a message. Since messages are ACKed in order, COMMITs
> will be sent by the leader as received by the followers in order.
>
> COMMITs are processed in order. Followers deliver a proposal's message when
> that proposal is committed.
>
>
> My question is: will the leader wait for the COMMIT to be processed by a
> quorum of followers before considering the COMMIT a success? From the
> documentation it seems that the leader handles COMMIT asynchronously and
> doesn't expect confirmation from followers. In the extreme case, what
> happens if the leader issues a COMMIT to all followers and crashes
> immediately, before the COMMIT message can go out on the network? How does
> the system remember that the COMMIT ever happened?
>
> Actually this is related to the leader election process:
>
> 
> ZooKeeper messaging doesn't care about the exact method of electing a
> leader as long as the following holds:
>
>   - The leader has seen the highest zxid of all the followers.
>   - A quorum of servers have committed to following the leader.
>
> Of these two requirements only the first, the highest zxid among the
> followers, needs to hold for correct operation.
>
>
>
> Is there a liveness issue in trying to find "the leader has seen the highest
> zxid of all the followers"? What if some of the followers (which happen to
> hold the highest zxid) cannot be contacted (the FLP impossibility result)?
> It would be more straightforward if COMMIT required confirmation from a
> quorum of the followers. But I guess things get optimized according to Zab's
> FIFO nature... just want to hear some clarification about it.
>
> Thanks a lot!
>


Re: ZAB kick Paxos butt?

2010-01-20 Thread Henry Robinson
Qing -

Also, as you pointed out, ZAB requires this FIFO property of the
point-to-point links. Paxos copes with more adversarial networks which allow
reordering and missed messages. It's easy to alter Paxos so as not to
'publish' the results of consensus rounds where there are gaps in the
previous commit history. (You may be interested in the 'Fast Paxos' paper by
Lamport which talks about making the protocol 2-message optimal in all cases
when order is not important, i.e. the messages commute). You can express the
ordering dependency between messages by supplying a proposal number with
each that is monotonically increasing in causal order.

ZAB takes care of all of this for you by using TCP sequence numbers, and
gets deep pipelining from knowing that no update being voted on depends on
an update that has not yet arrived. The 'cost' is relying on a stronger
network model than Paxos presupposes.

Henry

2010/1/20 Benjamin Reed 

> hi Qing,
>
> i'm glad you like the page and Zab.
>
> yes, we are very familiar with Paxos. that page is meant to show a weakness
> of Paxos and a design point for Zab. it is not to say Paxos is not useful.
> Paxos is used in the real world in production systems. sometimes there are
> not order dependencies between messages, so Paxos is fine.
>
> in cases where order is important, multiple messages are batched into a
> single operation and only one operation is outstanding at a time. (i believe
> that this is what Chubby does, for example.) this is the solution you allude
> to: wait for 27 to commit before 28 is issued.
>
> for ZooKeeper we do have order dependencies and we wanted to have multiple
> operations in progress at various stages of the pipeline to allow us to
> lower latencies as well as increase our bandwidth utilization, which led us
> to Zab.
>
> ben
>
>
> Qing Yan wrote:
>
>> Hello,
>>    Anyone familiar with the Paxos protocol here?
>>    I was doing some comparison of ZAB vs Paxos... first of all, ZAB's
>> FIFO based protocol is really cool!
>>
>>  http://wiki.apache.org/hadoop/ZooKeeper/PaxosRun mentioned the
>> inconsistency case for Paxos ("the state change B depends upon A, but A was
>> not committed").
>>  In the "Paxos Made Simple" paper, the author suggests filling the gap (lost
>> state machine changes) with a "NO OP" operation.
>>
>>  Now I have some serious doubts about how Paxos could be of any use in the
>> real world. Yeah, you do reach consensus, albeit the content
>> is inconsistent/corrupted!?
>>
>>  E.g. on the wiki page, why does the Paxos state machine allow firing off
>> 27 and 28 concurrently when there is actually a dependency? Shouldn't you
>> wait for instance 27 to be committed before starting 28?
>>  Did I miss something?
>>
>>  Thanks for the enlightenment!
>>
>>   Cheers
>>
>>Qing
>>
>>
>
>


Re: Namespace partitioning ?

2010-01-15 Thread Henry Robinson
Sounds like it is definitely worth a JIRA - please do create one!
Keeping the discussion together can focus it, and is much more likely
to lead to patches :)

Henry

2010/1/15 Kay Kay :
> Thanks Mahadev / Flavio for the pointers.
>
> There are definitely some practical scenarios that we feel would be useful
> with this, which I should be able to put down over the weekend.
>
> Curious, does this warrant a jira to consolidate the discussion and keep it
> in one place? I had been trying to gather bits and pieces from various
> sources.
>
>
>
> On 1/15/10 1:17 AM, Flavio Junqueira wrote:
>>
>> Hi, Mahadev said it all, we have been thinking about it for a while, but
>> haven't had time to work on it. I also don't think we have a jira open for
>> it; at least I couldn't find one. But, we did put together some comments:
>>
>>    http://wiki.apache.org/hadoop/ZooKeeper/PartitionedZookeeper
>>
>> One of the main issues we have observed there is that partitioning will
>> force us to change our consistency guarantees, which is far from ideal.
>> However, some users seem to be ok with it, but I'm not sure we have
>> agreement.
>>
>> In any case, please feel free to contribute or simply express your
>> interests so that we can take them into account.
>>
>> Thanks,
>> -Flavio
>>
>>
>> On Jan 15, 2010, at 12:49 AM, Mahadev Konar wrote:
>>
>>> Hi kay,
>>>  the namespace partitioning in zookeeper has been on a back burner for a
>>> long time. There isn't any jira open on it. There have been some
>>> discussions on this but no real work. Flavio/Ben have had this on their
>>> minds for a while but no real work/proposal is out yet.
>>>
>>> May I know if this is something you are looking for in production?
>>>
>>> Thanks
>>> mahadev
>>>
>>>
>>> On 1/14/10 3:38 PM, "Kay Kay"  wrote:
>>>
>>>> Digging up some old tickets + search results - I am trying to understand
>>>> what current state is , w.r.t support for namespace partitioning in
>>>> zookeeper.  Is it already in / any tickets-mailing lists to understand
>>>> the current state.
>>>>
>>>
>>
>>
>
>


Re: Question regarding Membership Election

2010-01-14 Thread Henry Robinson
Hi -

If you put all your voting nodes in one datacenter, that datacenter becomes
a 'single point of failure' for the cluster. If it gets cut off from any
other datacenters, the cluster will not be available to those datacenters.

If you want to withstand the failure of datacenters, then you need voting
members inside every datacenter. Observers can't suddenly become voting
members.

You can't even put 'dormant' voting members (that never bother to vote) in
your other datacenters because you would need a quorum of them to continue
after the original datacenter failed. And if this was true, the original
datacenter would not, by construction, contain a quorum of voting nodes. So
you'd still have to vote outside the cluster.
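
The arithmetic behind this is easy to check. A small Python sketch, where the
datacenter names and the survives_loss helper are made up for the example and
observers are simply not counted because they do not vote:

```python
def survives_loss(voters_per_dc, lost_dc):
    """True if the voters outside lost_dc still form a majority of all voters.

    voters_per_dc maps datacenter name -> number of voting members there.
    """
    total = sum(voters_per_dc.values())
    remaining = total - voters_per_dc.get(lost_dc, 0)
    return remaining >= total // 2 + 1


# All three voters in one datacenter, observers only in the other.
single_dc = {"dc1": 3, "dc2": 0}
# Five voters spread so that no datacenter holds a majority on its own.
spread = {"dc1": 2, "dc2": 2, "dc3": 1}

single_dc_survives = survives_loss(single_dc, "dc1")
spread_survives = all(survives_loss(spread, dc) for dc in spread)
```

The second layout survives the loss of any one datacenter, but only because
votes cross datacenter boundaries on every write.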

Henry

2010/1/14 Vijay 

> Hi,
>
> I read about observers in other datacenters.
>
> My question is: I don't want voting across the datacenters (so I will use
> observers), but at the same time, when a DC goes down I don't want to lose
> the cluster. What's the solution for that?
>
> I have to have 3 nodes in the primary DC to tolerate 1 node failure. That's
> fine, but what about the other DC? How many nodes, and how will I make it
> work?
>
> Regards,
> 
>


Re: Killing a zookeeper server

2010-01-12 Thread Henry Robinson
Hi Adam -

As long as a quorum of servers is running, ZK will be live. With majority
quorums, 2/3 is enough to keep going. In general, if fewer than half your
nodes have failed, ZK will keep on keeping on.

The main concern with a cluster of 2/3 machines is that a single further
failure will bring down the whole cluster.
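
For reference, the majority arithmetic as a short Python sketch (the helper
names are made up for the example):

```python
def quorum_size(n):
    """Smallest number of servers that is more than half of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many servers can be down while a majority is still up."""
    return n - quorum_size(n)


def is_live(n, failed):
    """True if an n-server ensemble with 'failed' servers down has a quorum."""
    return n - failed >= quorum_size(n)


three = tolerated_failures(3)     # one failure is fine, a second is not
five = tolerated_failures(5)
twelve = tolerated_failures(12)   # Nick's ensemble from earlier in the thread
```

An even-sized ensemble tolerates no more failures than the odd size just
below it, so a fourth server buys nothing over three.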

Henry

2010/1/12 Adam Rosien 

> I have a related question: what's the behavior of a cluster of 3 when
> one is down? I've tried it and a leader is elected, but are there any
> other caveats for this situation?
>
> .. Adam
>
> On Tue, Jan 12, 2010 at 2:40 PM, Patrick Hunt  wrote:
> > 12 servers? That's a lot; if you don't mind my asking, why so many?
> > Typically we recommend 5 - that way you can have one down for maintenance
> > and still have a failure that doesn't bring down the cluster.
> >
> > The "electing a leader" is probably the restarted machine attempting to
> > re-join the ensemble (it should join as a follower if you have a leader
> > already elected, given that its xid is behind the existing leader). Hard
> > to tell though without the logs.
> >
> > You might also be seeing the initLimit exceeded; is the data you are
> > storing in ZK large? Or perhaps network connectivity is slow?
> > http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
> > again the logs would give some insight on this.
> >
> >
> > Patrick
> >
> > Nick Bailey wrote:
> >>
> >> We are running zookeeper 3.1.0
> >>
> >> Recently we noticed the cpu usage on our machines becoming
> >> increasingly high and we believe the cause is
> >>
> >> https://issues.apache.org/jira/browse/ZOOKEEPER-427
> >>
> >> However our solution when we noticed the problem was to kill the
> >> zookeeper process and restart it.
> >>
> >> After doing that though it looks like the newly restarted zookeeper
> >> server is continually attempting to elect a leader even though one
> >> already exists.
> >>
> >> The process responds with 'imok' when asked, but the stat command
> >> returns 'ZooKeeperServer not running'.
> >>
> >> I believe that killing the current leader should trigger all servers
> >> to do an election and solve the problem, but I'm not sure. Should
> >> that be the course of action in this situation?
> >>
> >> Also we have 12 servers, but 5 are currently not running according to
> >> stat.  So I guess this isn't a problem unless we lose another one.
> >> We have plans to upgrade zookeeper to solve the cpu issue but haven't
> >> been able to do that yet.
> >>
> >> Any help appreciated, Nick Bailey
> >>
> >
>


Re: Return data size in zkpython

2009-12-15 Thread Henry Robinson
Hi -

See https://issues.apache.org/jira/browse/ZOOKEEPER-627, and the attached
patch. I've upped the limit to a 1MB buffer. Also I've added a fourth
parameter to zookeeper.get - if you set this integer parameter to the size
of the buffer you are expecting, zkpython will return no more than this many
bytes.

Thanks again for flagging this up.

cheers,
Henry

On Tue, Dec 15, 2009 at 4:43 PM, Henry Robinson  wrote:

> Hey Rich -
>
> That's a really dumb restriction :) I'll open a JIRA and get it fixed asap.
>
> Thanks for the report!
>
> Henry
>
>
> On Tue, Dec 15, 2009 at 4:38 PM, Rich Schumacher wrote:
>
>> Hey all,
>>
>> I'm working on using ZooKeeper for an internal application at Digg.  I've
>> been using the zkpython package and I just noticed that the data I was
>> receiving from a zookeeper.get() call was being truncated.  After some quick
>> digging I found that zookeeper.c limits the data returned to 512 characters
>> (see
>> http://svn.apache.org/viewvc/hadoop/zookeeper/tags/release-3.2.2/src/contrib/zkpython/src/c/zookeeper.c?view=markup
>> line 855).
>>
>> Is there a reason for this?  The only information regarding node size that
>> I've read is that it should not exceed 1MB so this limit seems a bit
>> arbitrary and restrictive.
>>
>> Thanks for the great work!
>>
>> Rich
>
>
>


Re: Return data size in zkpython

2009-12-15 Thread Henry Robinson
Hey Rich -

That's a really dumb restriction :) I'll open a JIRA and get it fixed asap.

Thanks for the report!

Henry

On Tue, Dec 15, 2009 at 4:38 PM, Rich Schumacher wrote:

> Hey all,
>
> I'm working on using ZooKeeper for an internal application at Digg.  I've
> been using the zkpython package and I just noticed that the data I was
> receiving from a zookeeper.get() call was being truncated.  After some quick
> digging I found that zookeeper.c limits the data returned to 512 characters
> (see
> http://svn.apache.org/viewvc/hadoop/zookeeper/tags/release-3.2.2/src/contrib/zkpython/src/c/zookeeper.c?view=markup
> line 855).
>
> Is there a reason for this?  The only information regarding node size that
> I've read is that it should not exceed 1MB so this limit seems a bit
> arbitrary and restrictive.
>
> Thanks for the great work!
>
> Rich


Re: Starting Zookeeper on Amazon EC2

2009-12-09 Thread Henry Robinson
Nearly! 1+2 are correct, but you also need to start ZooKeeper on all three
instances with bin/zkServer.sh start.

Henry

On Wed, Dec 9, 2009 at 11:00 AM, Something Something <
mailinglist...@gmail.com> wrote:

> Now that I have your attention..next question... :)
>
> Now I would like to start a Zookeeper Quorum on 3 EC Instances.  Read the
> doc regarding... "Running Replicated ZooKeeper".  It says "all servers in
> the quorum should have the same configuration file"..  Does this mean... I
> should..
>
> 1)  Download & Install ZooKeeper on all 3 instances (at the same location.)
> 2)  Save the same zoo.cfg in /conf for all 3 instances.
> 3)  On one instance (Master?), run...
>
> bin/zkServer.sh start
>
> Would that start ZooKeeper on all 3 instances?  Thanks for the help.
>
>
> On Wed, Dec 9, 2009 at 10:24 AM, Something Something <
> mailinglist...@gmail.com> wrote:
>
> > Switched to 3.2.1.  Much better.  Got a command prompt.  Thank you both.
> >
> >
> > On Wed, Dec 9, 2009 at 10:09 AM, Henry Robinson wrote:
> >
> >> The 3.2.1 command line is a lot nicer (has an actual prompt, tab
> >> auto-completion, shows your connection status etc) - if you can upgrade
> >> to 3.2.1 which is a good deal more modern, I would recommend it. If I
> >> recall correctly, there was no prompt in 3.1.1...
> >>
> >> Henry
> >>
> >> On Wed, Dec 9, 2009 at 9:36 AM, Something Something <
> >> mailinglist...@gmail.com> wrote:
> >>
> >> > Without -server made some progress, but don't see a command prompt.
> >> > Shouldn't I see one?
> >> >
> >> > This is what I see:
> >> > 2009-12-09 17:27:56,709 - INFO  [main:zookee...@341] - Initiating client
> >> > connection, host=127.0.0.1:2181 sessionTimeout=5000
> >> > watcher=org.apache.zookeeper.zookeepermain$mywatc...@32fb4f
> >> > 2009-12-09 17:27:56,710 - INFO  [main:clientc...@91] -
> >> > zookeeper.disableAutoWatchReset is false
> >> > 2009-12-09 17:27:56,792 - INFO  [main-SendThread:clientcnxn$sendthr...@800]
> >> > - Attempting connection to server /127.0.0.1:2181
> >> > 2009-12-09 17:27:56,802 - INFO  [main-SendThread:clientcnxn$sendthr...@716]
> >> > - Priming connection to java.nio.channels.SocketChannel[connected
> >> > local=/127.0.0.1:49619 remote=/127.0.0.1:2181]
> >> > 2009-12-09 17:27:56,806 - INFO  [main-SendThread:clientcnxn$sendthr...@868]
> >> > - Server connection successful
> >> > WatchedEvent: Server state change. New state: SyncConnected
> >> >
> >> >
> >> > Should I just use 3.2.1 version?
> >> >
> >> >
> >> >
> >> > On Wed, Dec 9, 2009 at 9:20 AM, Mahadev Konar 
> >> > wrote:
> >> >
> >> > > Hi,
> >> > >  Can you try this?
> >> > >
> >> > > bin/zkCli.sh 127.0.0.1:2181
> >> > >
> >> > > The -server command was added later as far as I remember.
> >> > >
> >> > > Thanks
> >> > > mahadev
> >> > >
> >> > >
> >> > >
> >> > > On 12/9/09 9:05 AM, "Something Something" wrote:
> >> > >
> >> > > > I am trying to start ZooKeeper on an EC2 instance.  Here's what I did:
> >> > > >
> >> > > > 1)  Downloaded & Unpacked ZooKeeper 3.1.1 on EC2 instance.
> >> > > > 2)  cp /conf/zoo_sample.cfg /conf/zoo.cfg
> >> > > > 3)  Changed the dataDir path to point to my EBS volume.
> >> > > > 4)  In one command window, ran /bin/zkServer.sh start
> >> > > > (The last message I see is... "Snapshotting: 0")
> >> > > >
> >> > > > 5)  Opened another command window, and ran jps
> >> > > > (This shows a new process called, QuorumPeerMain.  That's the only one I
> >> > > > see.)
> >> > > >
> >> > > > 6)  As per documentation, tried
> >> > > >
> >> > > > bin/zkCli.sh -server 127.0.0.1:2181
> >> > > >
> >> > > > (This gives me IOException: USAGE)
> >> > > >
> >> > > > 7) So I ran:
> >> > > >
> >> > > > bin/zkCli.sh -server 127.0.0.1:2181 ls
> >> > > >
> >> > > > Got UnknownHostException: -server
> >> > > >
> >> > > > 8)  So I tried various ways of specifying IP address in EC2, such as:
> >> > > >
> >> > > > 10.xx.xx.xx
> >> > > > ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
> >> > > > domU-12-31-xx-xx-xx-xx.compute-1.internal
> >> > > > domU-12-31-xx-xx-xx-xx
> >> > > >
> >> > > > None of them worked.  Keep getting UnknownHostException.
> >> > > >
> >> > > > What am I doing wrong.  Please help.  Thanks.
> >> > >
> >> > >
> >> >
> >>
> >
> >
>


Re: Starting Zookeeper on Amazon EC2

2009-12-09 Thread Henry Robinson
The 3.2.1 command line is a lot nicer (has an actual prompt, tab
auto-completion, shows your connection status etc) - if you can upgrade to
3.2.1 which is a good deal more modern, I would recommend it. If I recall
correctly, there was no prompt in 3.1.1...

Henry

On Wed, Dec 9, 2009 at 9:36 AM, Something Something <
mailinglist...@gmail.com> wrote:

> Without -server made some progress, but don't see a command prompt.
> Shouldn't I see one?
>
> This is what I see:
> 2009-12-09 17:27:56,709 - INFO  [main:zookee...@341] - Initiating client
> connection, host=127.0.0.1:2181 sessionTimeout=5000
> watcher=org.apache.zookeeper.zookeepermain$mywatc...@32fb4f
> 2009-12-09 17:27:56,710 - INFO  [main:clientc...@91] -
> zookeeper.disableAutoWatchReset is false
> 2009-12-09 17:27:56,792 - INFO  [main-SendThread:clientcnxn$sendthr...@800]
> - Attempting connection to server /127.0.0.1:2181
> 2009-12-09 17:27:56,802 - INFO  [main-SendThread:clientcnxn$sendthr...@716]
> - Priming connection to java.nio.channels.SocketChannel[connected
> local=/127.0.0.1:49619 remote=/127.0.0.1:2181]
> 2009-12-09 17:27:56,806 - INFO  [main-SendThread:clientcnxn$sendthr...@868]
> - Server connection successful
> WatchedEvent: Server state change. New state: SyncConnected
>
>
> Should I just use 3.2.1 version?
>
>
>
> On Wed, Dec 9, 2009 at 9:20 AM, Mahadev Konar 
> wrote:
>
> > Hi,
> >  Can you try this?
> >
> > bin/zkCli.sh 127.0.0.1:2181
> >
> > The -server command was added later as far as I remember.
> >
> > Thanks
> > mahadev
> >
> >
> >
> > On 12/9/09 9:05 AM, "Something Something" 
> > wrote:
> >
> > > I am trying to start ZooKeeper on an EC2 instance.  Here's what I did:
> > >
> > > 1)  Downloaded & Unpacked ZooKeeper 3.1.1 on EC2 instance.
> > > 2)  cp /conf/zoo_sample.cfg /conf/zoo.cfg
> > > 3)  Changed the dataDir path to point to my EBS volume.
> > > 4)  In one command window, ran /bin/zkServer.sh start
> > > (The last message I see is... "Snapshotting: 0")
> > >
> > > 5)  Opened another command window, and ran jps
> > > (This shows a new process called, QuorumPeerMain.  That's the only one I
> > > see.)
> > >
> > > 6)  As per documentation, tried
> > >
> > > bin/zkCli.sh -server 127.0.0.1:2181
> > >
> > > (This gives me IOException: USAGE)
> > >
> > > 7) So I ran:
> > >
> > > bin/zkCli.sh -server 127.0.0.1:2181 ls
> > >
> > > Got UnknownHostException: -server
> > >
> > > 8)  So I tried various ways of specifying IP address in EC2, such as:
> > >
> > > 10.xx.xx.xx
> > > ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
> > > domU-12-31-xx-xx-xx-xx.compute-1.internal
> > > domU-12-31-xx-xx-xx-xx
> > >
> > > None of them worked.  Keep getting UnknownHostException.
> > >
> > > What am I doing wrong.  Please help.  Thanks.
> >
> >
>


Re: Observers!

2009-11-18 Thread Henry Robinson
Thanks! Also thanks are due to the entire ZK committer team who helped
enormously in getting the patch into shape.

Since the JIRA is now a long and complicated read, I want to summarise a
couple of important points here (although the docs have most of this
information).

1. Observers must currently be used with electionAlg=0; this is due to a
limitation of the other election algorithms which is being removed in
another JIRA.
2. This is only the 'core functionality' patch - you can use Observers in
your ensembles as of this commit (and I would love to hear experiences from
people who do so), but there's more to come in terms of optimisations and
the sanding down of some rough edges. In particular, the configuration is a
bit cumbersome and we already have a JIRA open to address that.
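
For readers wondering what the configuration looks like, here is a rough sketch that renders a zoo.cfg for an ensemble with Observers. It assumes the syntax from the released Observers documentation (a peerType=observer line on the observer itself and an :observer suffix on its server line), which may differ in detail from the patch as committed here. The host names, ports and dataDir are hypothetical, and the election_alg argument is only there to reflect point 1 above.

```python
def render_zoo_cfg(voters, observers, is_observer=False, election_alg=None):
    """Render zoo.cfg text for one server of a voters-plus-observers ensemble."""
    lines = [
        "tickTime=2000",
        "initLimit=5",
        "syncLimit=2",
        "clientPort=2181",
        "dataDir=/var/zookeeper",  # hypothetical path
    ]
    if election_alg is not None:
        lines.append("electionAlg=%d" % election_alg)
    if is_observer:
        # Only the observer's own config file carries this line.
        lines.append("peerType=observer")
    server_id = 1
    for host in voters:
        lines.append("server.%d=%s:2888:3888" % (server_id, host))
        server_id += 1
    for host in observers:
        # Every server's config marks the observer entries the same way.
        lines.append("server.%d=%s:2888:3888:observer" % (server_id, host))
        server_id += 1
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_zoo_cfg(["zk1", "zk2", "zk3"], ["edge1"],
                         is_observer=True, election_alg=0))
```

The server list is identical on every machine; only the peerType line differs between voters and observers.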

cheers,
Henry


On Wed, Nov 18, 2009 at 3:35 PM, Gustavo Niemeyer wrote:

> 
> r881882 | mahadev | 2009-11-18 13:06:39 -0600 (Wed, 18 Nov 2009) | 1 line
>
> ZOOKEEPER-368. Observers: core functionality (henry robinson via mahadev)
>
>
> Sweet!  Congratulations, and thanks Henry.
>
>
> --
> Gustavo Niemeyer
> http://niemeyer.net
>


Re: Authentication, encryption, and dynamic membership

2009-11-10 Thread Henry Robinson
Hi Gustavo -

I can't speak as to the other JIRAs, but ZK-107 (dynamic membership) is
still being worked on by me. This is a very large change to the ZK codebase,
so I can't see it getting in really before 4.0, although the committers may
view things differently.

If you have a pressing need for the feature, the mailing list archives
contain suggestions of how to change your cluster on the fly by doing a
rolling restart of your nodes with a new configuration.

Henry

On Tue, Nov 10, 2009 at 12:57 PM, Gustavo Niemeyer wrote:

> Dear ZooKeepers,
>
> I'm quite interested in the features related to inter-server
> authentication, encryption, and dynamic membership.  I *think* the
> right JIRAs are 107 and 236.  Are these features likely to see some
> activity in the upcoming releases, according to existing roadmaps?
>
> Thanks in advance,
>
> --
> Gustavo Niemeyer
> http://niemeyer.net
>


Re: ApacheCon 2009 Meetup talk, also ZooKeeper packages available for download

2009-11-08 Thread Henry Robinson
At the same event, I gave a presentation on two JIRAs I've been working on -
observers and dynamic ensembles. The slides are up on Slideshare here:
http://www.slideshare.net/cloudera/zookeeper-futures, and I will try to get
them uploaded to the wiki page.

I was also able to announce ZooKeeper packages as part of Cloudera's
Distribution for Hadoop - a free, stable, package-based distribution of
Mapred, HDFS, Hive, Pig and now ZooKeeper that we maintain at Cloudera for
users who want the convenience of using their package-management tools and
service frameworks to run Hadoop projects. The latest release of the
ZooKeeper RPMs is here: http://archive.cloudera.com/redhat/cdh/unstable/,
with tarball and Debian packages to follow shortly. We're hoping to add a
lot of improvements over the coming weeks. If you have any questions about
the packages, please shoot me a private e-mail unless you think the whole
list would be interested.

cheers,
Henry

On Fri, Nov 6, 2009 at 6:40 PM, Mahadev Konar  wrote:

> Hi all,
>  I had given a brief overview of ZooKeeper and BookKeeper at ApacheCon
> meetup this week. The talk is uploaded at
>
> http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations
>
> In case you guys are interested.
>
> Thanks
> mahadev
>
>


ZooKeeper talks at the post-Apachecon Hadoop Meetup tonight

2009-11-05 Thread Henry Robinson
Apologies for the late notice, but I wanted to advertise a pair of short
talks by Mahadev and myself at the Hadoop Meetup tonight in Oakland:

Mahadev will be giving a broad overview of ZooKeeper and talking about its
uses inside and outside of Yahoo!, and I will be talking about upcoming
features that I've been working on including observers and dynamic
ensembles.

There will be other talks as well on HDFS and other topics.

Rumour has it there will be beer (but don't hold me to that!).

The meetup is at the Marriott Oakland City Center. A Google map is here:
http://maps.google.com/maps?oe=utf-8&client=firefox-a&ie=UTF8&q=merriot+oakland&fb=1&gl=us&hq=merriot&hnear=oakland&cid=0,0,204705793290918968&ei=k3jzSqbJA4WksgPZ58gd&ved=0CAwQnwIwAA&ll=37.803087,-122.272575&spn=0.009766,0.022724&z=16&iwloc=A

I'll post my slides in the next day or so. Hope some of you can make it!

cheers,
Henry


Henry Robinson
Software Engineer
Cloudera


Re: API for node entry to the cluster.

2009-11-05 Thread Henry Robinson
Hi -

Yes there are future plans. See
https://issues.apache.org/jira/browse/ZOOKEEPER-107. I have code written for
this that works but is not rock-solid yet.

cheers,
Henry

On Thu, Nov 5, 2009 at 11:02 AM, Avinash Lakshman <
avinash.laksh...@gmail.com> wrote:

> Hi All
>
> Is it possible to remove nodes and add nodes dynamically to the ZK cluster
> via API? Any plans in the future to do this?
>
> TIA
> A
>


Re: Cluster Configuration Issues

2009-10-22 Thread Henry Robinson
Yeah - I thought this was it: you've missed the leading forward slash on
home/mark/zookeeper in dataDir (this turned up in your exception message).
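
A small sketch of the kind of check that would catch this before starting the server. The function name and the two rules are illustrative assumptions, not part of ZooKeeper. Run against the config quoted below, it flags the relative dataDir and also the "tickTime-2000" line, which has a hyphen where "=" is expected (possibly just a slip when pasting).

```python
import posixpath


def lint_zoo_cfg(text):
    """Return a list of human-readable problems found in zoo.cfg text."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=" not in line:
            problems.append("line %d: no '=' in %r" % (lineno, line))
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "dataDir" and not posixpath.isabs(value):
            problems.append(
                "line %d: dataDir %r is not an absolute path" % (lineno, value))
    return problems


if __name__ == "__main__":
    sample = "\n".join([
        "tickTime-2000",
        "dataDir=home/mark/zookeeper",
        "clientPort=2181",
        "initLimit=5",
        "syncLimit=2",
        "server.1= hermes:2888:3888",
        "server.2= leela:2888:3888",
    ])
    for problem in lint_zoo_cfg(sample):
        print(problem)
```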

On Thu, Oct 22, 2009 at 2:55 PM, Mark Vigeant
wrote:

> Yeah I just figured out the problem with zoocfg.py
>
> I am running as the same user who created myid. Here's my config:
>
> zoo.cfg
>
> tickTime-2000
> dataDir=home/mark/zookeeper
> clientPort=2181
> initLimit=5
> syncLimit=2
> server.1= hermes:2888:3888
> server.2= leela:2888:3888
>
> on the machines hermes and leela I've put myid files in
> /home/mark/zookeeper
> with the numbers 1 and 2 respectively
> -----Original Message-----
> From: Henry Robinson [mailto:he...@cloudera.com]
> Sent: Thursday, October 22, 2009 5:43 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Cluster Configuration Issues
>
> Hi Mark -
>
> The Python error relates to not being able to find the zoocfg module - is
> zoocfg.py in the same directory as zkconf.py?
>
> Another couple of questions - are you running zookeeper as the same user who
> created myid? Can you post your entire configuration file please - copy and
> paste?
>
> Henry
>
> On Thu, Oct 22, 2009 at 2:32 PM, Mark Vigeant
> wrote:
>
> > Before I make the Jira, I am trying to go with Ted's advice to use the
> > python script.
> >
> > Unfortunately I'm relatively unfamiliar with python so I'm having trouble
> > running it.
> >
> > When I execute "Python zkconf.py" on the command line it tells me:
> > Traceback (most recent call last):
> >   File "zkconf.py", line 27, in <module>
> >     from zoocfg import zoocfg
> > ImportError: No module named zoocfg
> >
> > The same error comes when I try to call zkcfg.py from the python interface
> > and when I try running
> > Python zkconf.py -help /home/hadoop/zookeeper-3.2.1/ /home/hadoop (as I
> > gathered from the Usage). Any suggestions?
> >
> > Also, I've been using zookeeper 3.2.1
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> > Sent: Thursday, October 22, 2009 4:33 PM
> > To: zookeeper-user@hadoop.apache.org
> > Subject: Re: Cluster Configuration Issues
> >
> > Try Patrick's utility for creating the config files and compare the result
> > to your hand-made files.
> >
> > On Thu, Oct 22, 2009 at 1:04 PM, Mark Vigeant
> > wrote:
> >
> > > The file contains the number 1 and nothing else. My other node has the
> > > number 2 (I only have 2 machines right now, I know it makes more sense to
> > > run an odd number of zookeeper nodes but I just want to make sure it works
> > > first). Any suggestions?
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>


Re: Cluster Configuration Issues

2009-10-22 Thread Henry Robinson
Hi Mark -

The Python error relates to not being able to find the zoocfg module - is
zoocfg.py in the same directory as zkconf.py?

Another couple of questions - are you running zookeeper as the same user who
created myid? Can you post your entire configuration file please - copy and
paste?

Henry

On Thu, Oct 22, 2009 at 2:32 PM, Mark Vigeant
wrote:

> Before I make the Jira, I am trying to go with Ted's advice to use the
> python script.
>
> Unfortunately I'm relatively unfamiliar with python so I'm having trouble
> running it.
>
> When I execute "Python zkconf.py" on the command line it tells me:
> Traceback (most recent call last):
>   File "zkconf.py", line 27, in <module>
>     from zoocfg import zoocfg
> ImportError: No module named zoocfg
>
> The same error comes when I try to call zkcfg.py from the python interface
> and when I try running
> Python zkconf.py -help /home/hadoop/zookeeper-3.2.1/ /home/hadoop (as I
> gathered from the Usage). Any suggestions?
>
> Also, I've been using zookeeper 3.2.1
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> Sent: Thursday, October 22, 2009 4:33 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Cluster Configuration Issues
>
> Try Patrick's utility for creating the config files and compare the result
> to your hand-made files.
>
> On Thu, Oct 22, 2009 at 1:04 PM, Mark Vigeant
> wrote:
>
> > The file contains the number 1 and nothing else. My other node has the
> > number 2 (I only have 2 machines right now, I know it makes more sense to
> > run an odd number of zookeeper nodes but I just want to make sure it works
> > first). Any suggestions?
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: Cluster Configuration Issues

2009-10-20 Thread Henry Robinson
Hi Mark -

You should create the myid file yourself, as you have done. What errors are
you seeing that lead you to think the id is not being read correctly?
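
For reference, a minimal sketch of creating the file by hand. The helper names are made up for illustration; the only convention assumed is the one already described in this thread, namely that myid lives in dataDir and contains just the server's number, matching the N in its server.N line.

```python
import os
import tempfile


def write_myid(data_dir, server_id):
    """Write the server id to <data_dir>/myid and return the file path."""
    path = os.path.join(data_dir, "myid")
    with open(path, "w") as handle:
        # The file holds the id and nothing else.
        handle.write("%d\n" % server_id)
    return path


def read_myid(data_dir):
    """Read the id back, ignoring surrounding whitespace."""
    with open(os.path.join(data_dir, "myid")) as handle:
        return int(handle.read().strip())


if __name__ == "__main__":
    # A temporary directory stands in for the real dataDir here.
    demo_dir = tempfile.mkdtemp()
    print(write_myid(demo_dir, 1))
    print(read_myid(demo_dir))
```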

cheers,
Henry

On Tue, Oct 20, 2009 at 10:12 AM, Mark Vigeant  wrote:

> Hey-
>
> So I'm trying to run hbase on 4 nodes, and in order to do that I need to
> run zookeeper in replicated mode (I could have hbase run the quorum for me,
> but it's suggested that I don't).
>
> I have an issue though.  For some reason the id I'm assigning each server
> in the file "myid" in the assigned data directory is not getting read. I
> feel like another id is being created and put somewhere else. Does anyone
> have any tips on starting a zookeeper quorum? Do I create the myid file
> myself or do I edit one once it is created by zookeeper?
>
> This is what my  config looks like:
> ticktime=2000
> dataDir=/home/hadoop/zookeeper
> clientPort=2181
> initLimit=5
> syncLimit=2
> server.1=hadoop1:2888:3888
>
> The name of my machine is hadoop1, with user name hadoop. In
> /home/hadoop/zookeeper I've created a myid file with the number 1 in it.
>
> Mark Vigeant
> RiskMetrics Group, Inc.
>
>


Re: UnsupportedClassVersionError when building zkpython

2009-10-12 Thread Henry Robinson
Hi Steven -

I also see that problem if I build on my Mac sometimes. I'm looking into a
proper fix, but for now you can do:

ant compile
sudo python src/python/setup.py install

to build and install manually. If you have a moment, can you let me know
which ant you are using? (ant -version)

Thanks for bringing this up!

Henry

On Mon, Oct 12, 2009 at 9:06 PM, Steven Wong  wrote:

> Any idea how I can get it to build? ZooKeeper 3.2.1 (tarball release) on
> Mac OS X 10.5.8. Thanks.
>
>
>
> sw...@lgmac-swong:~/lib/zookeeper/src/contrib/zkpython 9173> sudo ant install
> Buildfile: build.xml
>
> BUILD FAILED
> java.lang.UnsupportedClassVersionError: Bad version number in .class file
>        at java.lang.ClassLoader.defineClass1(Native Method)
>        at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
>        at org.apache.tools.ant.AntClassLoader.defineClassFromData(AntClassLoader.java:1146)
>        at org.apache.tools.ant.AntClassLoader.getClassFromStream(AntClassLoader.java:1324)
>        at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1388)
>        at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1341)
>        at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1088)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
>        at org.apache.tools.ant.taskdefs.Available.checkClass(Available.java:446)
>        at org.apache.tools.ant.taskdefs.Available.eval(Available.java:273)
>        at org.apache.tools.ant.taskdefs.Available.execute(Available.java:225)
>        at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288)
>        at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:585)
>        at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
>        at org.apache.tools.ant.Task.perform(Task.java:348)
>        at org.apache.tools.ant.Target.execute(Target.java:357)
>        at org.apache.tools.ant.helper.ProjectHelper2.parse(ProjectHelper2.java:142)
>        at org.apache.tools.ant.ProjectHelper.configureProject(ProjectHelper.java:93)
>        at org.apache.tools.ant.Main.runBuild(Main.java:743)
>        at org.apache.tools.ant.Main.startAnt(Main.java:217)
>        at org.apache.tools.ant.launch.Launcher.run(Launcher.java:257)
>        at org.apache.tools.ant.launch.Launcher.main(Launcher.java:104)
>
> Total time: 0 seconds
>
> sw...@lgmac-swong:~/lib/zookeeper/src/contrib/zkpython 9178> sudo javac -version
> javac 1.6.0_07
>
>
>
>


Re: Error running contrib tests

2009-09-23 Thread Henry Robinson
Hi Erik -

Notwithstanding the test issue (which, as Patrick says, is a bit tricky to
get running and caused by a slightly different issue) - it seems that Python
can't find the C ZooKeeper library which is a requirement for the Python
module.

If you compiled the C library as usual then libzookeeper_mt.so.2 should
be in /usr/local/lib. Can you check this? If not, we need to find out where
it's getting put. It seems like it's not in the library path. Then try
running python with LD_LIBRARY_PATH set to the directory that contains
libzookeeper_mt.so.2 and trying the import zookeeper step again. Also, if you
saw any errors when building the python module or C module, send them along.
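
A quick way to do that check from Python itself, as a rough sketch: the function below just looks for the file in a list of directories. The directory list is an assumption for illustration (the two common install locations plus whatever is already in LD_LIBRARY_PATH) and is not the full search the dynamic linker performs.

```python
import os


def find_shared_lib(name, search_dirs):
    """Return the first directory in search_dirs that contains name, else None."""
    for directory in search_dirs:
        if os.path.isfile(os.path.join(directory, name)):
            return directory
    return None


if __name__ == "__main__":
    # Assumed candidate directories; extend with wherever "make install" put the C client.
    env_dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    candidates = ["/usr/local/lib", "/usr/lib"] + env_dirs
    found = find_shared_lib("libzookeeper_mt.so.2", candidates)
    if found is None:
        print("libzookeeper_mt.so.2 not found in %s" % candidates)
    else:
        print("found in %s; try LD_LIBRARY_PATH=%s python" % (found, found))
```

If the directory it prints is not on the linker's path, exporting it via LD_LIBRARY_PATH before starting python is the experiment suggested above.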

Let me know how you get on!

Henry


On Wed, Sep 23, 2009 at 12:07 AM, Patrick Hunt  wrote:

> Erik, I think you ran into this:
> https://issues.apache.org/jira/browse/ZOOKEEPER-420
>
> Henry Robinson from Cloudera (cc'd) created the zkpython contrib, ccing him
> if he has a better way, but here's how I am able to run the tests w/o
> installing:
>
> I get around it by compiling src/c and then changing
> src/contrib/zkpython/src/python/setup.py
>
> from:
> library_dirs=["/usr/local/lib"]
> to:
> library_dirs=["/src/c/.libs"]
>
> then "ant compile" zkpython, move the zookeeper.so from build/contrib into
> zkpython/src/test, then I run the tests as:
>
>  LD_LIBRARY_PATH=/src/c/.libs/. ant test
>
>
> There are a few issues pending with zkpython, I'm hoping Henry can get back
> and address these for the next release.
> (such as https://issues.apache.org/jira/browse/ZOOKEEPER-510)
>
> Regards,
>
> Patrick
>
>
> Erik Holstad wrote:
>
>> Hi!
>> I am trying out the python bindings and I followed the guide on
>>
>> http://www.cloudera.com/blog/2009/05/28/building-a-distributed-concurrent-queue-with-apache-zookeeper/
>> Everything worked fine until the last step:
>>
>> Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:56)
>> [GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>>
>> >>> import zookeeper
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> ImportError: libzookeeper_mt.so.2: cannot open shared object file: No such
>> file or directory
>>
>> I figured that I did something wrong in my setup, so I tried to run the
>> contrib test and got:
>>
>> python-test:
>> [exec] Running src/test/clientid_test.py
>> [exec] Traceback (most recent call last):
>> [exec]   File "src/test/clientid_test.py", line 21, in <module>
>> [exec]     import zookeeper, zktestbase
>> [exec] ImportError: libzookeeper_mt.so.2: cannot open shared object
>> file: No such file or directory
>> [exec] Running src/test/connection_test.py
>> [exec] Traceback (most recent call last):
>> [exec]   File "src/test/connection_test.py", line 21, in <module>
>> [exec]     import zookeeper, zktestbase
>> [exec] ImportError: libzookeeper_mt.so.2: cannot open shared object
>> file: No such file or directory
>> [exec] Running src/test/create_test.py
>> [exec] Traceback (most recent call last):
>> [exec]   File "src/test/create_test.py", line 19, in <module>
>> [exec]     import zookeeper, zktestbase, unittest, threading
>> [exec] ImportError: libzookeeper_mt.so.2: cannot open shared object
>> file: No such file or directory
>> [exec] Running src/test/delete_test.py
>> [exec] Traceback (most recent call last):
>> [exec]   File "src/test/delete_test.py", line 19, in <module>
>> [exec]     import zookeeper, zktestbase, unittest, threading
>> [exec] ImportError: libzookeeper_mt.so.2: cannot open shared object
>> file: No such file or directory
>> [exec] Running src/test/exists_test.py
>> [exec] Traceback (most recent call last):
>> [exec]   File "src/test/exists_test.py", line 19, in <module>
>> [exec]     import zookeeper, zktestbase, unittest, threading
>> [exec] ImportError: libzookeeper_mt.so.2: cannot open shared object
>> file: No such file or directory
>> [exec] Running src/test/get_set_test.py
>> [exec] Traceback (most recent call last):
>> [exec]   File "src/test/get_set_test.py", line 19, in <module>
>> [exec] import zookeeper, zktestbase, unittest, threading
>> [exec] ImportError: libzookeeper_mt.so.2: cannot open shared object
>> file: No such file or directory
>>
>> BUILD FAILED
>> /home/erik/src/zookeeper-3.2.1/src/contrib/build.xml:48: The following error
>> occurred while executing this line:
>> /home/erik/src/zookeeper-3.2.1/src/contrib/zkpython/build.xml:63: exec
>> returned: 1
>>
>>
>> I ran this test from zookeeper/src/contrib with ant test
>>
>> Not sure if I'm doing something wrong or if this is a bug?
>>
>> Regards Erik
>>
>>


Re: Leader Elections

2009-07-20 Thread Henry Robinson
On Mon, Jul 20, 2009 at 7:50 PM, Todd Greenwood
wrote:

> Flavio, Ted, Henry, Scott, this would work perfectly well for my use case
> provided:
>
> SINGLE ENSEMBLE:
>GROUP A : ZK Servers w/ read/write AND Leader Elections
>GROUP B : ZK Servers w/ read/write W/O Leader Elections
>
> So, we can craft this via Observers and Hierarchical Quorum groups?
> Great. Problem solved.
>
> When will this be production ready? :o)
>

Looks to me like you don't even need hierarchical quorums for this - make
everyone in group B an Observer and you're done.

I've been working on this feature. Recently we've been discussing a
proof-of-concept patch on the JIRA. I have nearly finished a less rough
patch which I will submit for discussion and potentially commit this week.
At that point it would be extremely helpful if you could help test the
patch, and you can start considering it for production. To get into trunk I
will have to write a comprehensive test suite and update the documentation,
and then making sure all the boxes are ticked and no regressions are thrown
up can take a little while.

Henry




>
> 
>
> Scott brought up a multi-feature that is very interesting for me.
> Namely:
>
> 1. Offline ZK servers that sync & merge on reconnect
>
> The offline servers seems conceptually simple, it's kind of like a
> messaging system. However, the merge and resolve step when two servers
> reconnect might be challenging. Cool idea though.
>
> 2. Partial memory graph subscriptions
>
> The second idea is partial memory graph subscriptions. This would enable
> virtual ensembles to interact on the same physical ensemble. For my use
> case, this would prevent unnecessary cross talk between nodes on a WAN,
> allowing me to define the subsets of the memory graph that need to be
> replicated, and to whom. This would be a huge scalability win for WAN
> use cases.
>
> -Todd
>
> -----Original Message-----
> From: Scott Carey [mailto:sc...@richrelevance.com]
> Sent: Monday, July 20, 2009 11:00 AM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Leader Elections
>
> Observers would be awesome especially with a couple enhancements /
> extensions:
>
> An option for the observers to enter a special state if the WAN link
> goes down to the "master" cluster.  A read-only option would be great.
> However, allowing certain types of writes to continue on a limited basis
> would be highly valuable as well.  An observer could "own" a special
> node and its subnodes.  Only these subnodes would be writable by the
> observer when there was a session break to the master cluster, and the
> master cluster would take all the changes when the link is
> reestablished.  Essentially, it is a portion of the hierarchy that is
> writable only by a specific observer, and read-only for others.
> The purpose of this would be for when the WAN link goes down to the
> "master" ZKs for certain types of use cases - status updates or other
> changes local to the observer that are strictly read-only outside the
> Observer's 'realm'.
>
>
> On 7/19/09 12:16 PM, "Henry Robinson"  wrote:
>
> You can. See ZOOKEEPER-368 - at first glance it sounds like observers will
> be a good fit for your requirements.
>
> Do bear in mind that the patch on the jira is only for discussion purposes;
> I would not consider it currently fit for production use. I hope to put up a
> much better patch this week.
>
> Henry
>
> On Sat, Jul 18, 2009 at 7:38 PM, Ted Dunning 
> wrote:
>
> > Can you submit updates via an observer?
> >
> > On Sat, Jul 18, 2009 at 6:38 AM, Flavio Junqueira 
> > wrote:
> >
> > > 2- Observers: you could have one computing center containing an
> ensemble
> > > and observers around the edge just learning committed values.
> >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>
>


Re: Leader Elections

2009-07-20 Thread Henry Robinson
>
>
> >
> > -Todd
> >
> > -----Original Message-----
> > From: Scott Carey [mailto:sc...@richrelevance.com]
> > Sent: Monday, July 20, 2009 11:00 AM
> > To: zookeeper-user@hadoop.apache.org
> > Subject: Re: Leader Elections
> >
> > Observers would be awesome especially with a couple enhancements /
> > extensions:
> >
> > An option for the observers to enter a special state if the WAN link
> > goes down to the "master" cluster.  A read-only option would be great.
> > However, allowing certain types of writes to continue on a limited basis
> > would be highly valuable as well.  An observer could "own" a special
> > node and its subnodes.  Only these subnodes would be writable by the
> > observer when there was a session break to the master cluster, and the
> > master cluster would take all the changes when the link is
> > reestablished.  Essentially, it is a portion of the hierarchy that is
> > writable only by a specific observer, and read-only for others.
> > The purpose of this would be for when the WAN link goes down to the
> > "master" ZKs for certain types of use cases - status updates or other
> > changes local to the observer that are strictly read-only outside the
> > Observer's 'realm'.
> >
> >
> > On 7/19/09 12:16 PM, "Henry Robinson"  wrote:
> >
> > You can. See ZOOKEEPER-368 - at first glance it sounds like observers will
> > be a good fit for your requirements.
> >
> > Do bear in mind that the patch on the jira is only for discussion purposes;
> > I would not consider it currently fit for production use. I hope to put up a
> > much better patch this week.
> >
> > Henry
> >
> > On Sat, Jul 18, 2009 at 7:38 PM, Ted Dunning 
> > wrote:
> >
> >> Can you submit updates via an observer?
> >>
> >> On Sat, Jul 18, 2009 at 6:38 AM, Flavio Junqueira 
> >> wrote:
> >>
> >>> 2- Observers: you could have one computing center containing an
> > ensemble
> >>> and observers around the edge just learning committed values.
> >>
> >>
> >>
> >>
> >> --
> >> Ted Dunning, CTO
> >> DeepDyve
> >>
> >
> >
>
>


Re: Leader Elections

2009-07-19 Thread Henry Robinson
You can. See ZOOKEEPER-368 - at first glance it sounds like observers will
be a good fit for your requirements.

Do bear in mind that the patch on the jira is only for discussion purposes;
I would not consider it currently fit for production use. I hope to put up a
much better patch this week.

Henry

On Sat, Jul 18, 2009 at 7:38 PM, Ted Dunning  wrote:

> Can you submit updates via an observer?
>
> On Sat, Jul 18, 2009 at 6:38 AM, Flavio Junqueira 
> wrote:
>
> > 2- Observers: you could have one computing center containing an ensemble
> > and observers around the edge just learning committed values.
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: zookeeper on ec2

2009-07-06 Thread Henry Robinson
On Mon, Jul 6, 2009 at 10:16 PM, Ted Dunning  wrote:

> No.  This should not cause data loss.


> As soon as ZK cannot replicate changes to a majority of machines, it refuses
> to take any more changes.  This is old ground and is required for
> correctness in the face of network partition.  It is conceivable (barely)
> that *exactly* the minority that were behind were the survivors, but this is
> almost equivalent to a complete failure of the cluster choreographed in such
> a way that a few nodes come back from the dead just afterwards.  That could
> cause some "completed" transactions to disappear from the state,
> but at this level of massive failure, we have the same issues with any
> cluster.
>

Effectively, EC2 does not introduce any new failure modes but potentially
exacerbates some existing ones. If a majority of EC2 nodes fail (in the
sense that their hard drive images cannot be recovered), there is no way to
restart the cluster, and persistence is lost. As you say, this is highly
unlikely. If, for some reason, the quorums are set such that only a single
node failure could bring down the quorum (bad design, but plausible), this
failure is more likely.
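
To make the arithmetic concrete, a small sketch assuming plain majority quorums (the default; weighted or hierarchical quorums would change the numbers). The function names are just for illustration.

```python
def majority(ensemble_size):
    """Smallest number of servers that forms a majority quorum."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still available."""
    return ensemble_size - majority(ensemble_size)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print("%d servers: quorum %d, tolerates %d failure(s)"
              % (n, majority(n), tolerated_failures(n)))
```

Note that 2 servers tolerate no failures at all and 4 tolerate no more than 3 do, which is the usual argument for running an odd number of servers.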

EC2 just ups the stakes - crash failures are now potentially more dangerous
(bugs, packet corruption, rack local hardware failures etc all could cause
crash failures). It is common to assume that, notwithstanding a significant
physical event that wipes a number of hard drives, writes that are written
stay written. This assumption is sometimes false given certain choices of
filesystem. EC2 just gives us a few more ways for that not to be true.

I think it's more possible than one might expect to have a lagging minority
left behind - say they are partitioned from the majority by a malfunctioning
switch. They might all be lagging already as a result. Care must be taken
not to bring up another follower on the minority side to make it a majority,
else there are split-brain issues as well as the possibility of lost
transactions. Again, not *too* likely to happen in the wild, but these
permanently running services have a nasty habit of exploring the edge
cases...
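
To make the lagging-minority case concrete, here is a minimal sketch in Python. It is a toy model, not ZooKeeper code: the server names, the zxid values and the helper functions are all made up for illustration, and it only assumes that a write counts as committed once a majority has logged it.

```python
# Toy model (not ZooKeeper code): each server records the highest
# transaction id (zxid) it has logged. A write counts as committed once
# a majority of the ensemble has logged it.

def majority(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1

def last_committed(zxids):
    """Highest zxid that at least a majority of servers has logged."""
    ordered = sorted(zxids.values(), reverse=True)
    return ordered[majority(len(zxids)) - 1]

def recoverable_state(zxids, survivors):
    """Highest zxid held by any surviving server."""
    return max(zxids[s] for s in survivors)

# Five servers; s4 and s5 lag (say, behind a malfunctioning switch).
zxids = {"s1": 10, "s2": 10, "s3": 10, "s4": 7, "s5": 7}
committed = last_committed(zxids)

# Any surviving majority overlaps the majority that logged the last
# committed write, so nothing committed is lost.
assert recoverable_state(zxids, ["s3", "s4", "s5"]) >= committed

# If only the lagging minority survives, the committed tail is gone.
lost = committed - recoverable_state(zxids, ["s4", "s5"])
print("transactions lost:", lost)
```

In this model the two survivors cannot form a quorum on their own, which is the safe outcome. The danger described above is adding a fresh follower on their side: three of five would then count as a majority while agreeing on the older state.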


>
> To be explicit, you can cause any ZK cluster to back-track in time by doing
> the following:
>
...

>
> f) add new members of the cluster


Which is why care needs to be taken that the ensemble can't be expanded
without a current quorum. Dynamic membership doesn't save us when a majority fails -
the existence of a quorum is a liveness condition for ZK. To help with the
liveness issue we can sacrifice a little safety (see, e.g. vector clock
ordered timestamps in Dynamo), but I think that ZK is aimed at safety first,
liveness second. Not that you were advocating changing that, I'm just
articulating why correctness is extremely important from my perspective.

Henry


>
>
> At this point, you will have lost the transactions from (b), but I really,
> really am not going to worry about this happening either by plan or by
> accident.  Without steps (e) and (f), the cluster will tell you that it
> knows something is wrong and that it cannot elect a leader.  If you don't
> have *exact* coincidence of the survivor set and the set of laggards, then
> you won't have any data loss at all.
>
> You have to decide if this is too much risk for you.  My feeling is that it
> is OK level of correctness for conventional weapon fire control, but not
> for
> nuclear weapons safeguards.  Since my apps are considerably less sensitive
> than either of those, I am not much worried.


>
> On Mon, Jul 6, 2009 at 12:40 PM, Henry Robinson 
> wrote:
>
> > It seems like there is a
> > correctness issue: if a majority of servers fail, with the remaining
> > minority lagging the leader for some reason, won't the ensemble's current
> > state be forever lost?
> >
>


Re: zookeeper on ec2

2009-07-06 Thread Henry Robinson
On Mon, Jul 6, 2009 at 7:38 PM, Ted Dunning  wrote:

>
> I think that the misunderstanding is that this on-disk image is critical to
> cluster function.  It is not critical because it is replicated to all
> cluster members.  This means that any member can disappear and a new
> instance can replace it with no big cost other than the temporary load of
> copying the current snapshot from some cluster member.
>

This is an interesting way of doing things. It seems like there is a
correctness issue: if a majority of servers fail, with the remaining
minority lagging the leader for some reason, won't the ensemble's current
state be forever lost? This is akin to a majority of servers failing and
never recovering. ZK relies on the eventual liveness of a majority of its
servers; with EC2 it seems possible that that property might not be
satisfied.

(For majority, you can read 'quorum' under the flexible quorums scheme;
perhaps there is a way to devise a quorum scheme suitable for elastic
computing...)

Henry



>
> On Mon, Jul 6, 2009 at 11:33 AM, Mahadev Konar wrote:
>
> >  In the documentation of zookeeper, I have read that
> > > zookeeper saves snapshots of the in-memory data in the file system. Is
> > > that needed for recovery? Logically, it would be much easier for me if
> > > this is not the case.
> > Yes, zookeeper keeps persistent state on disk. This is used for recovery
> > and
> > correctness of zookeeper.
>


Re: Dynamic servers addition and persistent storage.

2009-07-01 Thread Henry Robinson
Hi Gustavo -

I hope to have a patch for both fairly soon. I should at least get ZK-368 to
a workable position this week, and ZK-107 will hopefully not be an enormous
amount of work on top of that. However, there will doubtless be some slack time
for picking up bugs etc. before it gets committed as it will be a reasonably
sized patch.

Out of interest, what's your application for this?

Henry

On Wed, Jul 1, 2009 at 4:01 PM, Gustavo Niemeyer wrote:

> Hey Henry,
>
> > We (and myself in particular) are working on dynamic cluster membership,
> see
> > https://issues.apache.org/jira/browse/ZOOKEEPER-107 and the related
> > https://issues.apache.org/jira/browse/ZOOKEEPER-368.
>
> That's fantastic news!  How do you feel this is going so far?  We
> might have an application for this pretty soon.
>
> --
> Gustavo Niemeyer
> http://niemeyer.net
>


Re: Dynamic servers addition and persistent storage.

2009-07-01 Thread Henry Robinson
Hi Maxime -

When a quorum of ZooKeeper servers have failed, the service stops being
available - you cannot read or write any item. Once a quorum returns to
operation, the ensemble recovers automatically and continues where it left
off. There is the same requirement that a quorum of servers must see every
write.
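
As a rough illustration of the availability rule, the sketch below assumes simple majority quorums (the default scheme). The function names are hypothetical and the code only does the arithmetic; it does not talk to a server.

```python
def quorum_size(ensemble_size):
    """Servers needed for a simple majority quorum."""
    return ensemble_size // 2 + 1

def service_available(ensemble_size, failed):
    """True while the servers still running can form a quorum."""
    return ensemble_size - failed >= quorum_size(ensemble_size)

# How many simultaneous failures each ensemble size can ride out.
for n in (3, 5, 7):
    tolerated = max(f for f in range(n + 1) if service_available(n, f))
    print(n, "servers: quorum", quorum_size(n), "tolerates", tolerated)
```

Once the failed servers come back and a quorum is reachable again, `service_available` flips back to true, which corresponds to the automatic recovery described above.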

We (and myself in particular) are working on dynamic cluster membership, see
https://issues.apache.org/jira/browse/ZOOKEEPER-107 and the related
https://issues.apache.org/jira/browse/ZOOKEEPER-368.

Henry

On Wed, Jul 1, 2009 at 3:07 PM, Maxime Caron  wrote:

> I was investigating scalaris (http://code.google.com/p/scalaris/) but
> found
> it does not support a persistent storage.
> In their faq they say it cant be done because they assume that a majority
> of
> the replicas of an item is always available.
> If this precondition is violated, a majority of the nodes with replicas of
> an item x is not available, the item cannot be changed. It is lost.
> Persistent storage cannot help directly.
> So i would like to understand if zookeeper works the same way or there is a
> recovery model for when the majority of the servers goes down and then back up.
>
>
> Unlike zookeeper with scalaris servers can be added or removed on the fly
> without any service downtime.
> From what i can understand in zookeeper you need to have a fixed server
> list
> that every body share.
> Would it be possible to add on the fly server addition and removal to
> zookeeper?
>


Re: General Question about Zookeeper

2009-06-25 Thread Henry Robinson
What else do you want to use ZK for - just leader election? It doesn't
require so much a centralised server (which implies kind of a single point
of failure) as a small amount of fixed infrastructure. If you have a highly
dynamic network - an ad-hoc network like a social net - ZK will likely not
be appropriate. There are leader election algorithms that work better in
totally ad-hoc networks, and other co-ordination models that are better
suited. In particular, you may not want persistence in the sense that later
instances of a consensus algorithm might not need to see the results of
previous ones, removing the need to keep logs synchronised.

However, if you have five or so servers that you can dedicate to
coordination, ZooKeeper should work very well. I'm really curious about your
use case - is there more you can explain?

Henry

On Thu, Jun 25, 2009 at 7:16 PM, Harold Lim  wrote:

>
> Hi Gustavo,
>
> Actually, in my case, we have a fully decentralized service. Something like
> where you have users in a social network. Originally, we were thinking of
> using a distributed consensus algorithm (e.g., Paxos) to perform some
> functionalities (e.g., leader election).
>
> Then, I read about ZooKeeper and was thinking of using ZooKeeper for leader
> election instead. However, that means that we're introducing a "central"
> server/service to the architecture.
>
> Currently, I'm just thinking of some of the original functionalities and
> how much of these functionalities I can offload to ZooKeeper, without
> breaking the original privacy/security motivation.
>
>
> -Harold
>
>
>
>
> --- On Thu, 6/25/09, Gustavo Niemeyer  wrote:
>
> > From: Gustavo Niemeyer 
> > Subject: Re: General Question about Zookeeper
> > To: zookeeper-user@hadoop.apache.org
> > Date: Thursday, June 25, 2009, 1:59 PM
> > Hey Harold,
> >
> > > I am interested in a security aspect of zookeeper,
> > where the clients and the servers don't necessarily belong
> > to the same "group". If a client creates a znode in the
> > zookeeper? Can the person, who owns the zookeeper server,
> > simply look at its filesystem and read the data
> > (out-of-band, not using a client, simply browsing the file
> > system of the machine hosting the zookeeper server)?
> >
> > Yes, absolutely.  You could certainly encrypt the data
> > that goes
> > through the ZooKeeper server, but since ZooKeeper is
> > supposed to be
> > doing coordination work, I think that if you don't trust
> > the server,
> > the whole situation might get a bit awkward.  I'm
> > curious about your
> > use case, since I'm pondering about doing something where
> > clients
> > don't necessarily trust other clients or machines in the
> > same network
> > (or even different users in the same machine), thus might
> > require
> > additional tightening up, but if you don't trust the server
> > itself, that
> > may be tricky.  Please note that ZooKeeper isn't meant
> > to be used just
> > as a distributed filesystem for storage, but that's
> > probably not your
> > intention anyway.
> >
> > --
> > Gustavo Niemeyer
> > http://niemeyer.net
> >
>
>
>
>


Re: General Question about Zookeeper

2009-06-25 Thread Henry Robinson
Hi Harold,

Each ZooKeeper server stores updates to znodes in logfiles, and periodic
snapshots of the state of the datatree in snapshot files.

A user who has the same permissions as the server will be able to read these
files, and can therefore recover the state of the datatree without the ZK
server intervening. ACLs are applied only by the server; there is no
filesystem-level representation of them.

Henry



On Thu, Jun 25, 2009 at 6:48 PM, Harold Lim  wrote:

>
> Hi All,
>
> How does zookeeper store data/files?
> From reading the doc, the clients can put ACL on files/znodes to limit
> read/write/create of other clients. However, I was wondering how are these
> znodes stored on Zookeeper servers?
>
> I am interested in a security aspect of zookeeper, where the clients and
> the servers don't necessarily belong to the same "group". If a client
> creates a znode in the zookeeper? Can the person, who owns the zookeeper
> server, simply look at its filesystem and read the data (out-of-band, not
> using a client, simply browsing the file system of the machine hosting the
> zookeeper server)?
>
>
> Thanks,
> Harold
>
>
>
>


Re: common client

2009-06-23 Thread Henry Robinson
+1 to this idea. It will be good to have some more focus on examples of how
to build applications using ZK; experiences here will feed back into the
design of the core.

Henry

On Tue, Jun 23, 2009 at 2:23 AM, Mahadev Konar wrote:

> Hi Stefan,
>  This would be a good addition. Feel free to open a jira and contribute the
> code. As Nitay suggested, this can go in to src/recipes/$recipe_name and
> would be quite useful.
>
> thanks
> mahadev
>
>
> On 6/22/09 4:45 PM, "Nitay"  wrote:
>
> > +1. I would be interested in things like this. I think it should be in
> > some contrib/ type thing under zookeeper, like the recipes.
> >
> > On Mon, Jun 22, 2009 at 4:41 PM, Stefan Groschupf wrote:
> >> Hi,
> >>
> >> I wonder if people are interested to work together on a zk client that
> >> support some more functionality than zk offers by default.
> >> Katta has this client and I copied the code into a couple other projects as
> >> well but I'm sure it could be better than it is.
> >>
> >> http://katta.svn.sourceforge.net/viewvc/katta/trunk/src/main/java/net/sf/katta/zk/ZKClient.java?view=markup
> >>
> >> I'm sure other would benefit from such a client.
> >>
> >> Some of the feature are:
> >> + Connect
> >> + Data and StateChangeListener - subscribe once, get events until
> >> unsubscribe
> >> + Threadsafe
> >>
> >> It is not a lot of code but I'm just tired to have it duplicated so many
> >> times.
> >> Anyone interested to join in?  Or is there something like this already?
> >> I could just copy this to a github project.
> >>
> >> Stefan
> >>
> >>
>
>


Re: zookeeper.getChildren asynchronous callback

2009-06-10 Thread Henry Robinson
Hi Satish -

As you've found out, you can set multiple identical watches per znode - the
zookeeper client will not detect identical watches in case you really meant
to call them several times. There's no way currently, as far as I know, to
clear the watches once they've been set. So your options are either to avoid
repeatedly setting them by detecting whether getChildren is a repeat call,
or by dealing with multiple invocations on the callback path and not doing
anything once you've established you're no longer interested.
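
The second option could look something like the sketch below. It is client-side only and does not use any ZooKeeper API: `CancellableWatch` is a hypothetical wrapper you would pass wherever your client expects a watcher callback, and the event is just a string for illustration.

```python
import threading

class CancellableWatch:
    """Wraps a callback so it can be switched off after registration.

    The server still fires the watch; a cancelled wrapper simply
    swallows the event instead of passing it on.
    """

    def __init__(self, callback):
        self._callback = callback
        self._active = True
        self._lock = threading.Lock()

    def cancel(self):
        with self._lock:
            self._active = False

    def __call__(self, event):
        with self._lock:
            if not self._active:
                return  # caller gave up on this attempt; ignore
        self._callback(event)

# Ten attempts, nine of which timed out and were abandoned.
received = []
watches = [CancellableWatch(received.append) for _ in range(10)]
for w in watches[:-1]:
    w.cancel()

# Later a child appears and all ten registered watches fire.
for w in watches:
    w("NodeChildrenChanged")

print(received)
```

Only the live attempt sees the event. The cost of the context switch into the callback is still paid ten times, which is the part a real clearWatches call would avoid.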

It might well make sense to add a clearWatches(path) call to the API, which
would be useful particularly for clients where callbacks are expensive and
require a context switch (which I think is true for all clients right now!).

Henry

On Wed, Jun 10, 2009 at 8:05 PM, Satish Bhatti  wrote:

> I am using the asynchronous (callback) version of zookeeper.getChildren().
>  That call returns immediately, I then wait for a certain time interval for
> nodes to appear, and if not I exit the method that made the
> zookeeper.getChildren()
> call.  Later on, a node gets added under that node and I see in my logfile
> that the Watcher.process() callback that I set above gets called.  Now if I
> make 10 failed attempts to get a node using the above technique, and at
> some
> later time a node does get added, I see in the logfile that the
> Watcher.process() ends up being called 10 times!  Of course by this time I
> have totally lost interest in those callbacks.  Question:  Is there a way
> to
> remove that asynchronous callback?  i.e. If I make a asynchronous
> zookeeper.getChildren()
> call, wait time t, give up, at that point can I remove the async callback?
> Satish
>


Re: ConnectionLoss (node too big?)

2009-06-03 Thread Henry Robinson
On Wed, Jun 3, 2009 at 5:57 PM, Eric Bowman  wrote:

> At some point I'll spend some time understanding how this really affects
> latency in my case ... I'm keeping just a handful of things that are
> about 10M in the ensemble, so the memory footprint is no problem.  But
> the network bandwidth could be ... I'll check it out.
>

If you do get some performance numbers, maybe you could share them, perhaps
on the JIRA Patrick linked to earlier?

cheers,

Henry


>
> Thanks,
> Eric
>
> --
> Eric Bowman
> Boboco Ltd
> ebow...@boboco.ie
> http://www.boboco.ie/ebowman/pubkey.pgp
> +35318394189/+353872801532
>
>


Re: ConnectionLoss (node too big?)

2009-06-03 Thread Henry Robinson
On Wed, Jun 3, 2009 at 5:27 PM, Eric Bowman  wrote:

>
> Anybody have any experience popping this up a bit bigger?  What kind of
> bad things happen?
>

I don't have personal experience of upping this restriction. However, my
understanding is that if data sizes get large, writing them to network and
disk quickly becomes the bottleneck. Since ZK (presumably) has to guarantee
that writes hit the disk on at least a quorum of followers, the time taken
to process lots of large writes is going to be bounded from below by the
time it takes at least one node to write them all serially. This then
affects ZK's performance.
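
A back-of-the-envelope version of that lower bound, with made-up numbers: the disk throughput and write counts here are assumptions for illustration, not measurements, and the function name is hypothetical.

```python
def min_seconds_for_writes(num_writes, bytes_per_write, disk_bytes_per_sec):
    """Lower bound on the time one node needs to write everything serially.

    Ignores network transfer, fsync latency and snapshotting, so real
    numbers can only be worse than this.
    """
    return num_writes * bytes_per_write / disk_bytes_per_sec

MB = 1024 * 1024

# Assumed workload: 100 writes of 10 MB each, disk sustaining 50 MB/s.
print(min_seconds_for_writes(100, 10 * MB, 50 * MB))  # 20.0
```

Every write has to go through this on at least a quorum of servers, so the bound applies to the ensemble as a whole, not just to one slow machine.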

Henry


>
> Thanks,
> Eric
>
> --
> Eric Bowman
> Boboco Ltd
> ebow...@boboco.ie
> http://www.boboco.ie/ebowman/pubkey.pgp
> +35318394189/+353872801532
>
>


Re: a question about Zookeep

2009-05-13 Thread Henry Robinson
Hi -

This is designed behaviour. In the latest version, the exception
thrown will be labeled "Responded to info probe". The server
disconnects connections that send four-letter commands deliberately -
I'm guessing because these tend to be one-shot commands and keeping a
socket around indefinitely is wasteful.

If you want to persist your connection you must serialise a connection
request and then negotiate the connection protocol with the server.
This is quite tricky to do by hand! The Java / C client shells support
this though (although they don't support the four-letter word commands
like 'ruok') and are probably easier to use if you want to send
commands interactively.
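
If you only want the one-shot four-letter commands, a small script is enough. This is a minimal sketch using plain sockets; the function name is my own, and it assumes the server replies and then closes the connection, as described above.

```python
import socket

def four_letter_word(host, port, command, timeout=5.0):
    """Send a four-letter command (e.g. 'ruok', 'stat') and return the reply.

    The server closes the connection after answering, so we read
    until end-of-stream rather than expecting a fixed-size reply.
    """
    if len(command) != 4:
        raise ValueError("expected a four-letter command")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the socket: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

For example, `four_letter_word("127.0.0.1", 2181, "ruok")` should come back with `imok` from a healthy server, and the close you see in the server log afterwards is the expected end of the exchange.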

Hope this helps,

Henry

On Wed, May 13, 2009 at 2:26 AM, Qian Ye  wrote:
> Hi guys,
>
> I have a question about connecting to zookeeper by nc or telnet. I ran
> zookeeper in Multi-Server mode, and connected to the server using $nc
> 127.0.0.1 2181. So far, all works. Then I tried command "stat", it showed
> the following:
>
> Zookeeper version: 3.1.1-755636, built on 03/18/2009 16:52 GMT
> Clients:
>  /127.0.0.1:32818[1](queued=0,recved=0,sent=0)
>
> Latency min/avg/max: 1/2/3
> Received: 217
> Sent: 218
> Outstanding: 0
> Zxid: 0x4
> Mode: follower
> Node count: 4
> 2009-05-13 17:16:51,615 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@417]
> - Exception causing close of session 0x0 due to java.io.IOException: closing
> 2009-05-13 17:16:51,615 - INFO  [NIOServerCxn.Factory:2181:nioserverc...@752]
> - closing session:0x0 NIOServerCnxn:
> java.nio.channels.SocketChannel[connected local=/127.0.0.1:2181 remote=/
> 127.0.0.1:32818]
>
> and return to the shell.
>
> The last two lines were issued by log4j (I think, I'm not so familiar with
> things about Java :-p). It seems that java.io.IOException was thrown for
> some reason. I'm not sure about why this happened. Could any one give me
> some help?
>
> --
> With Regards!
>
> Ye, Qian
> Made in Zhejiang University
>