[jira] Updated: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed

2010-10-20 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-804:
---

Attachment: ZOOKEEPER-804-1.patch

Updated patch to apply against latest trunk (hopefully branch too).

 c unit tests failing due to assertion cptr failed
 ---

 Key: ZOOKEEPER-804
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804
 Project: Zookeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.4.0
 Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel)
Reporter: Patrick Hunt
Assignee: Michi Mutsuzaki
Priority: Critical
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-804-1.patch, ZOOKEEPER-804-1.patch, 
 ZOOKEEPER-804.patch


 I'm seeing this frequently:
  [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK
  [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK
  [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK
  [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : 
 elapsed 25687 : OK
  [exec] zktest-mt: 
 /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: 
 zookeeper_process: Assertion `cptr' failed.
  [exec] make: *** [run-check] Aborted
  [exec] Zookeeper_simpleSystem::testHangingClient
 Mahadev can you take a look?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed

2010-10-20 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-804:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

+1 on the second patch. Tested and it seems fine, committed to trunk/branch33 
both.





Re: Restarting discussion on ZooKeeper as a TLP

2010-10-20 Thread Patrick Hunt
It's been a few days, any thoughts? Acceptable? I'd like to keep moving the
ball forward. Thanks.

Patrick

On Sun, Oct 17, 2010 at 8:43 PM, 明珠刘 redis...@gmail.com wrote:

 +1

 2010/10/14 Patrick Hunt ph...@apache.org

  In March of this year we discussed a request from the Apache Board, and
  Hadoop PMC, that we become a TLP rather than a subproject of Hadoop:
 
  Original discussion
  http://markmail.org/thread/42cobkpzlgotcbin
 
  I originally voted against this move, my primary concern being that we
 were
  not ready to move to tlp status given our small contributor base and
  limited contributor diversity. However I'd now like to revisit that
  discussion/decision. Since that time the team has been working hard to
  attract new contributors, and we've seen significant new contributions
 come
  in. There has also been feedback from board/pmc addressing many of these
  concerns (both on the list and in private). I am now less concerned about
  this issue and don't see it as a blocker for us to move to TLP status.
 
  A second concern was that by becoming a TLP the project would lose its
  connection with Hadoop, a big source of new users for us. I've been
 assured
  (and you can see with the other projects that have moved to tlp status;
  pig/hive/hbase/etc...) that this connection will be maintained. The
 Hadoop
  ZooKeeper tab for example will redirect to our new homepage.
 
  Other Apache members also pointed out to me that we are essentially
  operating as a TLP within the Hadoop PMC. Most of the other PMC members
  have
  little or no experience with ZooKeeper and this makes it difficult for
 them
  to monitor and advise us. By moving to TLP status we'll be able to govern
  ourselves and better set our direction.
 
  I believe we are ready to become a TLP. Please respond to this email with
  your thoughts and any issues. I will call a vote in a few days, once
  discussion settles.
 
  Regards,
 
  Patrick
 



Re: Fix release 3.3.2 planning, status.

2010-10-20 Thread Patrick Hunt
https://issues.apache.org/jira/browse/ZOOKEEPER-794
I've done a bunch of testing on a number of machines, could someone take a
look at this and +1 it? (or not) I'd like to get 3.3.2 moving.

Regards,

Patrick

On Mon, Oct 18, 2010 at 9:19 AM, Patrick Hunt ph...@apache.org wrote:

 Hi Camille, unfortunately there's a blocker on 3.3.2 at the moment.
 http://bit.ly/asOSNl I just updated that patch to fix a build issue,
 hopefully one of the committers can review asap.

 Additionally there are a number of other patch available patches attached
 to 3.3.2. I'd like to get those included given everyone's done a bunch of
 work on them. Again, committers need to review/commit/reject appropriately.

 What do ppl think, are we pretty close? Ben/Flavio/Henry/Mahadev please
 review some of the outstanding patches. Coordinate with me if you have
 issues/questions.

 Regards,

 Patrick


 On Mon, Oct 18, 2010 at 7:56 AM, Fournier, Camille F. [Tech] 
 camille.fourn...@gs.com wrote:

 Hi guys,

 Any updates on the 3.3.2 release schedule? Trying to plan a release myself
 and wondering if I'll have to go to production with patched 3.3.1 or have
 time to QA with the 3.3.2 release.

 Thanks,
 Camille

 -Original Message-
 From: Patrick Hunt [mailto:ph...@apache.org]
 Sent: Thursday, September 23, 2010 12:45 PM
 To: zookeeper-dev@hadoop.apache.org
 Subject: Fix release 3.3.2 planning, status.

 Looking at the JIRA queue for 3.3.2 I see that there are two blockers, one
 is currently PA and the other is pretty close (it has a patch that should
 go
 in soon).

 There are a few JIRAs that already went into the branch that are important
 to get out there ASAP, esp ZOOKEEPER-846 (fix close issue found by hbase).

 One issue that's been slowing us down is hudson. The trunk was not passing
 its hudson validation, which was causing a slowdown in patch review.
 Mahadev and I fixed this. However, with recent changes to the hudson
 hw/security environment, the automated patch testing process is broken.
 Giri is working on this. In the meantime we'll have to test ourselves.
 Committers -- be sure to verify RAT, Findbugs, etc... in addition to
 verifying via test. I've setup an additional Hudson environment inside
 Cloudera that also verifies the trunk/branch. If issues are found I will
 report them (unfortunately I can't provide access to cloudera's hudson env
 to non-cloudera employees at this time).

 I'd like to clear out the PAs asap and get a release candidate built.
 Anyone
 see a problem with shooting for an RC mid next week?

 Patrick





Re: Restarting discussion on ZooKeeper as a TLP

2010-10-20 Thread Vishal K
+1.

On Wed, Oct 20, 2010 at 1:50 PM, Patrick Hunt ph...@apache.org wrote:

 It's been a few days, any thoughts? Acceptable? I'd like to keep moving the
 ball forward. Thanks.

 Patrick




Re: implications of netty on client connections

2010-10-20 Thread Patrick Hunt
It may just be the case that we haven't tested sufficiently for this case
(running out of fds) and we need to handle this better even in nio. Probably
by cutting off op_connect in the selector. We should be able to do similar
in netty.

Btw, on unix one can access the open/max fd count using this:
http://download.oracle.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html
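
As a minimal sketch of reading those counts (assuming a HotSpot/OpenJDK JVM
on a unix platform, where the platform bean implements the com.sun.management
extension; the class and method names here are shaped for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdMonitor {
    /**
     * Returns {open, max} file descriptor counts, or null when the platform
     * bean does not expose them (non-unix or non-HotSpot JVMs).
     */
    public static long[] fdCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            return new long[] { unix.getOpenFileDescriptorCount(),
                                unix.getMaxFileDescriptorCount() };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] fds = fdCounts();
        if (fds != null) {
            System.out.println("open fds: " + fds[0] + ", max fds: " + fds[1]);
        }
    }
}
```

A server could poll this and stop registering OP_CONNECT once the open count
approaches the max, rather than failing accepts abruptly.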


Secondly, are you running into a kernel limit or a zk limit? Take a look at
this post describing 1million concurrent connections to a box:
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3

specifically:
--

During various test with lots of connections, I ended up making some
additional changes to my sysctl.conf. This was part trial-and-error, I don’t
really know enough about the internals to make especially informed decisions
about which values to change. My policy was to wait for things to break,
check /var/log/kern.log and see what mysterious error was reported, then
increase stuff that sounded sensible after a spot of googling. Here are the
settings in place during the above test:

net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 16384 33554432
net.ipv4.tcp_wmem = 4096 16384 33554432
net.ipv4.tcp_mem = 786432 1048576 26777216
net.ipv4.tcp_max_tw_buckets = 36
net.core.netdev_max_backlog = 2500
vm.min_free_kbytes = 65536
vm.swappiness = 0
net.ipv4.ip_local_port_range = 1024 65535

--


I'm guessing that even with this, at some point you'll run into a limit in
our server implementation. In particular I suspect that we may start to
respond more slowly to pings, eventually getting so bad it would time out.
We'd have to debug that and address (optimize).

Patrick

On Tue, Oct 19, 2010 at 7:16 AM, Fournier, Camille F. [Tech] 
camille.fourn...@gs.com wrote:

 Hi everyone,

 I'm curious what the implications of using netty are going to be for the
 case where a server gets close to its max available file descriptors. Right
 now our somewhat limited testing has shown that a ZK server performs fine up
 to the point when it runs out of available fds, at which point performance
 degrades sharply and new connections get into a somewhat bad state. Is netty
 going to enable the server to handle this situation more gracefully (or is
 there a way to do this already that I haven't found)? Limiting connections
 from the same client is not enough since we can potentially have far more
 clients wanting to connect than available fds for certain use cases we might
 consider.

 Thanks,
 Camille




Re: Restarting discussion on ZooKeeper as a TLP

2010-10-20 Thread Henry Robinson
+1, thanks for following through with the protocol.

On 20 October 2010 11:02, Vishal K vishalm...@gmail.com wrote:

 +1.

 On Wed, Oct 20, 2010 at 1:50 PM, Patrick Hunt ph...@apache.org wrote:

  It's been a few days, any thoughts? Acceptable? I'd like to keep moving
 the
  ball forward. Thanks.
 
  Patrick
 
 




-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


[jira] Created: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages

2010-10-20 Thread Vishal K (JIRA)
Spurious KeeperErrorCode = Session moved messages
---

 Key: ZOOKEEPER-907
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.1
Reporter: Vishal K


The sync request does not set the session owner in Request.

As a result, the leader keeps printing:
2010-07-01 10:55:36,733 - INFO  [ProcessThread:-1:preprequestproces...@405] - 
Got user-level KeeperException when processing sessionid:0x298d3b1fa9 
type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error 
Path:null Error:KeeperErrorCode = Session moved
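
As an aside, a toy illustration of why an unset owner trips a session-moved
check on the leader (hypothetical names and structure, not ZooKeeper's actual
classes):

```java
// Hypothetical simplification: the leader compares the owner recorded for a
// session against the owner carried on each request. A request whose owner
// was never filled in (null) can never match, so every such request looks
// like it came from another peer and is rejected as "Session moved".
public class SessionOwnerCheck {
    public static class SessionMovedException extends Exception {}

    private Object owner; // recorded when the session is established here

    public void setOwner(Object o) { owner = o; }

    public void checkOwner(Object requestOwner) throws SessionMovedException {
        if (owner != null && !owner.equals(requestOwner)) {
            throw new SessionMovedException();
        }
    }
}
```

Under this reading, the fix is simply to stamp the owner onto the sync
request like every other request type, so the comparison can succeed.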





[jira] Updated: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages

2010-10-20 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-907:
---

 Priority: Blocker  (was: Major)
Fix Version/s: 3.4.0
   3.3.2

sounds like a blocker to me. can this be easily addressed? I'd still like to 
get 3.3.2 out asap. 





[jira] Updated: (ZOOKEEPER-820) update c unit tests to ensure zombie java server processes don't cause failure

2010-10-20 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-820:
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

+1, committed to trunk/branch33. Thanks Michi!

 update c unit tests to ensure zombie java server processes don't cause 
 failure
 

 Key: ZOOKEEPER-820
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-820
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.1
Reporter: Patrick Hunt
Assignee: Michi Mutsuzaki
Priority: Critical
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-820-1.patch, ZOOKEEPER-820.patch, 
 ZOOKEEPER-820.patch, ZOOKEEPER-820.patch


 When the c unit tests are run, sometimes the server doesn't shut down at the
 end of the test; this causes subsequent tests (hudson esp.) to fail.
 1) we should try harder to make the server shut down at the end of the test;
 I suspect this is related to test failure/cleanup
 2) before the tests are run we should check whether the old server is still
 running and try to shut it down




[jira] Updated: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages

2010-10-20 Thread Vishal K (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vishal K updated ZOOKEEPER-907:
---

Attachment: ZOOKEEPER-907.patch

attaching patch.





[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages

2010-10-20 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12923095#action_12923095
 ] 

Patrick Hunt commented on ZOOKEEPER-907:


Nice. Thanks! Can you include a test? Something like this really should have 
had a test already... great if you could add one.
(also hate to mention but the path on the patch is messed up (long prefix), can 
you address that as well?)





[jira] Assigned: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages

2010-10-20 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt reassigned ZOOKEEPER-907:
--

Assignee: Vishal K





[jira] Updated: (ZOOKEEPER-906) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client

2010-10-20 Thread Radu Marin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radu Marin updated ZOOKEEPER-906:
-

Attachment: (was: ZOOKEEPER.patch)

 Improve C client connection reliability by making it sleep between reconnect 
 attempts as in Java Client
 ---

 Key: ZOOKEEPER-906
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-906
 Project: Zookeeper
  Issue Type: Improvement
  Components: c client
Affects Versions: 3.3.1
Reporter: Radu Marin
Assignee: Radu Marin
 Fix For: 3.4.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 Currently, when a C client gets disconnected, it retries a couple of hosts
 (not all) with no delay between attempts, and then if it doesn't succeed it
 sleeps for 1/3 of the session expiration timeout before trying again.
 In the worst case the disconnect event can occur after 2/3 of the session
 expiration timeout has passed, and sleeping for a further 1/3 of the session
 timeout will cause a session loss most of the time.
 A better approach is to try all hosts, but with a random delay between
 reconnect attempts. The delay must also be independent of the session
 timeout, so that increasing the session timeout also increases the number of
 available attempts.
 This improvement covers the case when the C client experiences network
 problems for a short period of time and is not able to reach any zookeeper
 hosts.
 The Java client already uses this logic and it works very well.
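
The proposed behavior can be sketched as follows (a hedged illustration; the
names maxReconnectDelayMs and fullLoopFailed are assumptions for this sketch,
not the patch's actual API):

```java
import java.util.Random;

// Illustrative sketch of a randomized reconnect delay that is independent of
// the session timeout: small jitter between individual host attempts, and a
// longer randomized delay only after every host has been tried and failed.
public class ReconnectBackoff {
    private final Random random = new Random();
    private final long maxReconnectDelayMs;

    public ReconnectBackoff(long maxReconnectDelayMs) {
        this.maxReconnectDelayMs = maxReconnectDelayMs;
    }

    /** Delay before the next connection attempt, in milliseconds. */
    public long nextDelayMs(boolean fullLoopFailed) {
        // 100ms inter-host jitter is an arbitrary illustrative value.
        long bound = fullLoopFailed ? maxReconnectDelayMs : 100;
        return (long) (random.nextDouble() * bound);
    }
}
```

Because the delay bound is a fixed configuration value rather than a fraction
of the session timeout, raising the session timeout buys more reconnect
attempts instead of longer sleeps.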




[jira] Updated: (ZOOKEEPER-906) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client

2010-10-20 Thread Radu Marin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radu Marin updated ZOOKEEPER-906:
-

Attachment: ZOOKEEPER-906.patch

+ update last_connect_index when a new successful connection is established.
+ api for configuring max_reconnect_delay (zoo_set_max_reconnect_delay).






[jira] Updated: (ZOOKEEPER-906) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client

2010-10-20 Thread Radu Marin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radu Marin updated ZOOKEEPER-906:
-

Attachment: ZOOKEEPER-906.patch

called svn diff from trunk





[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages

2010-10-20 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12923107#action_12923107
 ] 

Vishal K commented on ZOOKEEPER-907:


sure, I will write a test.

What do you think is the effect of this bug? In 
PrepRequestProcessor.pRequest(), the leader will not pass the sync request to 
nextProcessor. Does that mean that the sync did not succeed?





[jira] Commented: (ZOOKEEPER-906) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client

2010-10-20 Thread Radu Marin (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12923109#action_12923109
 ] 

Radu Marin commented on ZOOKEEPER-906:
--

@Jared Cantwell:

Yes you got it right. The last_connect_index is intended to detect a complete 
unsuccessful loop through all hosts so the client can delay more (for 
max_reconnect_delay period).
It also represents the index of the last successful host connection, and indeed 
it was not updated on connection establishment.

I have fixed that in the new patch. Huge thanks for reviewing and pointing that 
out!
 





[jira] Work stopped: (ZOOKEEPER-906) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client

2010-10-20 Thread Radu Marin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on ZOOKEEPER-906 stopped by Radu Marin.

 Improve C client connection reliability by making it sleep between reconnect 
 attempts as in Java Client
 ---

 Key: ZOOKEEPER-906
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-906
 Project: Zookeeper
  Issue Type: Improvement
  Components: c client
Affects Versions: 3.3.1
Reporter: Radu Marin
Assignee: Radu Marin
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-906.patch

   Original Estimate: 24h
  Remaining Estimate: 24h





[jira] Commented: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed

2010-10-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12923116#action_12923116
 ] 

Hudson commented on ZOOKEEPER-804:
--

Integrated in ZooKeeper-trunk #973 (See 
[https://hudson.apache.org/hudson/job/ZooKeeper-trunk/973/])
ZOOKEEPER-804. c unit tests failing due to assertion cptr failed (second 
patch)


 c unit tests failing due to assertion cptr failed
 ---

 Key: ZOOKEEPER-804
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804
 Project: Zookeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.4.0
 Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel)
Reporter: Patrick Hunt
Assignee: Michi Mutsuzaki
Priority: Critical
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-804-1.patch, ZOOKEEPER-804-1.patch, 
 ZOOKEEPER-804.patch


 I'm seeing this frequently:
  [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK
  [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK
  [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK
  [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : 
 elapsed 25687 : OK
  [exec] zktest-mt: 
 /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: 
 zookeeper_process: Assertion `cptr' failed.
  [exec] make: *** [run-check] Aborted
  [exec] Zookeeper_simpleSystem::testHangingClient
 Mahadev can you take a look?




RE: implications of netty on client connections

2010-10-20 Thread Fournier, Camille F. [Tech]
Thanks Patrick, I'll look and see if I can figure out a clean change for this.
It was the kernel limit on the max number of open fds for the process where 
the problem showed up (not a ZK limit). FWIW, we tested with a process fd 
limit of 16K, and ZK performed reasonably well until the fd limit was 
reached, at which point it choked. There was some throughput degradation, but 
mostly going from 0 to 4000 connections; 4000 to 16000 was mostly flat until 
the sharp drop. For our use case a bit of performance loss with huge numbers 
of connections is fine, so long as we can handle the choke, which for the 
initial rollout I'm planning simply to monitor for.

C

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org] 
Sent: Wednesday, October 20, 2010 2:06 PM
To: zookeeper-dev@hadoop.apache.org
Subject: Re: implications of netty on client connections

It may just be that we haven't tested sufficiently for this case (running out
of fds) and we need to handle it better even in NIO, probably by cutting off
OP_CONNECT in the selector. We should be able to do something similar in
Netty.
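
One way to realize the "cut off in the selector" idea is to clear the accept
interest on the server channel when fd usage nears the limit. This is a hedged
sketch only, not ZooKeeper server code; the 90% threshold and the helper name
are illustrative assumptions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class AcceptThrottle {
    // Hypothetical guard: stop accepting new connections when fd usage is
    // high by clearing OP_ACCEPT from the server key's interest set, and
    // restore it once usage drops back under the threshold.
    static void throttle(SelectionKey serverKey, long openFds, long maxFds) {
        if (openFds > maxFds * 0.9) {
            serverKey.interestOps(serverKey.interestOps() & ~SelectionKey.OP_ACCEPT);
        } else {
            serverKey.interestOps(serverKey.interestOps() | SelectionKey.OP_ACCEPT);
        }
    }

    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.socket().bind(new InetSocketAddress(0));
        SelectionKey key = server.register(selector, SelectionKey.OP_ACCEPT);

        throttle(key, 950, 1000); // over threshold: accepts paused
        System.out.println((key.interestOps() & SelectionKey.OP_ACCEPT) == 0);
        throttle(key, 100, 1000); // back under threshold: accepts resumed
        System.out.println((key.interestOps() & SelectionKey.OP_ACCEPT) != 0);

        server.close();
        selector.close();
    }
}
```

With the interest bit cleared, the selector simply stops reporting new
connections rather than accepting them and failing mid-handshake.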

Btw, on unix one can access the open/max fd count using this:
http://download.oracle.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html
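
A minimal sketch of reading those counts via that MXBean (Unix JVMs with the
com.sun.management extensions only; other platforms fall through to the else
branch):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdCount {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // The Unix-specific subinterface exposes the fd counters; it is only
        // available on Unix-like platforms with the Sun/Oracle extensions.
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
            System.out.println("max fds:  " + unix.getMaxFileDescriptorCount());
        } else {
            System.out.println("fd counts not available on this JVM");
        }
    }
}
```

Polling these counters is a cheap way to alert well before the process hits
its fd limit.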


Secondly, are you running into a kernel limit or a zk limit? Take a look at
this post describing 1 million concurrent connections to a box:
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3

specifically:
--

During various test with lots of connections, I ended up making some
additional changes to my sysctl.conf. This was part trial-and-error, I don't
really know enough about the internals to make especially informed decisions
about which values to change. My policy was to wait for things to break,
check /var/log/kern.log and see what mysterious error was reported, then
increase stuff that sounded sensible after a spot of googling. Here are the
settings in place during the above test:

net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 16384 33554432
net.ipv4.tcp_wmem = 4096 16384 33554432
net.ipv4.tcp_mem = 786432 1048576 26777216
net.ipv4.tcp_max_tw_buckets = 36
net.core.netdev_max_backlog = 2500
vm.min_free_kbytes = 65536
vm.swappiness = 0
net.ipv4.ip_local_port_range = 1024 65535

--


I'm guessing that even with this, at some point you'll run into a limit in
our server implementation. In particular I suspect that we may start to
respond more slowly to pings, eventually getting so bad it would time out.
We'd have to debug that and address (optimize).

Patrick

On Tue, Oct 19, 2010 at 7:16 AM, Fournier, Camille F. [Tech] 
camille.fourn...@gs.com wrote:

 Hi everyone,

 I'm curious what the implications of using netty are going to be for the
 case where a server gets close to its max available file descriptors. Right
 now our somewhat limited testing has shown that a ZK server performs fine up
 to the point when it runs out of available fds, at which point performance
 degrades sharply and new connections get into a somewhat bad state. Is netty
 going to enable the server to handle this situation more gracefully (or is
 there a way to do this already that I haven't found)? Limiting connections
 from the same client is not enough since we can potentially have far more
 clients wanting to connect than available fds for certain use cases we might
 consider.

 Thanks,
 Camille




Re: (ZOOKEEPER-905) enhance zkServer.sh for easier zookeeper automation-izing

2010-10-20 Thread Nicholas Harteau
Hi there.  I submitted a patch/jira issue for zkServer.sh (ZOOKEEPER-905).  I'm 
not sure what else to say about it that's not covered in the comments.

p.s. Thanks for the great software - I'm enjoying building my applications 
around it.

--
nicholas harteau
n...@ikami.com





[jira] Updated: (ZOOKEEPER-906) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client

2010-10-20 Thread Radu Marin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radu Marin updated ZOOKEEPER-906:
-

Attachment: (was: ZOOKEEPER-906.patch)

 Improve C client connection reliability by making it sleep between reconnect 
 attempts as in Java Client
 ---

 Key: ZOOKEEPER-906
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-906
 Project: Zookeeper
  Issue Type: Improvement
  Components: c client
Affects Versions: 3.3.1
Reporter: Radu Marin
Assignee: Radu Marin
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-906.patch

   Original Estimate: 24h
  Remaining Estimate: 24h





[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages

2010-10-20 Thread Benjamin Reed (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12923200#action_12923200
 ] 

Benjamin Reed commented on ZOOKEEPER-907:
-

Yes, this will fail the sync; it will not get passed through the pipeline. It 
will give you a partial sync though :)

 Spurious KeeperErrorCode = Session moved messages
 ---

 Key: ZOOKEEPER-907
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.1
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-907.patch


 The sync request does not set the session owner in Request.
 As a result, the leader keeps printing:
 2010-07-01 10:55:36,733 - INFO  [ProcessThread:-1:preprequestproces...@405] - 
 Got user-level KeeperException when processing sessionid:0x298d3b1fa9 
 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error 
 Path:null Error:KeeperErrorCode = Session moved




[jira] Commented: (ZOOKEEPER-794) Callbacks are not invoked when the client is closed

2010-10-20 Thread Alexis Midon (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12923251#action_12923251
 ] 

Alexis Midon commented on ZOOKEEPER-794:


no pb, thanks for your close review and testing.

 Callbacks are not invoked when the client is closed
 ---

 Key: ZOOKEEPER-794
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-794
 Project: Zookeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.3.1
Reporter: Alexis Midon
Assignee: Alexis Midon
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-794.patch.txt, ZOOKEEPER-794.txt, 
 ZOOKEEPER-794_2.patch, ZOOKEEPER-794_3.patch, ZOOKEEPER-794_4.patch.txt, 
 ZOOKEEPER-794_5.patch.txt, ZOOKEEPER-794_5_br33.patch


 I noticed that ZooKeeper behaves differently for synchronous and 
 asynchronous actions on a closed ZooKeeper client: a synchronous call throws 
 a session-expired exception, while an asynchronous call does nothing. No 
 exception, no callback invocation.
 Even though the EventThread receives the Packet with the session-expired 
 error code, the packet is never processed, because the thread has already 
 been killed by the eventOfDeath. So the callback is not invoked.




[jira] Updated: (ZOOKEEPER-794) Callbacks are not invoked when the client is closed

2010-10-20 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-794:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

committed to trunk/branch33, thanks Alexis!

 Callbacks are not invoked when the client is closed
 ---

 Key: ZOOKEEPER-794
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-794
 Project: Zookeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.3.1
Reporter: Alexis Midon
Assignee: Alexis Midon
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-794.patch.txt, ZOOKEEPER-794.txt, 
 ZOOKEEPER-794_2.patch, ZOOKEEPER-794_3.patch, ZOOKEEPER-794_4.patch.txt, 
 ZOOKEEPER-794_5.patch.txt, ZOOKEEPER-794_5_br33.patch


