Re: 3.4 Release.

2011-08-05 Thread Vishal Kher
Sorry, I meant to say 1144 instead of 1145 in my earlier mail :-) I will
fix 1144.
Eugene, you can leave 1145 assigned to me for now. If 1144 doesn't resolve
it, I will assign it back to you :-)

On Fri, Aug 5, 2011 at 7:28 PM, Eugene Koontz  wrote:

> On 8/5/11 4:03 PM, Vishal Kher wrote:
>
>> Thanks, Camille. I am pretty sure that the bug that  Eugene is seeing is
>> because of 1145 as well.  In the interest of time, I wanted to make sure
>> that I wasn't doing the same changes that Eugene was working on. I will
>> fix
>> 1145 since I had spent some time debugging it and have a fair idea of the
>> fix. I think we need help with the other tests  (though they could be
>> failing because of 1145).
>>
>> -Vishal
>>
>>
> Thanks Vishal; I will reassign 1145 to you.
> -Eugene
>


[jira] [Assigned] (ZOOKEEPER-1145) ObserverTest.testObserver fails at particular point after several runs of ant junit.run -Dtestcase=ObserverTest

2011-08-05 Thread Eugene Koontz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koontz reassigned ZOOKEEPER-1145:


Assignee: Vishal Kher  (was: Eugene Koontz)

> ObserverTest.testObserver fails at particular point after several runs of ant 
> junit.run -Dtestcase=ObserverTest
> --
>
> Key: ZOOKEEPER-1145
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1145
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.0
> Reporter: Eugene Koontz
> Assignee: Vishal Kher
> Priority: Blocker
> Fix For: 3.4.0
>
> Attachments: out.txt, repeat.sh
>
>
> Use the attached repeat.sh to run ObserverTest repeatedly by doing: 
> src/repeat.sh ObserverTest
> The test will fail eventually after a few iterations; it should take only a 
> few minutes.
> The line that fails in the test is: 
> zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT_OBS,
> ClientBase.CONNECTION_TIMEOUT, this);
> Attached as out.txt is the output showing a successful run, for comparison, 
> followed by a failed run.
> Note that in the following lines, shortly before the test fails, there is a 
> 24 second gap in time (between 22:13:02 and 22:13:26):
> [junit] 2011-08-03 22:13:02,167 [myid:3] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11229:ZooKeeperServer@833] - Client 
> attempting to establish new session at /127.0.0.1:46929
> [junit] 2011-08-03 22:13:26,003 [myid:2] - INFO  
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11228:Leader@419] - Shutting down
> [junit] 2011-08-03 22:13:26,003 [myid:2] - INFO  
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11228:Leader@425] - Shutdown called
> [junit] java.lang.Exception: shutdown Leader! reason: Only 0 followers, need 1
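
The size of the gap can be checked directly from the two timestamps in the log excerpt above. The following is only a minimal sketch: the class and method names (LogGap, gapMillis) are made up for illustration and are not part of ZooKeeper or of the attached repeat.sh.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogGap {
    // Timestamp layout used by the [junit] log lines quoted above.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the elapsed milliseconds between two log timestamps. */
    static long gapMillis(String earlier, String later) {
        LocalDateTime a = LocalDateTime.parse(earlier, FMT);
        LocalDateTime b = LocalDateTime.parse(later, FMT);
        return Duration.between(a, b).toMillis();
    }

    public static void main(String[] args) {
        // Last "new session" line versus the first "Shutting down" line.
        long gap = gapMillis("2011-08-03 22:13:02,167", "2011-08-03 22:13:26,003");
        System.out.println(gap + " ms"); // prints "23836 ms"
    }
}
```

So the silence between the session attempt and the leader shutdown is a little under 24 seconds.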

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: 3.4 Release.

2011-08-05 Thread Eugene Koontz

On 8/5/11 4:03 PM, Vishal Kher wrote:

Thanks, Camille. I am pretty sure that the bug that  Eugene is seeing is
because of 1145 as well.  In the interest of time, I wanted to make sure
that I wasn't doing the same changes that Eugene was working on. I will fix
1145 since I had spent some time debugging it and have a fair idea of the
fix. I think we need help with the other tests  (though they could be
failing because of 1145).

-Vishal



Thanks Vishal; I will reassign 1145 to you.
-Eugene


Re: 3.4 Release.

2011-08-05 Thread Vishal Kher
Thanks, Camille. I am pretty sure that the bug that  Eugene is seeing is
because of 1145 as well.  In the interest of time, I wanted to make sure
that I wasn't doing the same changes that Eugene was working on. I will fix
1145 since I had spent some time debugging it and have a fair idea of the
fix. I think we need help with the other tests  (though they could be
failing because of 1145).

-Vishal


On Fri, Aug 5, 2011 at 2:47 PM, Fournier, Camille F. <
camille.fourn...@gs.com> wrote:

> I might have time this weekend to look at some of this. If I do, I will
> keep you guys in the loop.
>
> C
>
> -----Original Message-----
> From: Eugene Koontz [mailto:ekoo...@hiro-tan.org]
> Sent: Friday, August 05, 2011 2:02 PM
> To: dev@zookeeper.apache.org
> Subject: Re: 3.4 Release.
>
> On 8/4/11 4:48 PM, Vishal Kher wrote:
> > Is your fix for 1145 also going to fix 1144 (are you fixing the Leader)?
> >
> Hi Vishal,
> I'm not at all sure yet. I will update 1145 when I have a chance
> to look at it next week.
> -Eugene
>


Re: 3.4 Release.

2011-08-05 Thread Mahadev Konar
That's great!


mahadev


On Fri, Aug 5, 2011 at 11:47 AM, Fournier, Camille F. <
camille.fourn...@gs.com> wrote:

> I might have time this weekend to look at some of this. If I do, I will
> keep you guys in the loop.
>
> C
>
> -----Original Message-----
> From: Eugene Koontz [mailto:ekoo...@hiro-tan.org]
> Sent: Friday, August 05, 2011 2:02 PM
> To: dev@zookeeper.apache.org
> Subject: Re: 3.4 Release.
>
> On 8/4/11 4:48 PM, Vishal Kher wrote:
> > Is your fix for 1145 also going to fix 1144 (are you fixing the Leader)?
> >
> Hi Vishal,
> I'm not at all sure yet. I will update 1145 when I have a chance
> to look at it next week.
> -Eugene
>


Re: 3.4 Release.

2011-08-05 Thread Eugene Koontz

On 08/05/2011 11:47 AM, Fournier, Camille F. wrote:

I might have time this weekend to look at some of this. If I do, I will keep 
you guys in the loop.

C


Thanks Camille,
Much appreciated.
-Eugene


RE: 3.4 Release.

2011-08-05 Thread Fournier, Camille F.
I might have time this weekend to look at some of this. If I do, I will keep 
you guys in the loop.

C

-----Original Message-----
From: Eugene Koontz [mailto:ekoo...@hiro-tan.org] 
Sent: Friday, August 05, 2011 2:02 PM
To: dev@zookeeper.apache.org
Subject: Re: 3.4 Release.

On 8/4/11 4:48 PM, Vishal Kher wrote:
> Is your fix for 1145 also going to fix 1144 (are you fixing the Leader)?
>
Hi Vishal,
I'm not at all sure yet. I will update 1145 when I have a chance 
to look at it next week.
-Eugene


Re: 3.4 Release.

2011-08-05 Thread Eugene Koontz

On 8/4/11 4:48 PM, Vishal Kher wrote:

Is your fix for 1145 also going to fix 1144 (are you fixing the Leader)?


Hi Vishal,
I'm not at all sure yet. I will update 1145 when I have a chance 
to look at it next week.

-Eugene


Re: devops/admin/client question: What do you do when you rollback?

2011-08-05 Thread Patrick Hunt
On Fri, Aug 5, 2011 at 9:01 AM, Fournier, Camille F.
 wrote:
> Actually, can I update the ConnectRequest protocol version number? If I 
> can do that, I can have the server send back the indicating 
> ConnectResponse only to clients with a higher protocol version. That field 
> doesn't look like it's read anywhere right now.
> (Moving this to dev since we've moved to a dev discussion)

That's what I was going to suggest: upping the protocol version
number for new clients. New servers can respond with
ConnectionResponse2 if they see the new version; this response should
have improved semantics. Otherwise they can just respond with the old
version/response. New clients will have to handle both types of responses.
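
A rough sketch of that negotiation, for illustration only. Everything named here (ConnectNegotiation, ResponseKind, the version constants, the error code) is hypothetical and invented for this example; none of it is existing ZooKeeper API, and the actual version numbers were not decided in this thread.

```java
public class ConnectNegotiation {
    // Hypothetical version numbers, chosen only for this sketch.
    static final int LEGACY_PROTOCOL_VERSION = 0;
    static final int EXTENDED_PROTOCOL_VERSION = 1;

    /** The two response shapes discussed above; names are illustrative. */
    enum ResponseKind { LEGACY, EXTENDED }

    /** Server side: choose the response format from the version the client sent. */
    static ResponseKind chooseResponse(int clientProtocolVersion) {
        return clientProtocolVersion >= EXTENDED_PROTOCOL_VERSION
                ? ResponseKind.EXTENDED
                : ResponseKind.LEGACY;
    }

    /** New client side: must accept either shape, because the server may be old. */
    static String describe(ResponseKind kind, int errorCode) {
        if (kind == ResponseKind.EXTENDED) {
            // Only the extended response carries an explicit error code.
            return errorCode == 0 ? "connected" : "rejected, error code " + errorCode;
        }
        return "connected (legacy response, no error detail)";
    }

    public static void main(String[] args) {
        // An old client keeps getting the old response from a new server.
        System.out.println(chooseResponse(LEGACY_PROTOCOL_VERSION));
        // A new client gets the extended response and can see why it was rejected.
        ResponseKind kind = chooseResponse(EXTENDED_PROTOCOL_VERSION);
        System.out.println(describe(kind, 7));
    }
}
```

The point of gating on the client's version is that old clients never see a response they cannot parse, while new clients fall back cleanly when talking to an old server.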

Patrick

>
> -----Original Message-----
> From: Fournier, Camille F. [Tech]
> Sent: Friday, August 05, 2011 11:57 AM
> To: 'u...@zookeeper.apache.org'
> Subject: RE: devops/admin/client question: What do you do when you rollback?
>
> Hmmm. I thought I had another way around this but I don't. We really didn't 
> write the client to make it easy to encode other errors in the connection 
> result... I think any good solution will have to be in our 4.0 Clojure 
> rewrite ;)
>
> C
>
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> Sent: Friday, August 05, 2011 11:51 AM
> To: u...@zookeeper.apache.org
> Subject: Re: devops/admin/client question: What do you do when you rollback?
>
> If you get the lower zxid from the leader then you know that things have
> gone south.
>
> Likewise, if you get a lower epoch number from a node that thinks that it is
> in quorum then things are not good.  The definition of "thinks it is in
> quorum" is problematic of course.
>
> On Fri, Aug 5, 2011 at 10:57 AM, Fournier, Camille F. <
> camille.fourn...@gs.com> wrote:
>
>> Oh blah, of course it won't be b/w compatible, because all the older
>> clients would expire their sessions in the instance of a single zxid higher
>> than the cluster zxid which I doubt most people want.
>>
>> Is there a way to check if the zxid of the client is higher than the
>> current possible zxid after connection, and send the session_expired then?
>> That would at least help us out most of the way.
>>
>> -----Original Message-----
>> From: Patrick Hunt [mailto:ph...@apache.org]
>> Sent: Thursday, August 04, 2011 7:23 PM
>> To: u...@zookeeper.apache.org
>> Subject: Re: devops/admin/client question: What do you do when you
>> rollback?
>>
>> Sounds reasonable to me as long as it's b/w compatible (which it seems
>> like it would be), anything we can do to improve this situation would
>> be huge - I frequently see our support team trying to address this
>> (e.g. the max count exceeded issue) with clients like hbase. Def plus
>> for supportability.
>>
>> Patrick
>>
>> On Thu, Aug 4, 2011 at 4:11 PM, Camille Fournier 
>> wrote:
>> > I'm thinking of hacking it through the connectresponse session timeout
>> > (similar to the way we detect session rejected). I wrote up a prototype
>> that
>> > worked ok this way. Might could extend this hack to other things, using
>> that
>> > field as an encoded error msg, thoughts?
>> >
>> > C
>> > On Aug 4, 2011 6:10 PM, "Patrick Hunt"  wrote:
>> >> Our error reporting server->client has always been weak. It's a PITA
>> >> to debug in production because a lot of times when the client gets
>> >> bounced it's not clear from the client side why (you end up having to
>> >> search the server log - for example when maxClientCount is exceeded).
>> >> It would be great to fix this, esp if the server could provide insight
>> >> to the client about why (an error code/message perhaps). Doing it in a
>> >> b/w compatible way might be tough though...
>> >>
>> >> Patrick
>> >>
>> >> On Thu, Aug 4, 2011 at 2:45 PM, Ted Dunning 
>> wrote:
>> >>> This is used normally to guarantee in-order data views.  If you get
>> >>> disconnected from one host in an advanced state and then connect to an
>> > out
>> >>> of date slave, ZK automatically disconnects you to avoid letting you
>> see
>> >>> time go backwards.  Your situation is different of course.
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Aug 4, 2011 at 7:05 PM, Fournier, Camille F. <
>> >>> camille.fourn...@gs.com> wrote:
>> >>>
>> >>>> Right now the server just detects that the zxid is wrong, and calls close
>> >>>> on the client. The client logs:
>> >>>> 15:01:47,593 - INFO
>> >>>>  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1159] - Unable to
>> >>>> read additional data from server sessionid 0x131962b0054, likely server
>> >>>> has closed socket, closing socket connection and attempting reconnect
>> >>>> (branch 3.3.3)
>> >>>>
>> >>>> I will poke around and see if I can figure out a nicer way to indicate this
>> >>>> condition. The expired state is perfectly fine for me in my use case.
>> >>>>
>> >>>> C
>> >>>>
>> >>>>
>> >>>> -----Original Message-----
>> >>>> From: Patrick Hunt [mailto:ph...@apache.org]
>> >>>> Sent: Thursday, August 0

RE: devops/admin/client question: What do you do when you rollback?

2011-08-05 Thread Fournier, Camille F.
Actually, can I update the ConnectRequest protocol version number? If I can 
do that, I can have the server send back the indicating ConnectResponse only to 
clients with a higher protocol version. That field doesn't look like it's read 
anywhere right now.
(Moving this to dev since we've moved to a dev discussion)

C

-----Original Message-----
From: Fournier, Camille F. [Tech] 
Sent: Friday, August 05, 2011 11:57 AM
To: 'u...@zookeeper.apache.org'
Subject: RE: devops/admin/client question: What do you do when you rollback?

Hmmm. I thought I had another way around this but I don't. We really didn't 
write the client to make it easy to encode other errors in the connection 
result... I think any good solution will have to be in our 4.0 Clojure rewrite ;)

C


-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com] 
Sent: Friday, August 05, 2011 11:51 AM
To: u...@zookeeper.apache.org
Subject: Re: devops/admin/client question: What do you do when you rollback?

If you get the lower zxid from the leader then you know that things have
gone south.

Likewise, if you get a lower epoch number from a node that thinks that it is
in quorum then things are not good.  The definition of "thinks it is in
quorum" is problematic of course.
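
For reference, a zxid is a 64-bit value with the epoch in the high 32 bits and a counter in the low 32 bits, so both checks above come down to simple comparisons. The sketch below is only an illustration of that idea; the class and method names (ZxidCheck, serverIsBehind, epochWentBackwards) are invented here and are not ZooKeeper API, and it says nothing about how to decide that a node really is in quorum.

```java
public class ZxidCheck {
    /** Epoch: the high 32 bits of a zxid. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** Counter: the low 32 bits of a zxid. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** True when the server reports a zxid lower than one the client has already seen. */
    static boolean serverIsBehind(long clientLastZxid, long serverZxid) {
        return serverZxid < clientLastZxid;
    }

    /** True when the server's epoch is lower than the epoch the client last saw. */
    static boolean epochWentBackwards(long clientLastZxid, long serverZxid) {
        return epochOf(serverZxid) < epochOf(clientLastZxid);
    }

    public static void main(String[] args) {
        long clientLastZxid = 0x500000003L; // epoch 5, counter 3 (example values)
        long rolledBack = 0x400000010L;     // epoch 4, counter 16 (example values)
        System.out.println(serverIsBehind(clientLastZxid, rolledBack));     // prints "true"
        System.out.println(epochWentBackwards(clientLastZxid, rolledBack)); // prints "true"
    }
}
```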

On Fri, Aug 5, 2011 at 10:57 AM, Fournier, Camille F. <
camille.fourn...@gs.com> wrote:

> Oh blah, of course it won't be b/w compatible, because all the older
> clients would expire their sessions in the instance of a single zxid higher
> than the cluster zxid which I doubt most people want.
>
> Is there a way to check if the zxid of the client is higher than the
> current possible zxid after connection, and send the session_expired then?
> That would at least help us out most of the way.
>
> -----Original Message-----
> From: Patrick Hunt [mailto:ph...@apache.org]
> Sent: Thursday, August 04, 2011 7:23 PM
> To: u...@zookeeper.apache.org
> Subject: Re: devops/admin/client question: What do you do when you
> rollback?
>
> Sounds reasonable to me as long as it's b/w compatible (which it seems
> like it would be), anything we can do to improve this situation would
> be huge - I frequently see our support team trying to address this
> (e.g. the max count exceeded issue) with clients like hbase. Def plus
> for supportability.
>
> Patrick
>
> On Thu, Aug 4, 2011 at 4:11 PM, Camille Fournier 
> wrote:
> > I'm thinking of hacking it through the connectresponse session timeout
> > (similar to the way we detect session rejected). I wrote up a prototype
> that
> > worked ok this way. Might could extend this hack to other things, using
> that
> > field as an encoded error msg, thoughts?
> >
> > C
> > On Aug 4, 2011 6:10 PM, "Patrick Hunt"  wrote:
> >> Our error reporting server->client has always been weak. It's a PITA
> >> to debug in production because a lot of times when the client gets
> >> bounced it's not clear from the client side why (you end up having to
> >> search the server log - for example when maxClientCount is exceeded).
> >> It would be great to fix this, esp if the server could provide insight
> >> to the client about why (an error code/message perhaps). Doing it in a
> >> b/w compatible way might be tough though...
> >>
> >> Patrick
> >>
> >> On Thu, Aug 4, 2011 at 2:45 PM, Ted Dunning 
> wrote:
> >>> This is used normally to guarantee in-order data views.  If you get
> >>> disconnected from one host in an advanced state and then connect to an
> > out
> >>> of date slave, ZK automatically disconnects you to avoid letting you
> see
> >>> time go backwards.  Your situation is different of course.
> >>>
> >>>
> >>>
> >>> On Thu, Aug 4, 2011 at 7:05 PM, Fournier, Camille F. <
> >>> camille.fourn...@gs.com> wrote:
> >>>
> >>>> Right now the server just detects that the zxid is wrong, and calls close
> >>>> on the client. The client logs:
> >>>> 15:01:47,593 - INFO
> >>>>  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1159] - Unable to
> >>>> read additional data from server sessionid 0x131962b0054, likely server
> >>>> has closed socket, closing socket connection and attempting reconnect
> >>>> (branch 3.3.3)
> >>>>
> >>>> I will poke around and see if I can figure out a nicer way to indicate this
> >>>> condition. The expired state is perfectly fine for me in my use case.
> >>>>
> >>>> C
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: Patrick Hunt [mailto:ph...@apache.org]
> >>>> Sent: Thursday, August 04, 2011 1:51 PM
> >>>> To: u...@zookeeper.apache.org
> >>>> Subject: Re: devops/admin/client question: What do you do when you
> >>>> rollback?
> >>>>
> >>>> On Thu, Aug 4, 2011 at 10:29 AM, Fournier, Camille F.
> >>>>  wrote:
> >>>> > We had an issue here the other day where the ZK servers were running
> >>>> > poorly, and in an effort to get them healthy again we ended up rolling back
> >>>> > the cluster state. While this was, in retrospect, not the right solution to
> >>>> > the problem we were facing, it brought up anot