[jira] [Commented] (ZOOKEEPER-1440) Spurious log error messages when QuorumCnxManager is shutting down

2012-05-16 Thread Michi Mutsuzaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276523#comment-13276523
 ] 

Michi Mutsuzaki commented on ZOOKEEPER-1440:


Ah sorry I should've caught that. Jordan's new patch looks good to me. Pat, 
I'll wait for your +1 before checking in this time :)

--Michi

 Spurious log error messages when QuorumCnxManager is shutting down
 --

 Key: ZOOKEEPER-1440
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1440
 Project: ZooKeeper
  Issue Type: Bug
  Components: quorum, server
Affects Versions: 3.4.3
Reporter: Jordan Zimmerman
Assignee: Jordan Zimmerman
Priority: Minor
 Fix For: 3.5.0

 Attachments: patch.txt, patch.txt


 When shutting down the QuorumPeer, the ZK server logs unnecessary errors. See 
 QuorumCnxManager.Listener.run() - ss.accept() will throw an exception when the 
 socket is closed, and the catch (IOException e) block will log errors. It should 
 first check the shutdown field to see if the Listener is being shut down. If it 
 is, the exception is expected and no errors should be logged.
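
 A minimal sketch of that check (field and logger names follow the description
 above; this is illustrative, not the actual patch):
{code}
try {
    client = ss.accept();
    // ... hand the accepted socket off for processing ...
} catch (IOException e) {
    if (shutdown) {
        // expected during shutdown: the listener socket was closed deliberately
        LOG.debug("Listener socket closed while shutting down", e);
    } else {
        LOG.error("Exception while listening", e);
    }
}
{code}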

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (ZOOKEEPER-1467) Server principal on client side is derived using hostname.

2012-05-16 Thread Laxman (JIRA)
Laxman created ZOOKEEPER-1467:
-

 Summary: Server principal on client side is derived using hostname.
 Key: ZOOKEEPER-1467
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1467
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.3, 3.4.4, 3.5.0, 4.0.0
Reporter: Laxman
Priority: Blocker


Server principal on client side is derived using hostname.

org.apache.zookeeper.ClientCnxn.SendThread.startConnect()
{code}
   try {
       zooKeeperSaslClient = new ZooKeeperSaslClient("zookeeper/" + addr.getHostName());
   }
{code}

This may cause problems when an admin wants a customized principal like 
zookeeper/cluste...@hadoop.com, where clusterid is the cluster identifier 
rather than the host name.

IMO, the server principal should also be configurable, as Hadoop does.
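
For illustration, a hedged sketch of that suggestion (the property name
zookeeper.server.principal is an assumption here, not an existing setting):
{code}
// Use a configured principal if present, otherwise fall back to the
// hostname-derived default used today.
String serverPrincipal = System.getProperty("zookeeper.server.principal",
        "zookeeper/" + addr.getHostName());
zooKeeperSaslClient = new ZooKeeperSaslClient(serverPrincipal);
{code}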

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (ZOOKEEPER-1437) Client uses session before SASL authentication complete

2012-05-16 Thread Eugene Koontz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koontz updated ZOOKEEPER-1437:
-

Attachment: ZOOKEEPER-1437.patch

Use a CountDownLatch within ClientCnxn to control access to the outgoing packet 
queue: non-SASL packets must wait until SASL authentication has completed.
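
A minimal, self-contained sketch of the gating idea (class and method names are
illustrative, not the actual ClientCnxn code):
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Non-SASL packets wait on a latch that is counted down once SASL auth completes.
class SaslGatedQueue<P> {
    private final CountDownLatch saslDone = new CountDownLatch(1);
    private final BlockingQueue<P> outgoing = new LinkedBlockingQueue<P>();

    void queuePacket(P packet, boolean isSaslPacket) throws InterruptedException {
        if (!isSaslPacket) {
            saslDone.await();      // block ordinary requests until authenticated
        }
        outgoing.add(packet);
    }

    void saslCompleted() {
        saslDone.countDown();      // releases all waiting non-SASL packets
    }
}
{code}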

 Client uses session before SASL authentication complete
 ---

 Key: ZOOKEEPER-1437
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1437
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.3
Reporter: Thomas Weise
Assignee: Eugene Koontz
 Fix For: 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, 
 ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, 
 ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, 
 ZOOKEEPER-1437.patch


 Found issue in the context of hbase region server startup, but can be 
 reproduced w/ zkCli alone.
 getData may occur prior to SaslAuthenticated and fail with NoAuth. This is 
 not expected behavior when the client is configured to use SASL.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1437) Client uses session before SASL authentication complete

2012-05-16 Thread Eugene Koontz (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276988#comment-13276988
 ] 

Eugene Koontz commented on ZOOKEEPER-1437:
--

Excuse me, I meant ClientCnxn:queuePacket(), not 
ClientCnxn:queueSaslPacket() in my 20:26 comment above.

 Client uses session before SASL authentication complete
 ---

 Key: ZOOKEEPER-1437
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1437
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.3
Reporter: Thomas Weise
Assignee: Eugene Koontz
 Fix For: 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, 
 ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, 
 ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, 
 ZOOKEEPER-1437.patch


 Found issue in the context of hbase region server startup, but can be 
 reproduced w/ zkCli alone.
 getData may occur prior to SaslAuthenticated and fail with NoAuth. This is 
 not expected behavior when the client is configured to use SASL.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Success: ZOOKEEPER-1437 PreCommit Build #1076

2012-05-16 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-1437
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1076/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 182317 lines...]
 [exec] BUILD SUCCESSFUL
 [exec] Total time: 0 seconds
 [exec] 
 [exec] 
 [exec] 
 [exec] 
 [exec] +1 overall.  Here are the results of testing the latest attachment 
 [exec]   
http://issues.apache.org/jira/secure/attachment/12527672/ZOOKEEPER-1437.patch
 [exec]   against trunk revision 1337029.
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 15 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
(version 1.3.9) warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec] 
 [exec] +1 core tests.  The patch passed core unit tests.
 [exec] 
 [exec] +1 contrib tests.  The patch passed contrib unit tests.
 [exec] 
 [exec] Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1076//testReport/
 [exec] Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1076//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
 [exec] Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1076//console
 [exec] 
 [exec] This message is automatically generated.
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Adding comment to Jira.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] Comment added.
 [exec] aI6vlNOI45 logged out
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Finished build.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 

BUILD SUCCESSFUL
Total time: 26 minutes 42 seconds
Archiving artifacts
Recording test results
Description set: ZOOKEEPER-1437
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (ZOOKEEPER-1437) Client uses session before SASL authentication complete

2012-05-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277016#comment-13277016
 ] 

Hadoop QA commented on ZOOKEEPER-1437:
--

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12527672/ZOOKEEPER-1437.patch
  against trunk revision 1337029.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 15 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1076//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1076//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1076//console

This message is automatically generated.

 Client uses session before SASL authentication complete
 ---

 Key: ZOOKEEPER-1437
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1437
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.3
Reporter: Thomas Weise
Assignee: Eugene Koontz
 Fix For: 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, 
 ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, 
 ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, ZOOKEEPER-1437.patch, 
 ZOOKEEPER-1437.patch


 Found issue in the context of hbase region server startup, but can be 
 reproduced w/ zkCli alone.
 getData may occur prior to SaslAuthenticated and fail with NoAuth. This is 
 not expected behavior when the client is configured to use SASL.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Possible issue with cluster availability following new Leader Election - ZK 3.4

2012-05-16 Thread Vinayak Khot
We have also encountered a problem where the newly elected leader sends the
entire snapshot to a follower even though the follower is in sync with the
leader.

A closer look at the code shows the problem is in the logic that decides
whether to send a snapshot.
The following scenario explains the problem in detail.
Start a 3-node ZooKeeper ensemble where every quorum member has seen the same
changes.
zxid: 0x40004

1. When a newly elected leader starts, it bumps up its zxid to the new
epoch.

Code snippet Leader.java

long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
zk.setZxid(ZxidUtils.makeZxid(epoch, 0));
synchronized(this){
 lastProposed = zk.getZxid();  // 0x5
}

2. Now a follower tries to join the leader with its peerLastZxid = 0x40004.

Note that the leader now has the in-memory committedLog list with
maxCommittedLog = 0x40004.

As committedLog doesn't have any new transactions with zxid > peerLastZxid,
we check if the leader and follower are in sync.

Code snippet from LearnerHandler.java
leaderLastZxid = leader.startForwarding(this, updates);
if (peerLastZxid == leaderLastZxid) {   // 0x40004 == 0x5
   // We are in sync so we'll do an empty diff
   packetToSend = Leader.DIFF;
   zxidToSend = leaderLastZxid;
}

Note that the function leader.startForwarding() returns the lastProposed zxid,
which the leader has already set to 0x5.
So in this scenario we never send an empty diff even though the leader and
follower are in sync, and we end up sending the entire snapshot in the code
that follows the check above.

A possible fix would be to keep a lastProcessedZxid in the leader which gets
updated only when the leader processes a transaction. While syncing with a
follower, if the peerLastZxid sent by the follower is the same as the leader's
lastProcessedZxid, we can send an empty diff to the follower.
This would avoid unnecessarily sending the entire snapshot when the leader and
follower are already in sync.
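
A minimal sketch of that idea, reusing the names from the snippets above (the
lastProcessedZxid accessor is an assumption, not the current code):

leaderLastZxid = leader.startForwarding(this, updates);
long lastProcessedZxid = leader.zk.getLastProcessedZxid();  // assumed accessor
if (peerLastZxid == lastProcessedZxid) {
    // leader and follower are in sync: send an empty DIFF, not a snapshot
    packetToSend = Leader.DIFF;
    zxidToSend = lastProcessedZxid;
}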

ZooKeeper developers, please share your views on the above issue.

- Vinayak

On Mon, May 14, 2012 at 8:30 AM, Camille Fournier cami...@apache.org wrote:

 Thanks.
 I just ran a couple of tests to start the debugging. Mark, I don't see
 a long cluster settle with a mostly empty data set, so I think this
 might be two different problems. I do see a lot of snapshots being
 sent though so there is probably some overaggressiveness in the way
 that we evaluate when to send snapshots that should be evaluated.
 Adding the dev mailing list, as I may need ben or flavio to take a
 look as well.

 C

 On Thu, May 10, 2012 at 10:48 AM,  alexandar.gvozdeno...@ubs.com wrote:
  Cheers - Raised https://issues.apache.org/jira/browse/ZOOKEEPER-1465
 
 
 
  -Original Message-
  From: Camille Fournier [mailto:cami...@apache.org]
  Sent: 10 May 2012 14:58
  To: u...@zookeeper.apache.org
  Subject: Re: Possible issue with cluster availability following new
 Leader Election - ZK 3.4
 
  I will take a look at this soon, have you created a Jira for it? If not
 please do so.
 
  Thanks,
  C
 
  On Thu, May 10, 2012 at 7:20 AM,  alexandar.gvozdeno...@ubs.com wrote:
  I think there may be a problem here with the 3.4 branch. I dropped the
  cluster back to 3.3.5 and the behaviour was much better.
 
  To summarize:
 
  650mb of data
  20k nodes of varied size
  3 node cluster
 
  On 3.4.x (using latest branch build)
  -
  Takes 3-4 minutes to bring up a cluster from cold
  Takes 40-50 secs to recover from a leader failure
  Takes 10 secs for a new follower to join the cluster
 
  On 3.3.5
  
  Takes 10-20 secs to bring up a cluster from cold
  Takes 10 secs to recover from a leader failure
  Takes 10 secs for a new follower to join the cluster
 
  Any views on this from the ZK devs? The differences in behaviour only
  start becoming apparent as the dataset gets bigger.
  I was hoping to use 3.4 for the transactional features it offered via
  the 'multi-update' operations, but this issue seems pretty serious...
 
 
 

[jira] [Updated] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)

2012-05-16 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1355:
-

Attachment: ZOOKEEPER-1355-ver12-4.patch

Updated to work correctly against tip of trunk. All unit tests should pass now.

 Add zk.updateServerList(newServerList) 
 ---

 Key: ZOOKEEPER-1355
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1355
 Project: ZooKeeper
  Issue Type: New Feature
  Components: c client, java client
Reporter: Alexander Shraer
Assignee: Alexander Shraer
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1355-ver10-1.patch, 
 ZOOKEEPER-1355-ver10-2.patch, ZOOKEEPER-1355-ver10-3.patch, 
 ZOOKEEPER-1355-ver10-4.patch, ZOOKEEPER-1355-ver10-4.patch, 
 ZOOKEEPER-1355-ver10.patch, ZOOKEEPER-1355-ver11-1.patch, 
 ZOOKEEPER-1355-ver11.patch, ZOOKEEPER-1355-ver12-1.patch, 
 ZOOKEEPER-1355-ver12-2.patch, ZOOKEEPER-1355-ver12-4.patch, 
 ZOOKEEPER-1355-ver12.patch, ZOOKEEPER-1355-ver2.patch, 
 ZOOKEEPER-1355-ver4.patch, ZOOKEEPER-1355-ver5.patch, 
 ZOOKEEPER-1355-ver6.patch, ZOOKEEPER-1355-ver7.patch, 
 ZOOKEEPER-1355-ver8.patch, ZOOKEEPER-1355-ver9-1.patch, 
 ZOOKEEPER-1355-ver9.patch, ZOOKEEPER=1355-ver3.patch, 
 ZOOOKEEPER-1355-test.patch, ZOOOKEEPER-1355-ver1.patch, 
 ZOOOKEEPER-1355.patch, loadbalancing-more-details.pdf, loadbalancing.pdf


 When the set of servers changes, we would like to update the server list 
 stored by clients without restarting the clients.
 Moreover, assuming that the number of clients per server is the same (in 
 expectation) in the old configuration (as guaranteed by the current list 
 shuffling for example), we would like to re-balance client connections across 
 the new set of servers in a way that a) the number of clients per server is 
 the same for all servers (in expectation) and b) there is no 
 excessive/unnecessary client migration.
 It is simple to achieve (a) without (b) - just re-shuffle the new list of 
 servers at every client. But this would create unnecessary migration, which 
 we'd like to avoid.
 We propose a simple probabilistic migration scheme that achieves (a) and (b) 
 - each client locally decides whether and where to migrate when the list of 
 servers changes. The attached document describes the scheme and shows an 
 evaluation of it in Zookeeper. We also implemented re-balancing through a 
 consistent-hashing scheme and show a comparison. We derived the probabilistic 
 migration rules from a simple formula that we can also provide, if someone's 
 interested in the proof.
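
As an illustration only (the actual scheme is specified in the attached
loadbalancing PDFs; the stay probability below is an assumption, not the
paper's rule), one simple way to get (a) and (b) when the ensemble grows from M
to N servers is to keep each client on its current server with probability M/N
and otherwise move it to one of the newly added servers chosen uniformly:
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hedged sketch for a growing ensemble (oldServers is assumed to be a subset
// of newServers); shrinking ensembles are not handled here.
class GrowOnlyRebalance {
    private final Random rnd = new Random();

    String pickServer(String current, List<String> oldServers, List<String> newServers) {
        int m = oldServers.size();
        int n = newServers.size();
        if (n <= m || rnd.nextDouble() < (double) m / n) {
            return current;                          // stay with probability M/N
        }
        List<String> added = new ArrayList<String>(newServers);
        added.removeAll(oldServers);                 // move to a newly added server
        return added.get(rnd.nextInt(added.size()));
    }
}
{code}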

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Failed: ZOOKEEPER-1355 PreCommit Build #1077

2012-05-16 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-1355
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1077/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 174 lines...]
 [exec] Hunk #4 FAILED at 437.
 [exec] Hunk #5 FAILED at 575.
 [exec] 5 out of 5 hunks FAILED -- saving rejects to file 
src/gc/java/main/org/apache/zookeeper/ZooKeeper.java.rej
 [exec] patching file 
src/gc/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java
 [exec] Hunk #1 FAILED at 211.
 [exec] 1 out of 1 hunk FAILED -- saving rejects to file 
src/gc/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java.rej
 [exec] patching file 
src/gc/java/test/org/apache/zookeeper/test/StaticHostProviderTest.java
 [exec] Hunk #1 FAILED at 29.
 [exec] Hunk #2 FAILED at 85.
 [exec] 2 out of 2 hunks FAILED -- saving rejects to file 
src/gc/java/test/org/apache/zookeeper/test/StaticHostProviderTest.java.rej
 [exec] PATCH APPLICATION FAILED
 [exec] 
 [exec] 
 [exec] 
 [exec] 
 [exec] -1 overall.  Here are the results of testing the latest attachment 
 [exec]   
http://issues.apache.org/jira/secure/attachment/12527735/ZOOKEEPER-1355-ver12-4.patch
 [exec]   against trunk revision 1337029.
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 34 new or 
modified tests.
 [exec] 
 [exec] -1 patch.  The patch command could not apply the patch.
 [exec] 
 [exec] Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1077//console
 [exec] 
 [exec] This message is automatically generated.
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Adding comment to Jira.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] Comment added.
 [exec] RqGMF8Y8I0 logged out
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Finished build.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 

BUILD FAILED
/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/build.xml:1568:
 exec returned: 1

Total time: 42 seconds
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
Description set: ZOOKEEPER-1355
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
No tests ran.

[jira] [Commented] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)

2012-05-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277238#comment-13277238
 ] 

Hadoop QA commented on ZOOKEEPER-1355:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12527735/ZOOKEEPER-1355-ver12-4.patch
  against trunk revision 1337029.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 34 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1077//console

This message is automatically generated.

 Add zk.updateServerList(newServerList) 
 ---

 Key: ZOOKEEPER-1355
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1355
 Project: ZooKeeper
  Issue Type: New Feature
  Components: c client, java client
Reporter: Alexander Shraer
Assignee: Alexander Shraer
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1355-ver10-1.patch, 
 ZOOKEEPER-1355-ver10-2.patch, ZOOKEEPER-1355-ver10-3.patch, 
 ZOOKEEPER-1355-ver10-4.patch, ZOOKEEPER-1355-ver10-4.patch, 
 ZOOKEEPER-1355-ver10.patch, ZOOKEEPER-1355-ver11-1.patch, 
 ZOOKEEPER-1355-ver11.patch, ZOOKEEPER-1355-ver12-1.patch, 
 ZOOKEEPER-1355-ver12-2.patch, ZOOKEEPER-1355-ver12-4.patch, 
 ZOOKEEPER-1355-ver12.patch, ZOOKEEPER-1355-ver2.patch, 
 ZOOKEEPER-1355-ver4.patch, ZOOKEEPER-1355-ver5.patch, 
 ZOOKEEPER-1355-ver6.patch, ZOOKEEPER-1355-ver7.patch, 
 ZOOKEEPER-1355-ver8.patch, ZOOKEEPER-1355-ver9-1.patch, 
 ZOOKEEPER-1355-ver9.patch, ZOOKEEPER=1355-ver3.patch, 
 ZOOOKEEPER-1355-test.patch, ZOOOKEEPER-1355-ver1.patch, 
 ZOOOKEEPER-1355.patch, loadbalancing-more-details.pdf, loadbalancing.pdf


 When the set of servers changes, we would like to update the server list 
 stored by clients without restarting the clients.
 Moreover, assuming that the number of clients per server is the same (in 
 expectation) in the old configuration (as guaranteed by the current list 
 shuffling for example), we would like to re-balance client connections across 
 the new set of servers in a way that a) the number of clients per server is 
 the same for all servers (in expectation) and b) there is no 
 excessive/unnecessary client migration.
 It is simple to achieve (a) without (b) - just re-shuffle the new list of 
 servers at every client. But this would create unnecessary migration, which 
 we'd like to avoid.
 We propose a simple probabilistic migration scheme that achieves (a) and (b) 
 - each client locally decides whether and where to migrate when the list of 
 servers changes. The attached document describes the scheme and shows an 
 evaluation of it in Zookeeper. We also implemented re-balancing through a 
 consistent-hashing scheme and show a comparison. We derived the probabilistic 
 migration rules from a simple formula that we can also provide, if someone's 
 interested in the proof.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)

2012-05-16 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1355:
-

Attachment: ZOOKEEPER-1355-ver13.patch

Bad patch last time, my apologies.

Bumped the version on this one to 13 to avoid confusion.

 Add zk.updateServerList(newServerList) 
 ---

 Key: ZOOKEEPER-1355
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1355
 Project: ZooKeeper
  Issue Type: New Feature
  Components: c client, java client
Reporter: Alexander Shraer
Assignee: Alexander Shraer
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1355-ver10-1.patch, 
 ZOOKEEPER-1355-ver10-2.patch, ZOOKEEPER-1355-ver10-3.patch, 
 ZOOKEEPER-1355-ver10-4.patch, ZOOKEEPER-1355-ver10-4.patch, 
 ZOOKEEPER-1355-ver10.patch, ZOOKEEPER-1355-ver11-1.patch, 
 ZOOKEEPER-1355-ver11.patch, ZOOKEEPER-1355-ver12-1.patch, 
 ZOOKEEPER-1355-ver12-2.patch, ZOOKEEPER-1355-ver12-4.patch, 
 ZOOKEEPER-1355-ver12.patch, ZOOKEEPER-1355-ver13.patch, 
 ZOOKEEPER-1355-ver2.patch, ZOOKEEPER-1355-ver4.patch, 
 ZOOKEEPER-1355-ver5.patch, ZOOKEEPER-1355-ver6.patch, 
 ZOOKEEPER-1355-ver7.patch, ZOOKEEPER-1355-ver8.patch, 
 ZOOKEEPER-1355-ver9-1.patch, ZOOKEEPER-1355-ver9.patch, 
 ZOOKEEPER=1355-ver3.patch, ZOOOKEEPER-1355-test.patch, 
 ZOOOKEEPER-1355-ver1.patch, ZOOOKEEPER-1355.patch, 
 loadbalancing-more-details.pdf, loadbalancing.pdf


 When the set of servers changes, we would like to update the server list 
 stored by clients without restarting the clients.
 Moreover, assuming that the number of clients per server is the same (in 
 expectation) in the old configuration (as guaranteed by the current list 
 shuffling for example), we would like to re-balance client connections across 
 the new set of servers in a way that a) the number of clients per server is 
 the same for all servers (in expectation) and b) there is no 
 excessive/unnecessary client migration.
 It is simple to achieve (a) without (b) - just re-shuffle the new list of 
 servers at every client. But this would create unnecessary migration, which 
 we'd like to avoid.
 We propose a simple probabilistic migration scheme that achieves (a) and (b) 
 - each client locally decides whether and where to migrate when the list of 
 servers changes. The attached document describes the scheme and shows an 
 evaluation of it in Zookeeper. We also implemented re-balancing through a 
 consistent-hashing scheme and show a comparison. We derived the probabilistic 
 migration rules from a simple formula that we can also provide, if someone's 
 interested in the proof.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)

2012-05-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277285#comment-13277285
 ] 

Hadoop QA commented on ZOOKEEPER-1355:
--

+1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12527742/ZOOKEEPER-1355-ver13.patch
  against trunk revision 1337029.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 34 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1078//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1078//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1078//console

This message is automatically generated.

 Add zk.updateServerList(newServerList) 
 ---

 Key: ZOOKEEPER-1355
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1355
 Project: ZooKeeper
  Issue Type: New Feature
  Components: c client, java client
Reporter: Alexander Shraer
Assignee: Alexander Shraer
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1355-ver10-1.patch, 
 ZOOKEEPER-1355-ver10-2.patch, ZOOKEEPER-1355-ver10-3.patch, 
 ZOOKEEPER-1355-ver10-4.patch, ZOOKEEPER-1355-ver10-4.patch, 
 ZOOKEEPER-1355-ver10.patch, ZOOKEEPER-1355-ver11-1.patch, 
 ZOOKEEPER-1355-ver11.patch, ZOOKEEPER-1355-ver12-1.patch, 
 ZOOKEEPER-1355-ver12-2.patch, ZOOKEEPER-1355-ver12-4.patch, 
 ZOOKEEPER-1355-ver12.patch, ZOOKEEPER-1355-ver13.patch, 
 ZOOKEEPER-1355-ver2.patch, ZOOKEEPER-1355-ver4.patch, 
 ZOOKEEPER-1355-ver5.patch, ZOOKEEPER-1355-ver6.patch, 
 ZOOKEEPER-1355-ver7.patch, ZOOKEEPER-1355-ver8.patch, 
 ZOOKEEPER-1355-ver9-1.patch, ZOOKEEPER-1355-ver9.patch, 
 ZOOKEEPER=1355-ver3.patch, ZOOOKEEPER-1355-test.patch, 
 ZOOOKEEPER-1355-ver1.patch, ZOOOKEEPER-1355.patch, 
 loadbalancing-more-details.pdf, loadbalancing.pdf


 When the set of servers changes, we would like to update the server list 
 stored by clients without restarting the clients.
 Moreover, assuming that the number of clients per server is the same (in 
 expectation) in the old configuration (as guaranteed by the current list 
 shuffling for example), we would like to re-balance client connections across 
 the new set of servers in a way that a) the number of clients per server is 
 the same for all servers (in expectation) and b) there is no 
 excessive/unnecessary client migration.
 It is simple to achieve (a) without (b) - just re-shuffle the new list of 
 servers at every client. But this would create unnecessary migration, which 
 we'd like to avoid.
 We propose a simple probabilistic migration scheme that achieves (a) and (b) 
 - each client locally decides whether and where to migrate when the list of 
 servers changes. The attached document describes the scheme and shows an 
 evaluation of it in Zookeeper. We also implemented re-balancing through a 
 consistent-hashing scheme and show a comparison. We derived the probabilistic 
 migration rules from a simple formula that we can also provide, if someone's 
 interested in the proof.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Possible issue with cluster availability following new Leader Election - ZK 3.4

2012-05-16 Thread Camille Fournier
This pretty much matches what I expect. It would be great if you
wanted to try your hand at creating a patch and submitting it to the
ticket that was created for this problem, but if not, please post this
analysis to issue 1465 and we'll look at it ASAP.

C

On Wed, May 16, 2012 at 2:55 PM, Vinayak Khot vina...@nutanix.com wrote:
 We also have encountered a problem where the newly elected leader
 sends entire
 snapshot to a follower even though the follower is in sync with the leader.

 A closer look at the code shows the problem in the logic where we decide to
 send
 a snapshot.
 Following scenario explains the problem in details.
 Start a 3 node Zookeeper ensemble where every quorum member has seen same
 changes.
 zxid: 0x40004

 1. When a newly elected leader starts, it bumps up its zxid to the new
 epoch.

 Code snippet Leader.java

 long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
 zk.setZxid(ZxidUtils.makeZxid(epoch, 0));
 synchronized(this){
     lastProposed = zk.getZxid();  // 0x5
 }

 2. Now a follower tries to join the leader with its peerLastZxid = 0x40004

 Note that now the leader has the in-memory committedLog list with
 maxCommittedLog = 0x40004.

 As committedLog doesn't have any new transactions with zxid > peerLastZxid,
 we check if
 the leader and follower are in sync.

 Code snippet from LearnerHandler.java
 leaderLastZxid = leader.startForwarding(this, updates);
 if (peerLastZxid == leaderLastZxid) {   // 0x40004 == 0x5
   // We are in sync so we'll do an empty diff
   packetToSend = Leader.DIFF;
   zxidToSend = leaderLastZxid;
 }

 Note that the function leader.startForwarding() returns the lastProposed zxid
 which is already set to
 0x5 by the leader.
 So in this scenario we never send empty diff even though the leader and
 follower are in sync,
 and we end up sending entire snapshot in the code that follows above check.

 A possible fix would be to keep *lastProcessedZxid* in the leader which
 will get updated only when
 the leader processes a transaction. While syncing with a follower, if the
 peerLastZxid sent by a follower
 is same as lastProcessedZxid of the leader we can send empty diff to the
 follower.
 This shall avoid unnecessarily sending entire snapshot when the leader and
 follower are already in sync.

 Zookeeper developers please share your views on above mentioned issue.

 - Vinayak

 On Mon, May 14, 2012 at 8:30 AM, Camille Fournier cami...@apache.org wrote:

 Thanks.
 I just ran a couple of tests to start the debugging. Mark, I don't see
 a long cluster settle with a mostly empty data set, so I think this
 might be two different problems. I do see a lot of snapshots being
 sent though so there is probably some overaggressiveness in the way
 that we evaluate when to send snapshots that should be evaluated.
 Adding the dev mailing list, as I may need ben or flavio to take a
 look as well.

 C

 On Thu, May 10, 2012 at 10:48 AM,  alexandar.gvozdeno...@ubs.com wrote:
  Cheers - Raised https://issues.apache.org/jira/browse/ZOOKEEPER-1465
 
 
 
  -Original Message-
  From: Camille Fournier [mailto:cami...@apache.org]
  Sent: 10 May 2012 14:58
  To: u...@zookeeper.apache.org
  Subject: Re: Possible issue with cluster availability following new
 Leader Election - ZK 3.4
 
  I will take a look at this soon, have you created a Jira for it? If not
 please do so.
 
  Thanks,
  C
 
  On Thu, May 10, 2012 at 7:20 AM,  alexandar.gvozdeno...@ubs.com wrote:
  I think there may be a problem here with the 3.4 branch. I dropped the
  cluster back to 3.3.5 and the behaviour was much better.
 
  To summarize:
 
  650mb of data
  20k nodes of varied size
  3 node cluster
 
  On 3.4.x (using latest branch build)
  -
  Takes 3-4 minutes to bring up a cluster from cold
  Takes 40-50 secs to recover from a leader failure
  Takes 10 secs for a new follower to join the cluster
 
  On 3.3.5
  
  Takes 10-20 secs to bring up a cluster from cold
  Takes 10 secs to recover from a leader failure
  Takes 10 secs for a new follower to join the cluster
 
  Any views on this from the ZK devs? The differences in behaviour only
  start becoming apparent as the dataset gets bigger.
  I was hoping to use 3.4 for the transactional features it offered via
  the 'multi-update' operations, but this issue seems pretty serious...
 
 
 

[jira] [Commented] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)

2012-05-16 Thread Eugene Koontz (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277532#comment-13277532
 ] 

Eugene Koontz commented on ZOOKEEPER-1355:
--

Typo in src/docs/src/documentation/content/xdocs/zookeeperProgrammers.xml: 
"jus" should be "just".

The documentation is good - perhaps it could reference where the client's 
server-selection logic is implemented in the source code 
(StaticHostProvider::updateServerList()).

i.e. provide a link to the source such as 
http://svn.apache.org/viewvc/zookeeper/trunk/src/java/main/org/apache/zookeeper/client/StaticHostProvider.java?view=markup
 or 
https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/client/StaticHostProvider.java

The test cases seem to cover a lot more scenarios than the documentation; it 
would be nice to have the doc's examples correspond to the test cases.

 Add zk.updateServerList(newServerList) 
 ---

 Key: ZOOKEEPER-1355
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1355
 Project: ZooKeeper
  Issue Type: New Feature
  Components: c client, java client
Reporter: Alexander Shraer
Assignee: Alexander Shraer
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1355-ver10-1.patch, 
 ZOOKEEPER-1355-ver10-2.patch, ZOOKEEPER-1355-ver10-3.patch, 
 ZOOKEEPER-1355-ver10-4.patch, ZOOKEEPER-1355-ver10-4.patch, 
 ZOOKEEPER-1355-ver10.patch, ZOOKEEPER-1355-ver11-1.patch, 
 ZOOKEEPER-1355-ver11.patch, ZOOKEEPER-1355-ver12-1.patch, 
 ZOOKEEPER-1355-ver12-2.patch, ZOOKEEPER-1355-ver12-4.patch, 
 ZOOKEEPER-1355-ver12.patch, ZOOKEEPER-1355-ver13.patch, 
 ZOOKEEPER-1355-ver2.patch, ZOOKEEPER-1355-ver4.patch, 
 ZOOKEEPER-1355-ver5.patch, ZOOKEEPER-1355-ver6.patch, 
 ZOOKEEPER-1355-ver7.patch, ZOOKEEPER-1355-ver8.patch, 
 ZOOKEEPER-1355-ver9-1.patch, ZOOKEEPER-1355-ver9.patch, 
 ZOOKEEPER=1355-ver3.patch, ZOOOKEEPER-1355-test.patch, 
 ZOOOKEEPER-1355-ver1.patch, ZOOOKEEPER-1355.patch, 
 loadbalancing-more-details.pdf, loadbalancing.pdf


 When the set of servers changes, we would like to update the server list 
 stored by clients without restarting the clients.
 Moreover, assuming that the number of clients per server is the same (in 
 expectation) in the old configuration (as guaranteed by the current list 
 shuffling for example), we would like to re-balance client connections across 
 the new set of servers in a way that a) the number of clients per server is 
 the same for all servers (in expectation) and b) there is no 
 excessive/unnecessary client migration.
 It is simple to achieve (a) without (b) - just re-shuffle the new list of 
 servers at every client. But this would create unnecessary migration, which 
 we'd like to avoid.
 We propose a simple probabilistic migration scheme that achieves (a) and (b) 
 - each client locally decides whether and where to migrate when the list of 
 servers changes. The attached document describes the scheme and shows an 
 evaluation of it in Zookeeper. We also implemented re-balancing through a 
 consistent-hashing scheme and show a comparison. We derived the probabilistic 
 migration rules from a simple formula that we can also provide, if someone's 
 interested in the proof.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (BOOKKEEPER-262) Implement a meta store based hedwig metadata manager.

2012-05-16 Thread Sijie Guo (JIRA)
Sijie Guo created BOOKKEEPER-262:


 Summary: Implement a meta store based hedwig metadata manager.
 Key: BOOKKEEPER-262
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-262
 Project: Bookkeeper
  Issue Type: Sub-task
  Components: hedwig-server
Reporter: Sijie Guo
 Fix For: 4.2.0


We provided a metadata manager interface in BOOKKEEPER-250 and BOOKKEEPER-259. 
We need a metadata manager implementation that uses the meta store API.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (BOOKKEEPER-253) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-16 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276580#comment-13276580
 ] 

Ivan Kelly commented on BOOKKEEPER-253:
---

Ah yes, I had misunderstood the problem. I think the write permission node 
will work, but it needs a small modification to ensure that, in the time period 
between deleting and acquiring the write permission and creating and using the 
ledger, another node doesn't come in and do the same. I think it should work as 
follows.

There is one znode, the write permission znode, /journal/writeLock
When a node wants to start writing, it must read the znode to see what the 
current inprogress_znode is. At this point it saves the version of the 
writeLock znode. It then recovers the inprogress_znode, which will fence the 
ledger it is using. It creates its own ledger, and then writes the new 
inprogress_znode to writeLock, using the version it previously saved.
If another node has tried to start writing before this, the version will have 
changed, so the write will fail. 
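
A minimal sketch of that versioned update using the plain ZooKeeper API (the
path and the recovery step are placeholders, not BKJM's actual code):
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class WritePermission {
    private static final String WRITE_LOCK = "/journal/writeLock";
    private final ZooKeeper zk;

    WritePermission(ZooKeeper zk) { this.zk = zk; }

    void takeOver(byte[] newInprogressZnode) throws Exception {
        Stat stat = new Stat();
        // Read the current inprogress znode and remember the writeLock version.
        byte[] currentInprogress = zk.getData(WRITE_LOCK, false, stat);
        recoverAndFence(currentInprogress);   // fence the old ledger (elided)
        try {
            // Fails with BadVersionException if another node updated writeLock first.
            zk.setData(WRITE_LOCK, newInprogressZnode, stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            throw new IllegalStateException("another node started writing first", e);
        }
    }

    private void recoverAndFence(byte[] inprogressZnode) { /* recovery elided */ }
}
{code}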

 BKJM:Switch from standby to active fails and NN gets shut down due to delay 
 in clearing of lock
 ---

 Key: BOOKKEEPER-253
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-253
 Project: Bookkeeper
  Issue Type: Bug
  Components: bookkeeper-client
Reporter: suja s
Assignee: Uma Maheswara Rao G
Priority: Blocker

 Normal switch fails. 
 (BKjournalManager zk session timeout is 3000 and ZKFC session timeout is 
 5000. By the time control comes to acquire lock the previous lock is not 
 released which leads to failure in lock acquisition by NN and NN gets 
 shutdown. Ideally it should have been done)
 =
 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
 Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
 already has it
 2012-05-09 20:15:29,732 FATAL 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
 recoverUnfinalizedSegments failed for required journal 
 (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
  stream=null))
 java.io.IOException: Could not acquire lock
 at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
 at 
 org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
 at 
 org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
 at 
 org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
 SHUTDOWN_MSG: 
 /
 SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
 Scenario:
 Start ZKFCS, NNs
 NN1 is active and NN2 is standby
 Stop NN1. NN2 tries to transition to active and gets 

[jira] [Updated] (BOOKKEEPER-258) CompactionTest failed

2012-05-16 Thread Sijie Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/BOOKKEEPER-258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sijie Guo updated BOOKKEEPER-258:
-

Attachment: BOOKKEEPER-258.diff

I set readTimeout to a large value to disable it during testing, and I ran 
"while [ $? = 0 ]; do mvn test -Dtest=CompactionTest > compaction.log; done" 
for several hours; it doesn't reproduce the issue.

 CompactionTest failed
 -

 Key: BOOKKEEPER-258
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-258
 Project: Bookkeeper
  Issue Type: Bug
  Components: bookkeeper-server
Reporter: Flavio Junqueira
Assignee: Sijie Guo
Priority: Blocker
 Fix For: 4.1.0

 Attachments: BOOKKEEPER-258.diff


 {noformat}
 ---
 Test set: org.apache.bookkeeper.bookie.CompactionTest
 ---
 Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 32.557 sec 
  FAILURE!
 testCompactionSmallEntryLogs(org.apache.bookkeeper.bookie.CompactionTest)  
 Time elapsed: 6.507 sec   ERROR!
 org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException
 at 
 org.apache.bookkeeper.client.BKException.create(BKException.java:62)
 at 
 org.apache.bookkeeper.client.LedgerHandle.readEntries(LedgerHandle.java:347)
 at 
 org.apache.bookkeeper.bookie.CompactionTest.verifyLedger(CompactionTest.java:128)
 at 
 org.apache.bookkeeper.bookie.CompactionTest.testCompactionSmallEntryLogs(CompactionTest.java:317)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at junit.framework.TestCase.runTest(TestCase.java:168)
 at junit.framework.TestCase.runBare(TestCase.java:134)
 at junit.framework.TestResult$1.protect(TestResult.java:110)
 at junit.framework.TestResult.runProtected(TestResult.java:128)
 at junit.framework.TestResult.run(TestResult.java:113)
 at junit.framework.TestCase.run(TestCase.java:124)
 at junit.framework.TestSuite.runTest(TestSuite.java:232)
 at junit.framework.TestSuite.run(TestSuite.java:227)
 at 
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
 at 
 org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53)
 at 
 org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123)
 at 
 org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at 
 org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
 at 
 org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
 at 
 org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:172)
 at 
 org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:78)
 at 
 org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:70)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (BOOKKEEPER-258) CompactionTest failed

2012-05-16 Thread Sijie Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276607#comment-13276607
 ] 

Sijie Guo commented on BOOKKEEPER-258:
--

@Ivan

To explain why readTimeout causes this issue, we have to clarify two things: 
1) how readTimeout works, and 2) how the test runs.

For the first, per the Netty documentation 
(http://docs.jboss.org/netty/3.1/api/org/jboss/netty/handler/timeout/ReadTimeoutHandler.html),
the timeout fires when no data has been read within a certain period of time.

For the second, CompactionTest#testCompactionSmallEntryLogs runs as below:
1) Add several messages to BookKeeper (so a connection is established to the 
bookie server).
2) Delete the ledgers and sleep to wait for GC. The sleep interval is 
{MajorCompactionInterval + GcWaitTime}, which is 5 seconds and equals the 
ReadTimeout (also 5 seconds). So for 5 seconds there is no activity, and the 
channel may time out after the sleep interval.
3) Read the entries back to verify them.
{code}

client.connectIfNeededAndDoOp(new GenericCallback<Void>() {
@Override
public void operationComplete(int rc, Void result) {

if (rc != BKException.Code.OK) {
cb.readEntryComplete(rc, ledgerId, entryId, null, ctx);
return;
}   
client.readEntry(ledgerId, entryId, cb, ctx);
}   
}); 
{code}

As the code above indicates, the channel is only checked when calling 
client.connectIfNeededAndDoOp. If the channel is not marked as disconnected, 
client.readEntry is called to send the requests. If the channel timeout fires 
after client.readEntry has put the completion keys in the pending-completion 
queue (no data has been read from the channel, and the elapsed time has reached 
5 seconds because of the sleep), all of those requests are errored out.

With continuous traffic this is fine, because there is always data to read on 
the channel. But if no traffic arrives within a readTimeout interval, there is 
nothing to read for that interval and the channel is closed due to the read 
timeout.

Moreover, the timeout callback is triggered by Netty, so we have no control 
over when it fires. It is therefore difficult to guarantee that 
readEntry/addEntry operations execute atomically before or after the timeout 
callback.
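
For reference, a minimal sketch of how such a read timeout is typically
installed with Netty 3 (handler and variable names are illustrative, not
BookKeeper's actual pipeline; the five seconds mirrors the test's ReadTimeout):
{code}
import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.Channels;
import org.jboss.netty.handler.timeout.ReadTimeoutHandler;
import org.jboss.netty.util.HashedWheelTimer;

// A ReadTimeoutHandler closes the channel (via ReadTimeoutException) when
// nothing is read for readTimeoutSeconds, regardless of any pending requests.
HashedWheelTimer timer = new HashedWheelTimer();
int readTimeoutSeconds = 5;   // same length as the test's sleep interval
ChannelPipeline pipeline = Channels.pipeline();
pipeline.addLast("readTimeout", new ReadTimeoutHandler(timer, readTimeoutSeconds));
// ... codec and request handlers would follow here ...
{code}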


 CompactionTest failed
 -

 Key: BOOKKEEPER-258
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-258
 Project: Bookkeeper
  Issue Type: Bug
  Components: bookkeeper-server
Reporter: Flavio Junqueira
Assignee: Sijie Guo
Priority: Blocker
 Fix For: 4.1.0

 Attachments: BOOKKEEPER-258.diff


 {noformat}
 ---
 Test set: org.apache.bookkeeper.bookie.CompactionTest
 ---
 Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 32.557 sec 
  FAILURE!
 testCompactionSmallEntryLogs(org.apache.bookkeeper.bookie.CompactionTest)  
 Time elapsed: 6.507 sec   ERROR!
 org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException
 at 
 org.apache.bookkeeper.client.BKException.create(BKException.java:62)
 at 
 org.apache.bookkeeper.client.LedgerHandle.readEntries(LedgerHandle.java:347)
 at 
 org.apache.bookkeeper.bookie.CompactionTest.verifyLedger(CompactionTest.java:128)
 at 
 org.apache.bookkeeper.bookie.CompactionTest.testCompactionSmallEntryLogs(CompactionTest.java:317)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at junit.framework.TestCase.runTest(TestCase.java:168)
 at junit.framework.TestCase.runBare(TestCase.java:134)
 at junit.framework.TestResult$1.protect(TestResult.java:110)
 at junit.framework.TestResult.runProtected(TestResult.java:128)
 at junit.framework.TestResult.run(TestResult.java:113)
 at junit.framework.TestCase.run(TestCase.java:124)
 at junit.framework.TestSuite.runTest(TestSuite.java:232)
 at junit.framework.TestSuite.run(TestSuite.java:227)
 at 
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
 at 
 org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53)
 at 
 org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123)
 at 
 

[jira] [Commented] (BOOKKEEPER-253) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-16 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276620#comment-13276620
 ] 

Rakesh R commented on BOOKKEEPER-253:
-

@Ivan

bq. but it needs a small modification to ensure that in the time period between 
deleting and acquiring the write permission and creating and using the ledger, 
another node doesn't come in and do the same

Hope you are pointing to the window gap between the 'delete & create' operations 
and the chance of a race condition there.

Can we use the ZooKeeper MultiTransactionRecord API, e.g.:
ops.add(Op.delete(...));
ops.add(Op.create(...));
zk.multi(ops);

I feel this would resolve the race condition (a self-contained sketch follows 
below). What's your opinion?

Also, I didn't fully understand the versioning concept you are proposing.
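
For illustration, a self-contained version of the multi() idea (the paths,
data, and ACL below are placeholders, not BKJM's actual values):
{code}
import java.util.Arrays;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class LockSwap {
    // Delete the old lock znode and create the new one in a single atomic
    // multi(): if either op fails, neither change is applied.
    void swapLock(ZooKeeper zk, String oldLock, String newLock, byte[] data)
            throws Exception {
        zk.multi(Arrays.asList(
                Op.delete(oldLock, -1),   // -1 matches any version
                Op.create(newLock, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL)));
    }
}
{code}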


 BKJM:Switch from standby to active fails and NN gets shut down due to delay 
 in clearing of lock
 ---

 Key: BOOKKEEPER-253
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-253
 Project: Bookkeeper
  Issue Type: Bug
  Components: bookkeeper-client
Reporter: suja s
Assignee: Uma Maheswara Rao G
Priority: Blocker

 Normal switch fails. 
 (BKjournalManager zk session timeout is 3000 and ZKFC session timeout is 
 5000. By the time control comes to acquire lock the previous lock is not 
 released which leads to failure in lock acquisition by NN and NN gets 
 shutdown. Ideally it should have been done)
 =
 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
 Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
 already has it
 2012-05-09 20:15:29,732 FATAL 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
 recoverUnfinalizedSegments failed for required journal 
 (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
  stream=null))
 java.io.IOException: Could not acquire lock
 at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
 at 
 org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
 at 
 org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
 at 
 org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
 SHUTDOWN_MSG: 
 /
 SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
 Scenario:
 Start ZKFCS, NNs
 NN1 is active and NN2 is standby
 Stop NN1. NN2 tries to transition to active and gets shut down

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (BOOKKEEPER-253) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-16 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276635#comment-13276635
 ] 

Rakesh R commented on BOOKKEEPER-253:
-

@Ivan,

Oh, you meant that recovering the inprogress_znode will release the write 
permission and startLogSegment will again try to acquire the write permission. 
In that case, we could not go with the multi() option, since these are two 
different calls. I also feel that logic based on the znode version would work.
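
A rough sketch of what znode-version-based logic could look like (the path and 
payload names here are only illustrative, not the BKJM implementation):
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class VersionedUpdateSketch {
    // Read the inprogress znode, remember its version, and write back only if
    // nobody has modified it in between; a concurrent writer makes the
    // conditional setData fail with BADVERSION instead of silently racing.
    static boolean tryUpdate(ZooKeeper zk, String inprogressPath, byte[] newData)
            throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        zk.getData(inprogressPath, false, stat);   // capture the current version
        try {
            zk.setData(inprogressPath, newData, stat.getVersion());
            return true;                           // we won the race
        } catch (KeeperException.BadVersionException e) {
            return false;                          // someone else updated it first
        }
    }
}
{code}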

 BKJM:Switch from standby to active fails and NN gets shut down due to delay 
 in clearing of lock
 ---

 Key: BOOKKEEPER-253
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-253
 Project: Bookkeeper
  Issue Type: Bug
  Components: bookkeeper-client
Reporter: suja s
Assignee: Uma Maheswara Rao G
Priority: Blocker

 Normal switch fails. 
 (BKjournalManager zk session timeout is 3000 and ZKFC session timeout is 
 5000. By the time control comes to acquire lock the previous lock is not 
 released which leads to failure in lock acquisition by NN and NN gets 
 shutdown. Ideally it should have been done)
 =
 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
 Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
 already has it
 2012-05-09 20:15:29,732 FATAL 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
 recoverUnfinalizedSegments failed for required journal 
 (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
  stream=null))
 java.io.IOException: Could not acquire lock
 at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
 at 
 org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
 at 
 org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
 at 
 org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
 SHUTDOWN_MSG: 
 /
 SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
 Scenario:
 Start ZKFCS, NNs
 NN1 is active and NN2 is standby
 Stop NN1. NN2 tries to transition to active and gets shut down

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (BOOKKEEPER-253) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-16 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276648#comment-13276648
 ] 

Ivan Kelly commented on BOOKKEEPER-253:
---

Yes, that's exactly what I mean. I've been trying to formulate a possible race 
for this for the last few hours, but I haven't been able to. Once I come up 
with one, I'll post it here. 

 BKJM:Switch from standby to active fails and NN gets shut down due to delay 
 in clearing of lock
 ---

 Key: BOOKKEEPER-253
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-253
 Project: Bookkeeper
  Issue Type: Bug
  Components: bookkeeper-client
Reporter: suja s
Assignee: Uma Maheswara Rao G
Priority: Blocker

 Normal switch fails. 
 (BKjournalManager zk session timeout is 3000 and ZKFC session timeout is 
 5000. By the time control comes to acquire lock the previous lock is not 
 released which leads to failure in lock acquisition by NN and NN gets 
 shutdown. Ideally it should have been done)
 =
 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
 Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
 already has it
 2012-05-09 20:15:29,732 FATAL 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
 recoverUnfinalizedSegments failed for required journal 
 (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
  stream=null))
 java.io.IOException: Could not acquire lock
 at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
 at 
 org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
 at 
 org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
 at 
 org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
 SHUTDOWN_MSG: 
 /
 SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
 Scenario:
 Start ZKFCS, NNs
 NN1 is active and NN2 is standby
 Stop NN1. NN2 tries to transition to active and gets shut down

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (BOOKKEEPER-253) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-16 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276651#comment-13276651
 ] 

Ivan Kelly commented on BOOKKEEPER-253:
---

If the race doesn't exist, it would be possible to simply 'lock' using the 
inprogress znode.
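
If that turns out to be safe, the 'lock' could be nothing more than the create 
of the inprogress znode itself; a toy sketch under that assumption (the names 
are illustrative, not BKJM code):
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class InprogressLockSketch {
    // Whoever creates the inprogress znode first is the writer; everyone else
    // gets NodeExistsException and must recover the existing segment instead.
    static boolean tryStartSegment(ZooKeeper zk, String inprogressPath, byte[] segmentData)
            throws KeeperException, InterruptedException {
        try {
            zk.create(inprogressPath, segmentData, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            return true;   // we own the in-progress segment
        } catch (KeeperException.NodeExistsException e) {
            return false;  // another NN is (or was) writing; run recovery first
        }
    }
}
{code}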

 BKJM:Switch from standby to active fails and NN gets shut down due to delay 
 in clearing of lock
 ---

 Key: BOOKKEEPER-253
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-253
 Project: Bookkeeper
  Issue Type: Bug
  Components: bookkeeper-client
Reporter: suja s
Assignee: Uma Maheswara Rao G
Priority: Blocker

 Normal switch fails. 
 (BKjournalManager zk session timeout is 3000 and ZKFC session timeout is 
 5000. By the time control comes to acquire lock the previous lock is not 
 released which leads to failure in lock acquisition by NN and NN gets 
 shutdown. Ideally it should have been done)
 =
 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
 Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
 already has it
 2012-05-09 20:15:29,732 FATAL 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
 recoverUnfinalizedSegments failed for required journal 
 (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
  stream=null))
 java.io.IOException: Could not acquire lock
 at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
 at 
 org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
 at 
 org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
 at 
 org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
 SHUTDOWN_MSG: 
 /
 SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
 Scenario:
 Start ZKFCS, NNs
 NN1 is active and NN2 is standby
 Stop NN1. NN2 tries to transition to active and gets shut down

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (BOOKKEEPER-253) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-16 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276715#comment-13276715
 ] 

Ivan Kelly commented on BOOKKEEPER-253:
---

@Uma
This is what I was suggesting.

 BKJM:Switch from standby to active fails and NN gets shut down due to delay 
 in clearing of lock
 ---

 Key: BOOKKEEPER-253
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-253
 Project: Bookkeeper
  Issue Type: Bug
  Components: bookkeeper-client
Reporter: suja s
Assignee: Uma Maheswara Rao G
Priority: Blocker

 Normal switch fails. 
 (BKjournalManager zk session timeout is 3000 and ZKFC session timeout is 
 5000. By the time control comes to acquire lock the previous lock is not 
 released which leads to failure in lock acquisition by NN and NN gets 
 shutdown. Ideally it should have been done)
 =
 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
 Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
 already has it
 2012-05-09 20:15:29,732 FATAL 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
 recoverUnfinalizedSegments failed for required journal 
 (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
  stream=null))
 java.io.IOException: Could not acquire lock
 at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
 at 
 org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
 at 
 org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
 at 
 org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
 SHUTDOWN_MSG: 
 /
 SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
 Scenario:
 Start ZKFCS, NNs
 NN1 is active and NN2 is standby
 Stop NN1. NN2 tries to transition to active and gets shut down

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (BOOKKEEPER-237) Automatic recovery of under-replicated ledgers and its entries

2012-05-16 Thread Rakesh R (JIRA)

 [ 
https://issues.apache.org/jira/browse/BOOKKEEPER-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh R updated BOOKKEEPER-237:


Attachment: Auto Recovery Detection - distributed chain approach.doc

bq. I'm getting to realize that the main difference between what you're 
proposing and my half-baked proposal is that I'm trying to get rid of master 
accountant election and have each bookie individually figuring out what it has 
to replicate in the case of a crash. I believe that's the key difference. 
bq. Also, should design multiple groups and pointers to withstand multiple 
crashes. Instead can we make it simple by choosing one guy for monitoring?

I'm just attaching (Auto Recovery Detection - distributed chain approach.doc) my 
thoughts on how the chaining-based distributed approach works. Hope you are also 
thinking about a similar approach. Please review.
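
Only as a very rough guess at the shape of a chaining approach (the attached doc 
is authoritative; this just sketches the idea of each bookie watching a single 
neighbour instead of electing one auditor, with made-up names):
{code}
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class BookieChainWatcherSketch {
    // Each bookie watches only the bookie just before it in the sorted list of
    // registered znodes, so a crash is noticed by exactly one peer rather than
    // by a single elected auditor.
    static void watchPredecessor(ZooKeeper zk, String availablePath, String myId)
            throws KeeperException, InterruptedException {
        List<String> bookies = zk.getChildren(availablePath, false);
        Collections.sort(bookies);
        int me = bookies.indexOf(myId);
        if (me <= 0) {
            return;  // first in the chain (or not yet registered): nothing to watch
        }
        String predecessor = availablePath + "/" + bookies.get(me - 1);
        zk.exists(predecessor, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getType() == Event.EventType.NodeDeleted) {
                    // predecessor disappeared: trigger re-replication of its ledgers here
                }
            }
        });
    }
}
{code}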


 Automatic recovery of under-replicated ledgers and its entries
 --

 Key: BOOKKEEPER-237
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-237
 Project: Bookkeeper
  Issue Type: New Feature
  Components: bookkeeper-client, bookkeeper-server
Affects Versions: 4.0.0
Reporter: Rakesh R
Assignee: Rakesh R
 Attachments: Auto Recovery Detection - distributed chain 
 approach.doc, Auto Recovery and Bookie sync-ups.pdf


 As per the current design of BookKeeper, if one of the BookKeeper server 
 dies, there is no automatic mechanism to identify and recover the under 
 replicated ledgers and its corresponding entries. This would lead to losing 
 the successfully written entries, which will be a critical problem in 
 sensitive systems. This document is trying to describe few proposals to 
 overcome these limitations. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (BOOKKEEPER-146) TestConcurrentTopicAcquisition sometimes hangs

2012-05-16 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/BOOKKEEPER-146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated BOOKKEEPER-146:
--

Attachment: BOOKKEEPER-146.diff

It's been running in a loop for 30 minutes now and doesn't seem to be hanging. 
The main problem was that even after the Hedwig client was closed, a subscription 
request could succeed and add a channel to the channel list, even though the 
client had already passed the point at which it closes its channels.
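
Not the patch itself, but a minimal sketch of the kind of guard this implies 
(class and field names here are made up):
{code}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

import org.jboss.netty.channel.Channel;

public class ChannelRegistrySketch {
    private final Object closeLock = new Object();
    private boolean closed = false;
    private final List<Channel> channels = new CopyOnWriteArrayList<Channel>();

    // A late-arriving connect/subscribe callback must not slip a channel into
    // the list after close() has already torn the existing ones down.
    boolean register(Channel ch) {
        synchronized (closeLock) {
            if (closed) {
                ch.close();   // client already closed: drop the channel immediately
                return false;
            }
            channels.add(ch);
            return true;
        }
    }

    void close() {
        synchronized (closeLock) {
            if (closed) {
                return;
            }
            closed = true;    // flip the flag first so no new channel can be registered
        }
        for (Channel ch : channels) {
            ch.close();
        }
    }
}
{code}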

 TestConcurrentTopicAcquisition sometimes hangs
 --

 Key: BOOKKEEPER-146
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-146
 Project: Bookkeeper
  Issue Type: Bug
Reporter: Ivan Kelly
Assignee: Sijie Guo
Priority: Blocker
 Fix For: 4.1.0

 Attachments: BOOKKEEPER-146.diff


 to repro
 {code}
 while [ $? = 0 ]; do mvn test -Dtest=TestConcurrentTopicAcquisition; done
 {code}
 The stacktrace where it hangs looks very like BOOKKEEPER-5
 {code}
 main prio=5 tid=102801000 nid=0x100601000 waiting on condition [1005ff000]
java.lang.Thread.State: TIMED_WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
   - parking to wait for  7bd8e1090 (a 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
   at 
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
   at 
 java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1253)
   at 
 org.jboss.netty.util.internal.ExecutorUtil.terminate(ExecutorUtil.java:107)
   at 
 org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.releaseExternalResources(NioClientSocketChannelFactory.java:143)
   at 
 org.apache.hedwig.client.netty.HedwigClientImpl.close(HedwigClientImpl.java:234)
   at org.apache.hedwig.client.HedwigClient.close(HedwigClient.java:70)
   at 
 org.apache.hedwig.server.topics.TestConcurrentTopicAcquisition.tearDown(TestConcurrentTopicAcquisition.java:99)
   at junit.framework.TestCase.runBare(TestCase.java:140)
   at junit.framework.TestResult$1.protect(TestResult.java:110)
   at junit.framework.TestResult.runProtected(TestResult.java:128)
   at junit.framework.TestResult.run(TestResult.java:113)
   at junit.framework.TestCase.run(TestCase.java:124)
   at junit.framework.TestSuite.runTest(TestSuite.java:232)
   at junit.framework.TestSuite.run(TestSuite.java:227)
   at 
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Review Request: BOOKKEEPER-146 TestConcurrentTopicAcquisition sometimes hangs

2012-05-16 Thread Ivan Kelly

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5144/
---

Review request for bookkeeper.


Summary
---

It's been running in a loop for 30 minutes now and doesn't seem to be hanging. 
The main problem was that even after the Hedwig client was closed, a subscription 
request could succeed and add a channel to the channel list, even though the 
client had already passed the point at which it closes its channels.


This addresses bug BOOKKEEPER-146.
https://issues.apache.org/jira/browse/BOOKKEEPER-146


Diffs
-

  
hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigSubscriber.java
 0c8634c 
  hedwig-client/src/main/java/org/apache/hedwig/client/netty/WriteCallback.java 
a8552f4 
  
hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigPublisher.java 
603766c 
  
hedwig-client/src/main/java/org/apache/hedwig/client/netty/ConnectCallback.java 
f5077b0 
  
hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigClientImpl.java
 806cdef 

Diff: https://reviews.apache.org/r/5144/diff


Testing
---


Thanks,

Ivan



[jira] [Commented] (BOOKKEEPER-146) TestConcurrentTopicAcquisition sometimes hangs

2012-05-16 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276825#comment-13276825
 ] 

jirapos...@reviews.apache.org commented on BOOKKEEPER-146:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5144/
---

Review request for bookkeeper.


Summary
---

It's been running in a loop for 30 minutes now and doesn't seem to be hanging. 
The main problem was that even after the Hedwig client was closed, a subscription 
request could succeed and add a channel to the channel list, even though the 
client had already passed the point at which it closes its channels.


This addresses bug BOOKKEEPER-146.
https://issues.apache.org/jira/browse/BOOKKEEPER-146


Diffs
-

  
hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigSubscriber.java
 0c8634c 
  hedwig-client/src/main/java/org/apache/hedwig/client/netty/WriteCallback.java 
a8552f4 
  
hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigPublisher.java 
603766c 
  
hedwig-client/src/main/java/org/apache/hedwig/client/netty/ConnectCallback.java 
f5077b0 
  
hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigClientImpl.java
 806cdef 

Diff: https://reviews.apache.org/r/5144/diff


Testing
---


Thanks,

Ivan



 TestConcurrentTopicAcquisition sometimes hangs
 --

 Key: BOOKKEEPER-146
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-146
 Project: Bookkeeper
  Issue Type: Bug
Reporter: Ivan Kelly
Assignee: Ivan Kelly
Priority: Blocker
 Fix For: 4.1.0

 Attachments: BOOKKEEPER-146.diff


 to repro
 {code}
 while [ $? = 0 ]; do mvn test -Dtest=TestConcurrentTopicAcquisition; done
 {code}
 The stacktrace where it hangs looks very like BOOKKEEPER-5
 {code}
 main prio=5 tid=102801000 nid=0x100601000 waiting on condition [1005ff000]
java.lang.Thread.State: TIMED_WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
   - parking to wait for  7bd8e1090 (a 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
   at 
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
   at 
 java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1253)
   at 
 org.jboss.netty.util.internal.ExecutorUtil.terminate(ExecutorUtil.java:107)
   at 
 org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.releaseExternalResources(NioClientSocketChannelFactory.java:143)
   at 
 org.apache.hedwig.client.netty.HedwigClientImpl.close(HedwigClientImpl.java:234)
   at org.apache.hedwig.client.HedwigClient.close(HedwigClient.java:70)
   at 
 org.apache.hedwig.server.topics.TestConcurrentTopicAcquisition.tearDown(TestConcurrentTopicAcquisition.java:99)
   at junit.framework.TestCase.runBare(TestCase.java:140)
   at junit.framework.TestResult$1.protect(TestResult.java:110)
   at junit.framework.TestResult.runProtected(TestResult.java:128)
   at junit.framework.TestResult.run(TestResult.java:113)
   at junit.framework.TestCase.run(TestCase.java:124)
   at junit.framework.TestSuite.runTest(TestSuite.java:232)
   at junit.framework.TestSuite.run(TestSuite.java:227)
   at 
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (BOOKKEEPER-251) Noise error message printed when scanning entry log files those have been garbage collected.

2012-05-16 Thread Sijie Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/BOOKKEEPER-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sijie Guo updated BOOKKEEPER-251:
-

Attachment: BK-251.diff_v2

Brought the patch up to date with the latest trunk.

 Noise error message printed when scanning entry log files those have been 
 garbage collected.
 

 Key: BOOKKEEPER-251
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-251
 Project: Bookkeeper
  Issue Type: Improvement
  Components: bookkeeper-server
Affects Versions: 4.1.0
Reporter: Sijie Guo
Assignee: Sijie Guo
 Fix For: 4.1.0

 Attachments: BK-251.diff, BK-251.diff_v2


 Currently, due to the messy scan mechanism deployed by the garbage collector 
 thread, the following noisy error message is printed when scanning entry log 
 files that have already been garbage collected.
 {quote}
 2012-05-09 15:58:52,742 - INFO  
 [GarbageCollectorThread:GarbageCollectorThread@466] - Extracting entry log 
 meta from entryLogId: 0
 2012-05-09 15:58:52,743 - WARN  [GarbageCollectorThread:EntryLogger@386] - 
 Failed to get channel to scan entry log: 0.log
 2012-05-09 15:58:52,743 - WARN  
 [GarbageCollectorThread:GarbageCollectorThread@473] - Premature exception 
 when processing 0recovery will take care of the problem
 java.io.FileNotFoundException: No file for log 0
 at 
 org.apache.bookkeeper.bookie.EntryLogger.findFile(EntryLogger.java:366)
 at 
 org.apache.bookkeeper.bookie.EntryLogger.getChannelForLogId(EntryLogger.java:340)
 at 
 org.apache.bookkeeper.bookie.EntryLogger.scanEntryLog(EntryLogger.java:384)
 at 
 org.apache.bookkeeper.bookie.GarbageCollectorThread.extractMetaFromEntryLog(GarbageCollectorThread.java:485)
 at 
 org.apache.bookkeeper.bookie.GarbageCollectorThread.extractMetaFromEntryLogs(GarbageCollectorThread.java:470)
 at 
 org.apache.bookkeeper.bookie.GarbageCollectorThread.run(GarbageCollectorThread.java:189)
 {quote}
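
The attached patch isn't reproduced here; purely as an illustration of the 
intent, assuming the fix amounts to treating an already-collected log file as an 
expected case (names below are made up):
{code}
import java.io.FileNotFoundException;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class EntryLogScanSketch {
    private static final Logger LOG = LoggerFactory.getLogger(EntryLogScanSketch.class);

    interface Scan {
        void run(long entryLogId) throws Exception;   // stand-in for the real scan call
    }

    // An entry log that has already been garbage collected is expected, so it is
    // reported quietly instead of as a WARN with a stack trace.
    static void scanQuietly(Scan scan, long entryLogId) {
        try {
            scan.run(entryLogId);
        } catch (FileNotFoundException e) {
            LOG.debug("Entry log {} already garbage collected, skipping", entryLogId);
        } catch (Exception e) {
            LOG.warn("Premature exception when processing entry log " + entryLogId, e);
        }
    }
}
{code}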

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Review Request: BOOKKEEPER-146 TestConcurrentTopicAcquisition sometimes hangs

2012-05-16 Thread Sijie Guo

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5144/#review7950
---


Thanks Ivan. The patch seems great. Just some slight comments.


hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigPublisher.java
https://reviews.apache.org/r/5144/#comment17288

It would be better to move the line 'closed = true;' to the top of close(), 
because you use 'closed' to prevent a new channel from being added in 
storeHost2ChannelMapping.



hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigSubscriber.java
https://reviews.apache.org/r/5144/#comment17289

Do we need to put the closing logic in the closeLock synchronization block?

If we have acquired closeLock and set closed to true, no channel can be 
put into topicSubscriber2Channel again.


- Sijie


On 2012-05-16 15:48:50, Ivan Kelly wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/5144/
 ---
 
 (Updated 2012-05-16 15:48:50)
 
 
 Review request for bookkeeper.
 
 
 Summary
 ---
 
 It's been running in a loop for 30 minutes now, and doesn't seem to be 
 hanging. Main problem was that even after the hedwig client was closed, a 
 subscription request could succeed and add a channel to the channel list, 
 though hedwig client had already moved by the point at which it closed them.
 
 
 This addresses bug BOOKKEEPER-146.
 https://issues.apache.org/jira/browse/BOOKKEEPER-146
 
 
 Diffs
 -
 
   
 hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigSubscriber.java
  0c8634c 
   
 hedwig-client/src/main/java/org/apache/hedwig/client/netty/WriteCallback.java 
 a8552f4 
   
 hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigPublisher.java
  603766c 
   
 hedwig-client/src/main/java/org/apache/hedwig/client/netty/ConnectCallback.java
  f5077b0 
   
 hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigClientImpl.java
  806cdef 
 
 Diff: https://reviews.apache.org/r/5144/diff
 
 
 Testing
 ---
 
 
 Thanks,
 
 Ivan
 




[jira] [Commented] (BOOKKEEPER-146) TestConcurrentTopicAcquisition sometimes hangs

2012-05-16 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277567#comment-13277567
 ] 

jirapos...@reviews.apache.org commented on BOOKKEEPER-146:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5144/#review7950
---


Thanks Ivan. The patch seems great. Just some slight comments.


hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigPublisher.java
https://reviews.apache.org/r/5144/#comment17288

It would be better to move the line 'closed = true;' to the top of close(), 
because you use 'closed' to prevent a new channel from being added in 
storeHost2ChannelMapping.



hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigSubscriber.java
https://reviews.apache.org/r/5144/#comment17289

Do we need to put the closing logic in the closeLock synchronization block?

If we have acquired closeLock and set closed to true, no channel can be 
put into topicSubscriber2Channel again.


- Sijie


On 2012-05-16 15:48:50, Ivan Kelly wrote:
bq.  
bq.  ---
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/5144/
bq.  ---
bq.  
bq.  (Updated 2012-05-16 15:48:50)
bq.  
bq.  
bq.  Review request for bookkeeper.
bq.  
bq.  
bq.  Summary
bq.  ---
bq.  
bq.  It's been running in a loop for 30 minutes now, and doesn't seem to be 
hanging. Main problem was that even after the hedwig client was closed, a 
subscription request could succeed and add a channel to the channel list, 
though hedwig client had already moved by the point at which it closed them.
bq.  
bq.  
bq.  This addresses bug BOOKKEEPER-146.
bq.  https://issues.apache.org/jira/browse/BOOKKEEPER-146
bq.  
bq.  
bq.  Diffs
bq.  -
bq.  
bq.
hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigSubscriber.java
 0c8634c 
bq.
hedwig-client/src/main/java/org/apache/hedwig/client/netty/WriteCallback.java 
a8552f4 
bq.
hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigPublisher.java 
603766c 
bq.
hedwig-client/src/main/java/org/apache/hedwig/client/netty/ConnectCallback.java 
f5077b0 
bq.
hedwig-client/src/main/java/org/apache/hedwig/client/netty/HedwigClientImpl.java
 806cdef 
bq.  
bq.  Diff: https://reviews.apache.org/r/5144/diff
bq.  
bq.  
bq.  Testing
bq.  ---
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ivan
bq.  
bq.



 TestConcurrentTopicAcquisition sometimes hangs
 --

 Key: BOOKKEEPER-146
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-146
 Project: Bookkeeper
  Issue Type: Bug
Reporter: Ivan Kelly
Assignee: Ivan Kelly
Priority: Blocker
 Fix For: 4.1.0

 Attachments: BOOKKEEPER-146.diff


 to repro
 {code}
 while [ $? = 0 ]; do mvn test -Dtest=TestConcurrentTopicAcquisition; done
 {code}
 The stacktrace where it hangs looks very like BOOKKEEPER-5
 {code}
 main prio=5 tid=102801000 nid=0x100601000 waiting on condition [1005ff000]
java.lang.Thread.State: TIMED_WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
   - parking to wait for  7bd8e1090 (a 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
   at 
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
   at 
 java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1253)
   at 
 org.jboss.netty.util.internal.ExecutorUtil.terminate(ExecutorUtil.java:107)
   at 
 org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.releaseExternalResources(NioClientSocketChannelFactory.java:143)
   at 
 org.apache.hedwig.client.netty.HedwigClientImpl.close(HedwigClientImpl.java:234)
   at org.apache.hedwig.client.HedwigClient.close(HedwigClient.java:70)
   at 
 org.apache.hedwig.server.topics.TestConcurrentTopicAcquisition.tearDown(TestConcurrentTopicAcquisition.java:99)
   at junit.framework.TestCase.runBare(TestCase.java:140)
   at junit.framework.TestResult$1.protect(TestResult.java:110)
   at junit.framework.TestResult.runProtected(TestResult.java:128)
   at junit.framework.TestResult.run(TestResult.java:113)
   at junit.framework.TestCase.run(TestCase.java:124)
   at junit.framework.TestSuite.runTest(TestSuite.java:232)
   at junit.framework.TestSuite.run(TestSuite.java:227)
   at 
 

[jira] [Commented] (BOOKKEEPER-263) ZK ledgers root path is hard coded

2012-05-16 Thread Sijie Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277576#comment-13277576
 ] 

Sijie Guo commented on BOOKKEEPER-263:
--

Thanks Aniruddha. The patch seems good. Just one comment: it seems that 
AVAILABLE_NODE is spread over several files. Could we consider moving it to a 
common place (which could be shared by client and server), such as 
AbstractConfiguration with a method getAvailableBookiesPath(), similar to what 
Hedwig did in ServerConfiguration to manage its znode paths?
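
A rough sketch of what such a shared helper might look like (the property key 
and default below are guesses, not the actual BookKeeper configuration):
{code}
import org.apache.commons.configuration.CompositeConfiguration;

public abstract class AbstractConfiguration extends CompositeConfiguration {
    protected static final String ZK_LEDGERS_ROOT_PATH = "zkLedgersRootPath";
    protected static final String AVAILABLE_NODE = "available";

    // Ledgers root path in ZooKeeper, configurable instead of hard coded.
    public String getZkLedgersRootPath() {
        return getString(ZK_LEDGERS_ROOT_PATH, "/ledgers");
    }

    // Single place for both client and server to derive the registered-bookies path.
    public String getAvailableBookiesPath() {
        return getZkLedgersRootPath() + "/" + AVAILABLE_NODE;
    }
}
{code}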

 ZK ledgers root path is hard coded
 --

 Key: BOOKKEEPER-263
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-263
 Project: Bookkeeper
  Issue Type: Bug
  Components: bookkeeper-client, bookkeeper-server
Affects Versions: 4.1.0
Reporter: Aniruddha
Assignee: Aniruddha
 Fix For: 4.1.0

 Attachments: BK-263.patch


 Currently the ZK ledger root path is not picked up from the config file (It 
 is hardcoded). This patch fixes this. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira