[jira] [Commented] (HDFS-4266) BKJM: Separate write and ack quorum

2014-07-18 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066357#comment-14066357
 ] 

Ivan Kelly commented on HDFS-4266:
--

lgtm +1

> BKJM: Separate write and ack quorum
> ---
>
> Key: HDFS-4266
> URL: https://issues.apache.org/jira/browse/HDFS-4266
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Reporter: Ivan Kelly
>Assignee: Rakesh R
> Attachments: 001-HDFS-4266.patch, 002-HDFS-4266.patch, 
> 003-HDFS-4266.patch
>
>
> BOOKKEEPER-208 allows the ack and write quorums to be different sizes to 
> allow writes to be unaffected by any bookie failure. BKJM should be able to 
> take advantage of this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-4266) BKJM: Separate write and ack quorum

2014-07-18 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066305#comment-14066305
 ] 

Ivan Kelly commented on HDFS-4266:
--

That would work.

> BKJM: Separate write and ack quorum
> ---
>
> Key: HDFS-4266
> URL: https://issues.apache.org/jira/browse/HDFS-4266
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Reporter: Ivan Kelly
>Assignee: Rakesh R
> Attachments: 001-HDFS-4266.patch, 002-HDFS-4266.patch
>
>
> BOOKKEEPER-208 allows the ack and write quorums to be different sizes to 
> allow writes to be unaffected by any bookie failure. BKJM should be able to 
> take advantage of this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-4265) BKJM doesn't take advantage of speculative reads

2014-07-17 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064972#comment-14064972
 ] 

Ivan Kelly commented on HDFS-4265:
--

lgtm +1

> BKJM doesn't take advantage of speculative reads
> 
>
> Key: HDFS-4265
> URL: https://issues.apache.org/jira/browse/HDFS-4265
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: 2.2.0
>Reporter: Ivan Kelly
>Assignee: Rakesh R
> Attachments: 001-HDFS-4265.patch, 002-HDFS-4265.patch, 
> 003-HDFS-4265.patch, 004-HDFS-4265.patch
>
>
> BookKeeperEditLogInputStream reads one entry at a time, so it doesn't take 
> advantage of the speculative read mechanism introduced by BOOKKEEPER-336.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-4266) BKJM: Separate write and ack quorum

2014-07-10 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057654#comment-14057654
 ] 

Ivan Kelly commented on HDFS-4266:
--

Looks good to me, but the test could be a little flaky, as it seems to rely on 
finishing before the client socket read timeout triggers. If that timeout fired, 
the client would try to replace the sleeping bookie and then fail because it 
can't build an ensemble. This behaviour is something we plan to change (if the 
ack quorum is met, the client shouldn't fail), but we haven't done so yet.
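
To make the scenario concrete, here is a minimal sketch of the configuration 
under discussion, with one bookie allowed to lag. The key names and values are 
assumptions for illustration, not necessarily the constants in the patch:

{code}
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// An ensemble of 3 bookies, with entries written to all 3...
conf.setInt("dfs.namenode.bookkeeperjournal.ensemble-size", 3);
conf.setInt("dfs.namenode.bookkeeperjournal.quorum-size", 3);
// ...but only 2 acks required, so one sleeping bookie cannot block adds.
conf.setInt("dfs.namenode.bookkeeperjournal.ack-quorum-size", 2);
{code}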

> BKJM: Separate write and ack quorum
> ---
>
> Key: HDFS-4266
> URL: https://issues.apache.org/jira/browse/HDFS-4266
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Reporter: Ivan Kelly
>Assignee: Rakesh R
> Attachments: 001-HDFS-4266.patch, 002-HDFS-4266.patch
>
>
> BOOKKEEPER-208 allows the ack and write quorums to be different sizes to 
> allow writes to be unaffected by any bookie failure. BKJM should be able to 
> take advantage of this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-4265) BKJM doesn't take advantage of speculative reads

2014-07-10 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057646#comment-14057646
 ] 

Ivan Kelly commented on HDFS-4265:
--

I think the constant names need the MS suffix also:
BKJM_BOOKKEEPER_SPECULATIVE_READ_TIMEOUT -> 
BKJM_BOOKKEEPER_SPECULATIVE_READ_TIMEOUT_MS

Where the test code sets it, the unit is unclear, so it's better to be explicit 
in the constant.

In the test, how does it verify that speculative reads are actually taking 
place? I think for this, you should set the bookkeeper client request timeout 
to 10 days or something.
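
A minimal sketch of that test setup, assuming Hadoop Configuration keys with 
explicit units (both key names here are illustrative, not confirmed constants):

{code}
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Make the non-speculative read path effectively never win the race...
conf.setInt("dfs.namenode.bookkeeperjournal.readEntryTimeoutSec",
    10 * 24 * 60 * 60); // ~10 days, in seconds
// ...so the read can only complete if a speculative request fires quickly.
conf.setInt("dfs.namenode.bookkeeperjournal.speculativeReadTimeoutMs", 100);
{code}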

> BKJM doesn't take advantage of speculative reads
> 
>
> Key: HDFS-4265
> URL: https://issues.apache.org/jira/browse/HDFS-4265
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: 2.2.0
>Reporter: Ivan Kelly
>Assignee: Rakesh R
> Attachments: 001-HDFS-4265.patch, 002-HDFS-4265.patch, 
> 003-HDFS-4265.patch
>
>
> BookKeeperEditLogInputStream reads one entry at a time, so it doesn't take 
> advantage of the speculative read mechanism introduced by BOOKKEEPER-336.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-4265) BKJM doesn't take advantage of speculative reads

2014-07-07 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053820#comment-14053820
 ] 

Ivan Kelly commented on HDFS-4265:
--

Change looks good, but you should add units to the conf key.
dfs.namenode.bookkeeperjournal.speculativeReadTimeout -> 
dfs.namenode.bookkeeperjournal.speculativeReadTimeoutMs
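
Concretely, the renamed constant might look like this (the key string follows 
the rename above; the default value is an assumption for illustration):

{code}
// Hypothetical constant with the unit made explicit in both the Java
// name and the configuration key:
public static final String BKJM_BOOKKEEPER_SPECULATIVE_READ_TIMEOUT_MS =
    "dfs.namenode.bookkeeperjournal.speculativeReadTimeoutMs";
public static final int BKJM_BOOKKEEPER_SPECULATIVE_READ_TIMEOUT_MS_DEFAULT = 2000;
{code}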

> BKJM doesn't take advantage of speculative reads
> 
>
> Key: HDFS-4265
> URL: https://issues.apache.org/jira/browse/HDFS-4265
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: 2.2.0
>Reporter: Ivan Kelly
>Assignee: Rakesh R
> Attachments: 001-HDFS-4265.patch, 002-HDFS-4265.patch
>
>
> BookKeeperEditLogInputStream reads one entry at a time, so it doesn't take 
> advantage of the speculative read mechanism introduced by BOOKKEEPER-336.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-5411) Update Bookkeeper dependency to 4.2.3

2014-07-07 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053816#comment-14053816
 ] 

Ivan Kelly commented on HDFS-5411:
--

change lgtm +1

> Update Bookkeeper dependency to 4.2.3
> -
>
> Key: HDFS-5411
> URL: https://issues.apache.org/jira/browse/HDFS-5411
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Robert Rati
>Assignee: Rakesh R
>Priority: Minor
> Attachments: HDFS-5411.patch, HDFS-5411.patch
>
>
> Update the bookkeeper dependency to 4.2.3. This eases compilation on Fedora 
> platforms.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-4266) BKJM: Separate write and ack quorum

2013-07-19 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713821#comment-13713821
 ] 

Ivan Kelly commented on HDFS-4266:
--

The patch makes an ack quorum mandatory. This breaks existing configs. The ack 
quorum should default to the write quorum, if the configuration is missing.
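
A minimal sketch of the suggested defaulting, assuming illustrative key names 
(not necessarily those used in the patch):

{code}
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
int quorumSize = conf.getInt("dfs.namenode.bookkeeperjournal.quorum-size", 2);
// An absent ack-quorum setting falls back to the write quorum, so
// existing configurations keep their old semantics.
int ackQuorumSize =
    conf.getInt("dfs.namenode.bookkeeperjournal.ack-quorum-size", quorumSize);
{code}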

> BKJM: Separate write and ack quorum
> ---
>
> Key: HDFS-4266
> URL: https://issues.apache.org/jira/browse/HDFS-4266
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Reporter: Ivan Kelly
>Assignee: Rakesh R
> Attachments: 001-HDFS-4266.patch
>
>
> BOOKKEEPER-208 allows the ack and write quorums to be different sizes to 
> allow writes to be unaffected by any bookie failure. BKJM should be able to 
> take advantage of this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4265) BKJM doesn't take advantage of speculative reads

2013-07-19 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713827#comment-13713827
 ] 

Ivan Kelly commented on HDFS-4265:
--

+1 for the patch.

> BKJM doesn't take advantage of speculative reads
> 
>
> Key: HDFS-4265
> URL: https://issues.apache.org/jira/browse/HDFS-4265
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Reporter: Ivan Kelly
>Assignee: Rakesh R
> Attachments: 001-HDFS-4265.patch
>
>
> BookKeeperEditLogInputStream reads one entry at a time, so it doesn't take 
> advantage of the speculative read mechanism introduced by BOOKKEEPER-336.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4445) Not all BKJM ledgers are checked while tailing, so failover will fail

2013-01-31 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13567844#comment-13567844
 ] 

Ivan Kelly commented on HDFS-4445:
--

lgtm. +1


> Not all BKJM ledgers are checked while tailing, so failover will fail
> --
>
> Key: HDFS-4445
> URL: https://issues.apache.org/jira/browse/HDFS-4445
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 2.0.3-alpha
>Reporter: Vinay
>Assignee: Vinay
>Priority: Blocker
> Attachments: HDFS-4445.patch
>
>
> After the fix of HDFS-4130, not all editlog ledgers are iterated over if a 
> ledger's edits are below fromTxId.
> The problematic part is the following code inside 
> BookKeeperJournalManager#selectInputStreams(..):
> {code}
> if (fromTxId >= l.getFirstTxId() && fromTxId <= lastTxId) {
>   LedgerHandle h;
>   if (l.isInProgress()) { // we don't want to fence the current journal
>     h = bkc.openLedgerNoRecovery(l.getLedgerId(), digestType,
>         digestpw.getBytes());
>   } else {
>     h = bkc.openLedger(l.getLedgerId(), digestType, digestpw.getBytes());
>   }
>   elis = new BookKeeperEditLogInputStream(h, l);
>   elis.skipTo(fromTxId);
> } else {
>   return;
> }
> {code}
> The else block should have a continue statement instead of return.
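
A minimal sketch of the proposed fix, reusing the identifiers from the snippet 
above (the loop shape is illustrative, not the exact patch):

{code}
for (EditLogLedgerMetadata l : getLedgerList(inProgressOk)) {
  if (fromTxId >= l.getFirstTxId() && fromTxId <= l.getLastTxId()) {
    // open the ledger and add the stream, as in the snippet above
  } else {
    // Keep examining the remaining ledgers instead of aborting the scan.
    continue;
  }
}
{code}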

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4286) Changes from BOOKKEEPER-203 broke the capability of including the bookkeeper-server jar in a hidden package of BKJM

2012-12-07 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-4286:
-

Issue Type: Sub-task  (was: Bug)
Parent: HDFS-3399

> Changes from BOOKKEEPER-203 broke the capability of including the 
> bookkeeper-server jar in a hidden package of BKJM
> --
>
> Key: HDFS-4286
> URL: https://issues.apache.org/jira/browse/HDFS-4286
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Vinay
> Fix For: 3.0.0, 2.0.3-alpha
>
>
> BOOKKEEPER-203 introduced changes to LedgerLayout to include the 
> ManagerFactoryClass instead of the ManagerFactoryName.
> Because of this, BKJM cannot shade the bookkeeper-server jar inside the BKJM 
> jar: the LAYOUT znode created by the BookieServer is not readable by BKJM, as 
> it references classes in hidden packages (and the same problem applies vice 
> versa).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Moved] (HDFS-4286) Changes from BOOKKEEPER-203 broke the capability of including the bookkeeper-server jar in a hidden package of BKJM

2012-12-07 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly moved BOOKKEEPER-478 to HDFS-4286:
-

Fix Version/s: (was: 4.2.0)
   2.0.3-alpha
   3.0.0
  Key: HDFS-4286  (was: BOOKKEEPER-478)
  Project: Hadoop HDFS  (was: Bookkeeper)

> Changes from BOOKKEEPER-203 broke the capability of including the 
> bookkeeper-server jar in a hidden package of BKJM
> --
>
> Key: HDFS-4286
> URL: https://issues.apache.org/jira/browse/HDFS-4286
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Vinay
> Fix For: 3.0.0, 2.0.3-alpha
>
>
> BOOKKEEPER-203 introduced changes to LedgerLayout to include the 
> ManagerFactoryClass instead of the ManagerFactoryName.
> Because of this, BKJM cannot shade the bookkeeper-server jar inside the BKJM 
> jar: the LAYOUT znode created by the BookieServer is not readable by BKJM, as 
> it references classes in hidden packages (and the same problem applies vice 
> versa).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4266) BKJM: Separate write and ack quorum

2012-12-04 Thread Ivan Kelly (JIRA)
Ivan Kelly created HDFS-4266:


 Summary: BKJM: Separate write and ack quorum
 Key: HDFS-4266
 URL: https://issues.apache.org/jira/browse/HDFS-4266
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Ivan Kelly


BOOKKEEPER-208 allows the ack and write quorums to be different sizes to allow 
writes to be unaffected by any bookie failure. BKJM should be able to take 
advantage of this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4265) BKJM doesn't take advantage of speculative reads

2012-12-04 Thread Ivan Kelly (JIRA)
Ivan Kelly created HDFS-4265:


 Summary: BKJM doesn't take advantage of speculative reads
 Key: HDFS-4265
 URL: https://issues.apache.org/jira/browse/HDFS-4265
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Ivan Kelly


BookKeeperEditLogInputStream reads one entry at a time, so it doesn't take 
advantage of the speculative read mechanism introduced by BOOKKEEPER-336.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4130) BKJM: Reading the editlog at NN startup using bkjm is not efficient

2012-12-04 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509677#comment-13509677
 ] 

Ivan Kelly commented on HDFS-4130:
--

The patch looks good to me +1.

> BKJM: Reading the editlog at NN startup using bkjm is not efficient
> -
>
> Key: HDFS-4130
> URL: https://issues.apache.org/jira/browse/HDFS-4130
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, performance
>Affects Versions: 3.0.0, 2.0.2-alpha
>Reporter: Han Xiao
> Attachments: HDFS-4130.patch, HDFS-4130-v2.patch, HDFS-4130-v3.patch
>
>
> Currently, the method BookKeeperJournalManager.selectInputStreams is written 
> like:
> {code}
> while (true) {
>   EditLogInputStream elis;
>   try {
>     elis = getInputStream(fromTxId, inProgressOk);
>   } catch (IOException e) {
>     LOG.error(e);
>     return;
>   }
>   if (elis == null) {
>     return;
>   }
>   streams.add(elis);
>   if (elis.getLastTxId() == HdfsConstants.INVALID_TXID) {
>     return;
>   }
>   fromTxId = elis.getLastTxId() + 1;
> }
> {code}
> The EditLogInputStream is obtained from getInputStream(), which reads the 
> ledgers from zookeeper on every call.
> This becomes very costly in time when the number of ledgers grows large, and 
> re-reading the ledgers from zk is not necessary on every call of 
> getInputStream() (see the sketch after the stack diff below).
> The log of the time wasted here is as follows:
> 2012-10-30 16:44:52,995 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> Caching file names occuring more than 10 times
> 2012-10-30 16:49:24,643 INFO 
> hidden.bkjournal.org.apache.bookkeeper.proto.PerChannelBookieClient: 
> Successfully connected to bookie: /167.52.1.121:318
> The stack of the process while blocked between those two log lines looks like:
> "main" prio=10 tid=0x4011f000 nid=0x39ba in Object.wait() 
> \[0x7fca020fe000\]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:485)
> at 
> hidden.bkjournal.org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1253)
> \- locked <0x0006fb8495a8> (a 
> hidden.bkjournal.org.apache.zookeeper.ClientCnxn$Packet)
> at 
> hidden.bkjournal.org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1129)
> at 
> org.apache.hadoop.contrib.bkjournal.utils.RetryableZookeeper.getData(RetryableZookeeper.java:501)
> at 
> hidden.bkjournal.org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1160)
> at 
> org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.read(EditLogLedgerMetadata.java:113)
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:725)
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getInputStream(BookKeeperJournalManager.java:442)
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.selectInputStreams(BookKeeperJournalManager.java:480)
> 
> Between two different times, the diff of the stacks is:
> diff stack stack2
> 1c1
> < 2012-10-30 16:44:53
> ---
> > 2012-10-30 16:46:17
> 106c106
> <   - locked <0x0006fb8495a8> (a 
> hidden.bkjournal.org.apache.zookeeper.ClientCnxn$Packet)
> ---
> >   - locked <0x0006fae58468> (a 
> > hidden.bkjournal.org.apache.zookeeper.ClientCnxn$Packet)
> In our environment, the waiting time can even reach tens of minutes.
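
A minimal sketch of the direction suggested by the description: fetch the 
ledger list from ZooKeeper once and build every stream from that single 
snapshot. Method and type names are taken from the stack trace above; 
makeStream is a hypothetical helper:

{code}
import java.util.List;

List<EditLogLedgerMetadata> ledgers = getLedgerList(inProgressOk); // one ZK read
for (EditLogLedgerMetadata l : ledgers) {
  if (l.getLastTxId() < fromTxId) {
    continue; // segment already covered; no further ZK round trip needed
  }
  streams.add(makeStream(l, fromTxId)); // hypothetical helper
}
{code}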

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4154) BKJM: Two namenodes using bkjm can race to create the version znode

2012-11-16 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498998#comment-13498998
 ] 

Ivan Kelly commented on HDFS-4154:
--

[~umamaheswararao] Yes, it's very straightforward to fix. 

[~yians] It would be great if you did that :)
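
A minimal sketch of that straightforward fix (the znode path is taken from the 
stack trace below; the zkc handle and versionData payload are assumptions):

{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;

try {
  zkc.create("/hdfsjournal/version", versionData,
             Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
} catch (KeeperException.NodeExistsException e) {
  // The other namenode won the race; validate the existing znode's
  // contents instead of aborting namenode startup.
}
{code}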

> BKJM: Two namenodes using bkjm can race to create the version znode
> --
>
> Key: HDFS-4154
> URL: https://issues.apache.org/jira/browse/HDFS-4154
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: 3.0.0, 2.0.3-alpha
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
>
> One of them will get the following error:
> 2012-11-06 10:04:00,200 INFO 
> hidden.bkjournal.org.apache.zookeeper.ClientCnxn: Session establishment 
> complete on server 109-231-69-172.flexiscale.com/109.231.69.172:2181, 
> sessionid = 0x13ad528fcfe0005, negotiated timeout = 4000
> 2012-11-06 10:04:00,710 FATAL 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
> java.lang.IllegalArgumentException: Unable to construct journal, 
> bookkeeper://109.231.69.172:2181;109.231.69.173:2181;109.231.69.174:2181/hdfsjournal
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.createJournal(FSEditLog.java:1251)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournals(FSEditLog.java:226)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.initSharedJournalsForRead(FSEditLog.java:206)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.initEditLog(FSImage.java:657)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:590)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:259)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:544)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:423)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:385)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:401)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:435)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:611)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:592)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1135)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1201)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.createJournal(FSEditLog.java:1249)
> ... 14 more
> Caused by: java.io.IOException: Error initializing zk
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.<init>(BookKeeperJournalManager.java:233)
> ... 19 more
> Caused by: 
> hidden.bkjournal.org.apache.zookeeper.KeeperException$NodeExistsException: 
> KeeperErrorCode = NodeExists for /hdfsjournal/version
> at 
> hidden.bkjournal.org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
> at 
> hidden.bkjournal.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at 
> hidden.bkjournal.org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:778)
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.<init>(BookKeeperJournalManager.java:222)
> ... 19 more

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3623) BKJM: zkLatchWaitTimeout hard coded to 6000. Make use of ZKSessionTimeout instead.

2012-11-15 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497957#comment-13497957
 ] 

Ivan Kelly commented on HDFS-3623:
--

lgtm +1

> BKJM: zkLatchWaitTimeout hard coded to 6000. Make use of ZKSessionTimeout 
> instead.
> --
>
> Key: HDFS-3623
> URL: https://issues.apache.org/jira/browse/HDFS-3623
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-3623.patch, HDFS-3628.patch
>
>
> {code}
> if (!zkConnectLatch.await(6000, TimeUnit.MILLISECONDS)) {
> {code}
> we can make use of session timeout instead of hardcoding this value.
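
A minimal sketch of the suggested change (the configuration key and default 
value here are assumptions for illustration):

{code}
import java.io.IOException;
import java.util.concurrent.TimeUnit;

int zkSessionTimeoutMs =
    conf.getInt("dfs.namenode.bookkeeperjournal.zk.session.timeout", 3000);
// Wait for the configured session timeout rather than a hardcoded 6000.
if (!zkConnectLatch.await(zkSessionTimeoutMs, TimeUnit.MILLISECONDS)) {
  throw new IOException("Error connecting to zookeeper");
}
{code}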

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4154) Two namenodes using bkjm can race to create the version znode

2012-11-06 Thread Ivan Kelly (JIRA)
Ivan Kelly created HDFS-4154:


 Summary: Two namenodes using bkjm can race to create the version 
znode
 Key: HDFS-4154
 URL: https://issues.apache.org/jira/browse/HDFS-4154
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Ivan Kelly
Assignee: Ivan Kelly
 Fix For: 3.0.0, 2.0.3-alpha


One of them will get the following error:

2012-11-06 10:04:00,200 INFO hidden.bkjournal.org.apache.zookeeper.ClientCnxn: 
Session establishment complete on server 
109-231-69-172.flexiscale.com/109.231.69.172:2181, sessionid = 
0x13ad528fcfe0005, negotiated timeout = 4000
2012-11-06 10:04:00,710 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: 
Exception in namenode join
java.lang.IllegalArgumentException: Unable to construct journal, 
bookkeeper://109.231.69.172:2181;109.231.69.173:2181;109.231.69.174:2181/hdfsjournal
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.createJournal(FSEditLog.java:1251)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournals(FSEditLog.java:226)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.initSharedJournalsForRead(FSEditLog.java:206)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.initEditLog(FSImage.java:657)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:590)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:259)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:544)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:423)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:385)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:401)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:435)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:611)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:592)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1135)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1201)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.createJournal(FSEditLog.java:1249)
... 14 more
Caused by: java.io.IOException: Error initializing zk
at 
org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.<init>(BookKeeperJournalManager.java:233)
... 19 more
Caused by: 
hidden.bkjournal.org.apache.zookeeper.KeeperException$NodeExistsException: 
KeeperErrorCode = NodeExists for /hdfsjournal/version
at 
hidden.bkjournal.org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
at 
hidden.bkjournal.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at 
hidden.bkjournal.org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:778)
at 
org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.<init>(BookKeeperJournalManager.java:222)
... 19 more

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-10-30 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3809:
-

Attachment: 0001-HDFS-3809.-Make-BKJM-use-protobufs-for-all-serializa.patch

Ah, I had generated it from what had been committed, so that's why it was 
there. Removed now.

> Make BKJM use protobufs for all serialization with ZK
> -
>
> Key: HDFS-3809
> URL: https://issues.apache.org/jira/browse/HDFS-3809
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: 
> 0001-HDFS-3809.-Make-BKJM-use-protobufs-for-all-serializa.patch, 
> 0001-HDFS-3809.-Make-BKJM-use-protobufs-for-all-serializa.patch, 
> 0004-HDFS-3809-for-branch-2.patch, HDFS-3809.diff, HDFS-3809.diff, 
> HDFS-3809.diff
>
>
> HDFS uses protobufs for serialization in many places. Protobufs allow fields 
> to be added without breaking bc or requiring new parsing code to be written. 
> For this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-10-30 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487137#comment-13487137
 ] 

Ivan Kelly commented on HDFS-3809:
--

HDFS-4121 caused this. New patch attached.

> Make BKJM use protobufs for all serialization with ZK
> -
>
> Key: HDFS-3809
> URL: https://issues.apache.org/jira/browse/HDFS-3809
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: 
> 0001-HDFS-3809.-Make-BKJM-use-protobufs-for-all-serializa.patch, 
> 0004-HDFS-3809-for-branch-2.patch, HDFS-3809.diff, HDFS-3809.diff, 
> HDFS-3809.diff
>
>
> HDFS uses protobufs for serialization in many places. Protobufs allow fields 
> to be added without breaking bc or requiring new parsing code to be written. 
> For this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-10-30 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3809:
-

Target Version/s: 2.0.1-alpha, 3.0.0  (was: 3.0.0, 2.0.1-alpha)
  Status: Patch Available  (was: Reopened)

> Make BKJM use protobufs for all serialization with ZK
> -
>
> Key: HDFS-3809
> URL: https://issues.apache.org/jira/browse/HDFS-3809
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: 
> 0001-HDFS-3809.-Make-BKJM-use-protobufs-for-all-serializa.patch, 
> 0004-HDFS-3809-for-branch-2.patch, HDFS-3809.diff, HDFS-3809.diff, 
> HDFS-3809.diff
>
>
> HDFS uses protobufs for serialization in many places. Protobufs allow fields 
> to be added without breaking bc or requiring new parsing code to be written. 
> For this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-10-30 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3809:
-

Attachment: 0001-HDFS-3809.-Make-BKJM-use-protobufs-for-all-serializa.patch

> Make BKJM use protobufs for all serialization with ZK
> -
>
> Key: HDFS-3809
> URL: https://issues.apache.org/jira/browse/HDFS-3809
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: 
> 0001-HDFS-3809.-Make-BKJM-use-protobufs-for-all-serializa.patch, 
> 0004-HDFS-3809-for-branch-2.patch, HDFS-3809.diff, HDFS-3809.diff, 
> HDFS-3809.diff
>
>
> HDFS uses protobufs for serialization in many places. Protobufs allow fields 
> to be added without breaking bc or requiring new parsing code to be written. 
> For this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3789) JournalManager#format() should be able to throw IOException

2012-10-24 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3789:
-

Fix Version/s: 2.0.3-alpha

> JournalManager#format() should be able to throw IOException
> ---
>
> Key: HDFS-3789
> URL: https://issues.apache.org/jira/browse/HDFS-3789
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: 0003-HDFS-3789-for-branch-2.patch, HDFS-3789.diff
>
>
> Currently JournalManager#format cannot throw any exception. As format can 
> fail, we should be able to propagate this failure upwards. Otherwise, format 
> will fail silently, and the admin will start using the cluster with a 
> failed/unusable journal manager.
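
A sketch of the interface change being described (the signature is simplified 
for illustration; the real interface has more methods):

{code}
import java.io.Closeable;
import java.io.IOException;
import org.apache.hadoop.hdfs.server.protocol.NamespaceInfo;

public interface JournalManager extends Closeable {
  // format can fail, so the failure propagates instead of being swallowed.
  void format(NamespaceInfo nsInfo) throws IOException;
}
{code}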

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-10-24 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3809:
-

Fix Version/s: 2.0.3-alpha

> Make BKJM use protobufs for all serialization with ZK
> -
>
> Key: HDFS-3809
> URL: https://issues.apache.org/jira/browse/HDFS-3809
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: 0004-HDFS-3809-for-branch-2.patch, HDFS-3809.diff, 
> HDFS-3809.diff, HDFS-3809.diff
>
>
> HDFS uses protobufs for serialization in many places. Protobufs allow fields 
> to be added without breaking bc or requiring new parsing code to be written. 
> For this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-10-24 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3809:
-

Attachment: 0004-HDFS-3809-for-branch-2.patch

Some offset changes.

> Make BKJM use protobufs for all serialization with ZK
> -
>
> Key: HDFS-3809
> URL: https://issues.apache.org/jira/browse/HDFS-3809
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: 0004-HDFS-3809-for-branch-2.patch, HDFS-3809.diff, 
> HDFS-3809.diff, HDFS-3809.diff
>
>
> HDFS uses protobufs for serialization in many places. Protobufs allow fields 
> to be added without breaking bc or requiring new parsing code to be written. 
> For this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3789) JournalManager#format() should be able to throw IOException

2012-10-24 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3789:
-

Attachment: 0003-HDFS-3789-for-branch-2.patch

Only offsets have changed compared to the trunk version. Depends on HDFS-3695 
& HDFS-3573.

> JournalManager#format() should be able to throw IOException
> ---
>
> Key: HDFS-3789
> URL: https://issues.apache.org/jira/browse/HDFS-3789
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 3.0.0
>
> Attachments: 0003-HDFS-3789-for-branch-2.patch, HDFS-3789.diff
>
>
> Currently JournalManager#format cannot throw any exception. As format can 
> fail, we should be able to propagate this failure upwards. Otherwise, format 
> will fail silently, and the admin will start using the cluster with a 
> failed/unusable journal manager.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3695) Genericize format() to non-file JournalManagers

2012-10-24 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3695:
-

Attachment: 0002-HDFS-3695-for-branch-2.patch

Only offset and conflict-resolution changes compared to trunk. No actual code 
difference.

> Genericize format() to non-file JournalManagers
> ---
>
> Key: HDFS-3695
> URL: https://issues.apache.org/jira/browse/HDFS-3695
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: QuorumJournalManager (HDFS-3077)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: 0002-HDFS-3695-for-branch-2.patch, hdfs-3695.txt, 
> hdfs-3695.txt, hdfs-3695.txt
>
>
> Currently, the "namenode -format" and "namenode -initializeSharedEdits" 
> commands do not understand how to do anything with non-file-based shared 
> storage. This affects both BookKeeperJournalManager and QuorumJournalManager.
> This JIRA is to plumb through the formatting of edits directories using 
> pluggable journal manager implementations so that no separate step needs to 
> be taken to format them -- the same commands will work for NFS-based storage 
> or one of the alternate implementations.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3695) Genericize format() to non-file JournalManagers

2012-10-24 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3695:
-

Fix Version/s: 2.0.3-alpha

> Genericize format() to non-file JournalManagers
> ---
>
> Key: HDFS-3695
> URL: https://issues.apache.org/jira/browse/HDFS-3695
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: QuorumJournalManager (HDFS-3077)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: hdfs-3695.txt, hdfs-3695.txt, hdfs-3695.txt
>
>
> Currently, the "namenode -format" and "namenode -initializeSharedEdits" 
> commands do not understand how to do anything with non-file-based shared 
> storage. This affects both BookKeeperJournalManager and QuorumJournalManager.
> This JIRA is to plumb through the formatting of edits directories using 
> pluggable journal manager implementations so that no separate step needs to 
> be taken to format them -- the same commands will work for NFS-based storage 
> or one of the alternate implementations.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3573) Supply NamespaceInfo when instantiating JournalManagers

2012-10-24 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3573:
-

Attachment: 0001-HDFS-3573-for-branch-2.patch

Patch for branch-2 is very similar to trunk, but distributed upgrade is 
removed, as per HDFS-2686.

> Supply NamespaceInfo when instantiating JournalManagers
> ---
>
> Key: HDFS-3573
> URL: https://issues.apache.org/jira/browse/HDFS-3573
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 3.0.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: 0001-HDFS-3573-for-branch-2.patch, hdfs-3573.txt, 
> hdfs-3573.txt, hdfs-3573.txt, hdfs-3573.txt
>
>
> Currently, the JournalManagers are instantiated before the NamespaceInfo is 
> loaded from local storage directories. This is problematic since the JM may 
> want to verify that the storage info associated with the journal matches the 
> NN which is starting up (eg to prevent an operator accidentally configuring 
> two clusters against the same remote journal storage). This JIRA rejiggers 
> the initialization sequence so that the JMs receive NamespaceInfo as a 
> constructor argument.
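
A minimal sketch of the rejiggered initialization from BKJM's side (the 
constructor shape and the nsInfo field are assumptions for illustration):

{code}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.server.protocol.NamespaceInfo;

public BookKeeperJournalManager(Configuration conf, URI uri,
                                NamespaceInfo nsInfo) throws IOException {
  this.nsInfo = nsInfo;
  // ... existing ZK/BK setup elided; nsInfo can now be checked against
  // the storage info recorded in the journal before any writes happen.
}
{code}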

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3573) Supply NamespaceInfo when instantiating JournalManagers

2012-10-24 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3573:
-

Fix Version/s: 2.0.3-alpha

> Supply NamespaceInfo when instantiating JournalManagers
> ---
>
> Key: HDFS-3573
> URL: https://issues.apache.org/jira/browse/HDFS-3573
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 3.0.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: 3.0.0, 2.0.3-alpha
>
> Attachments: hdfs-3573.txt, hdfs-3573.txt, hdfs-3573.txt, 
> hdfs-3573.txt
>
>
> Currently, the JournalManagers are instantiated before the NamespaceInfo is 
> loaded from local storage directories. This is problematic since the JM may 
> want to verify that the storage info associated with the journal matches the 
> NN which is starting up (eg to prevent an operator accidentally configuring 
> two clusters against the same remote journal storage). This JIRA rejiggers 
> the initialization sequence so that the JMs receive NamespaceInfo as a 
> constructor argument.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3769) standby namenode transition to active fails because starting a log segment fails on shared storage

2012-10-24 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3769:
-

Issue Type: Sub-task  (was: Bug)
Parent: HDFS-3399

> standby namenode transition to active fails because starting a log segment 
> fails on shared storage
> 
>
> Key: HDFS-3769
> URL: https://issues.apache.org/jira/browse/HDFS-3769
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: 2.0.0-alpha
> Environment: 3 datanode:158.1.132.18,158.1.132.19,160.161.0.143
> 2 namenode:158.1.131.18,158.1.132.19
> 3 zk:158.1.132.18,158.1.132.19,160.161.0.143
> 3 bookkeeper:158.1.132.18,158.1.132.19,160.161.0.143
> ensemble-size:2,quorum-size:2
>Reporter: liaowenrui
>Priority: Critical
> Fix For: 3.0.0, 2.0.3-alpha
>
>
> 2012-08-06 15:09:46,264 ERROR 
> org.apache.hadoop.contrib.bkjournal.utils.RetryableZookeeper: Node 
> /ledgers/available already exists and this is not a retry
> 2012-08-06 15:09:46,264 INFO 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager: Successfully 
> created bookie available path : /ledgers/available
> 2012-08-06 15:09:46,273 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
> unfinalized segments in 
> /opt/namenodeHa/hadoop-2.0.1/hadoop-root/dfs/name/current
> 2012-08-06 15:09:46,277 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest 
> edits from old active before taking over writer role in edits logs.
> 2012-08-06 15:09:46,363 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication 
> and invalidation queues...
> 2012-08-06 15:09:46,363 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all 
> datandoes as stale
> 2012-08-06 15:09:46,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Total number of 
> blocks= 239
> 2012-08-06 15:09:46,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of invalid 
> blocks  = 0
> 2012-08-06 15:09:46,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of 
> under-replicated blocks = 0
> 2012-08-06 15:09:46,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of  
> over-replicated blocks = 0
> 2012-08-06 15:09:46,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of blocks 
> being written= 0
> 2012-08-06 15:09:46,383 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing 
> edit logs at txnid 2354
> 2012-08-06 15:09:46,471 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 2354
> 2012-08-06 15:09:46,472 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 
> 2354 failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@4eda1515,
>  stream=null))
> java.io.IOException: We've already seen 2354. A new stream cannot be created 
> with it
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.startLogSegment(BookKeeperJournalManager.java:297)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:86)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$2.apply(JournalSet.java:182)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:319)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:179)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:894)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:268)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:618)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1322)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1230)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:990)
> at 
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at 
> org.apache.hadoop.ha.pro

[jira] [Updated] (HDFS-3908) In HA mode, when a ledger generated after the last checkpoint is missing in BK, the NN cannot restore it.

2012-10-24 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3908:
-

Issue Type: Sub-task  (was: Bug)
Parent: HDFS-3399

> In HA mode, when a ledger generated after the last checkpoint is missing in 
> BK, the NN cannot restore it.
> --
>
> Key: HDFS-3908
> URL: https://issues.apache.org/jira/browse/HDFS-3908
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.1-alpha
>Reporter: Han Xiao
>
> Outside of HA, when the number of edits dirs is larger than 1, the loss of 
> one editlog file in a dir does not cause a problem, because of the replica in 
> the other dir. However, in HA mode (using BK as shared storage), if a ledger 
> is missing, it will not be restored during NN startup even if the related 
> editlog file exists in a local dir.
> The gap persists while the NN is still in standby state. When the NN enters 
> active state, it reads the editlog file (related to the missing ledger) from 
> local storage, but the ledger after the missing one in BK can't be read at 
> that phase (because of the gap).
> Therefore, in the following situation, the editlog cannot be restored even 
> though every segment exists either in BK or in a local dir:
> 1. fsimage file: fsimage_0005946.md5
> 2. ledgers in zk:
>   \[zk: localhost:2181(CONNECTED) 0\] ls /hdfsEdit/ledgers/edits_00594
>   edits_005941_005942
>   edits_005943_005944
>   edits_005945_005946
>   edits_005949_005949
>   (missing edits_005947_005948)
> 3. editlog files in the local editlog dir:
>   \-rw-r--r-- 1 root root  30 Sep  8 03:24 edits_0005947-0005948
>   \-rw-r--r-- 1 root root 1048576 Sep  8 03:35 edits_0005950-0005950
>   \-rw-r--r-- 1 root root 1048576 Sep  8 04:42 edits_0005951-0005951
>   (missing edits_0005949-0005949)
> 4. and the seen_txid:
>   vm2:/tmp/hadoop-root/dfs/name/current # cat seen_txid
>   5949
> Here, we want to restore the editlog from txid 5946 (image) to txid 
> 5949 (seen_txid). 5947-5948 is missing in BK, and 5949-5949 is missing in 
> the local dir.
> When starting the NN, the following exception is thrown:
> 2012-09-08 06:26:10,031 FATAL 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Error encountered requiring 
> NN shutdown. Shutting down immediately.
> java.io.IOException: There appears to be a gap in the edit log.  We expected 
> txid 5949, but got txid 5950.
> at 
> org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:163)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:93)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:692)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:223)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.catchupDuringFailover(EditLogTailer.java:182)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:599)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1325)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1233)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:990)
> at 
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at 
> org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:924)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
> 

[jira] [Commented] (HDFS-3958) Integrate upgrade/finalize/rollback with external journals

2012-09-19 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458528#comment-13458528
 ] 

Ivan Kelly commented on HDFS-3958:
--

I guess an "upgrade" operation itself isn't really needed, as when you upgrade 
the NN, it merges the editlog with the image. Then the upgraded NN can use what 
looks like a newly formatted editlog. Doing this on a live system with multiple 
possible writers to the editlog could be problematic though. 

Snapshotting itself shouldn't be too difficult for us, as we just need to take 
a snapshot of the zk tree at a certain znode version.

> Integrate upgrade/finalize/rollback with external journals
> --
>
> Key: HDFS-3958
> URL: https://issues.apache.org/jira/browse/HDFS-3958
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 3.0.0
>Reporter: Todd Lipcon
>
> Currently the NameNode upgrade/rollback/finalize framework only supports 
> local storage. With edits being stored in pluggable Journals, this could 
> create certain difficulties - in particular, rollback wouldn't actually 
> rollback the external storage to the old state.
> We should look at how to expose the right hooks to the external journal 
> storage to snapshot/rollback/finalize.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-09-07 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450478#comment-13450478
 ] 

Ivan Kelly commented on HDFS-3809:
--

[~umamahesh] I think both these should be pushed to Branch-2 though. I'll ping 
todd, see if he had a particular reason for not pushing them there.

> Make BKJM use protobufs for all serialization with ZK
> -
>
> Key: HDFS-3809
> URL: https://issues.apache.org/jira/browse/HDFS-3809
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 3.0.0
>
> Attachments: HDFS-3809.diff, HDFS-3809.diff, HDFS-3809.diff
>
>
> HDFS uses protobufs for serialization in many places. Protobufs allow fields 
> to be added without breaking bc or requiring new parsing code to be written. 
> For this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-09-06 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449801#comment-13449801
 ] 

Ivan Kelly commented on HDFS-3809:
--

We use it later in HDFS-3810. I can move it to that patch, but since it's only 
a small issue, I think it'd be better to leave it here so this patch can go in.

> Make BKJM use protobufs for all serialization with ZK
> -
>
> Key: HDFS-3809
> URL: https://issues.apache.org/jira/browse/HDFS-3809
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-3809.diff, HDFS-3809.diff, HDFS-3809.diff
>
>
> HDFS uses protobufs for serialization in many places. Protobufs allow fields 
> to be added without breaking bc or requiring new parsing code to be written. 
> For this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3891) QJM: SBN fails if selectInputStreams throws RTE

2012-09-06 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449736#comment-13449736
 ] 

Ivan Kelly commented on HDFS-3891:
--

Yup, #2 is definitely better than #1. Do you remember why it wasn't this way 
from the start? It seems like it really should have been.

> QJM: SBN fails if selectInputStreams throws RTE
> ---
>
> Key: HDFS-3891
> URL: https://issues.apache.org/jira/browse/HDFS-3891
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: QuorumJournalManager (HDFS-3077)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: QuorumJournalManager (HDFS-3077)
>
> Attachments: hdfs-3891.txt, hdfs-3891.txt
>
>
> Currently, QJM's {{selectInputStream}} method throws an RTE if a quorum 
> cannot be reached. This propagates into the Standby Node and causes the whole 
> node to crash. It should handle this error appropriately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3810) Implement format() for BKJM

2012-09-06 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3810:
-

Attachment: HDFS-3810.diff

New patch addresses comments.

> Implement format() for BKJM
> ---
>
> Key: HDFS-3810
> URL: https://issues.apache.org/jira/browse/HDFS-3810
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-3810.diff, HDFS-3810.diff
>
>
> At the moment, formatting for BKJM is done on initialization. Reinitializing 
> is a manual process. This JIRA is to implement the JournalManager#format API, 
> so that BKJM can be formatted along with all other storage methods.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-09-06 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3809:
-

Attachment: HDFS-3809.diff

New patch addresses comments.

{quote}
One doubt: do we need to handle compatibility with the existing BKJM layout 
data while reading the existing ledgers?
{quote}
The current layout is only in the -alpha releases, which are marked as such 
because APIs haven't been finalized. This layout change can be considered part 
of that finalization.

{quote}
The CURRENT_INPROGRESS_LAYOUT_VERSION version check is removed from 
CurrentInprogress.java; do you think this version check is not required? In 
that case CURRENT_INPROGRESS_LAYOUT_VERSION and CONTENT_DELIMITER can also be 
removed from CurrentInprogress.java.
{quote}
Protobufs remove the need for an explicit inprogress layout version.

{quote}
In CurrentInprogressProto, why is hostName made optional? Is there any 
specific reason for it? I can see that previously the hostname was always 
present in the data.
{quote}
It's optional because it's not strictly necessary to function. It's just 
convenient for debugging. Protobuf guidelines suggest that you always make 
things optional unless they absolutely need to be there.
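
For illustration, reading such a message might look like the following minimal 
sketch. It assumes CurrentInprogressProto has a required path field alongside 
the optional hostName, and that a LOG handle is in scope; neither assumption 
is taken from the patch itself.

{code}
// Hypothetical sketch: the field names and the LOG handle are assumptions,
// not necessarily what the patch generates.
CurrentInprogressProto proto = CurrentInprogressProto.parseFrom(data);
String path = proto.getPath();      // required field, always present
if (proto.hasHostName()) {          // optional field: probe before use
  LOG.debug("inprogress node last written by " + proto.getHostName());
}
{code}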


> Make BKJM use protobufs for all serialization with ZK
> -
>
> Key: HDFS-3809
> URL: https://issues.apache.org/jira/browse/HDFS-3809
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-3809.diff, HDFS-3809.diff, HDFS-3809.diff
>
>
> HDFS uses protobufs for serialization in many places. Protobufs allow fields 
> to be added without breaking bc or requiring new parsing code to be written. 
> For this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-08-16 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3809:
-

Attachment: HDFS-3809.diff

Fixed findbugs (exclude protobuf from check)

> Make BKJM use protobufs for all serialization with ZK
> -
>
> Key: HDFS-3809
> URL: https://issues.apache.org/jira/browse/HDFS-3809
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-3809.diff, HDFS-3809.diff
>
>
> HDFS uses protobufs for serialization in many places. Protobufs allow fields 
> to be added without breaking bc or requiring new parsing code to be written. 
> For this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3810) Implement format() for BKJM

2012-08-16 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3810:
-

Status: Patch Available  (was: Open)

> Implement format() for BKJM
> ---
>
> Key: HDFS-3810
> URL: https://issues.apache.org/jira/browse/HDFS-3810
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 3.0.0
>
> Attachments: HDFS-3810.diff
>
>
> At the moment, formatting for BKJM is done on initialization. Reinitializing 
> is a manual process. This JIRA is to implement the JournalManager#format API, 
> so that BKJM can be formatted along with all other storage methods.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3810) Implement format() for BKJM

2012-08-16 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3810:
-

Attachment: HDFS-3810.diff

Patch applies over patch for HDFS-3809.

> Implement format() for BKJM
> ---
>
> Key: HDFS-3810
> URL: https://issues.apache.org/jira/browse/HDFS-3810
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-3810.diff
>
>
> At the moment, formatting for BKJM is done on initialization. Reinitializing 
> is a manual process. This JIRA is to implement the JournalManager#format API, 
> so that BKJM can be formatted along with all other storage methods.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-08-16 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3809:
-

Attachment: HDFS-3809.diff

Patch makes BKJM use NamespaceInfo also.

> Make BKJM use protobufs for all serialization with ZK
> -
>
> Key: HDFS-3809
> URL: https://issues.apache.org/jira/browse/HDFS-3809
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-3809.diff
>
>
> HDFS uses protobufs for serialization in many places. Protobufs allow fields 
> to be added without breaking bc or requiring new parsing code to be written. 
> For this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-08-16 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3809:
-

Status: Patch Available  (was: Open)

> Make BKJM use protobufs for all serialization with ZK
> -
>
> Key: HDFS-3809
> URL: https://issues.apache.org/jira/browse/HDFS-3809
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-3809.diff
>
>
> HDFS uses protobufs for serialization in many places. Protobufs allow fields 
> to be added without breaking bc or requiring new parsing code to be written. 
> For this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HDFS-3810) Implement format() for BKJM

2012-08-16 Thread Ivan Kelly (JIRA)
Ivan Kelly created HDFS-3810:


 Summary: Implement format() for BKJM
 Key: HDFS-3810
 URL: https://issues.apache.org/jira/browse/HDFS-3810
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Ivan Kelly
Assignee: Ivan Kelly
 Fix For: 3.0.0


At the moment, formatting for BKJM is done on initialization. Reinitializing is 
a manual process. This JIRA is to implement the JournalManager#format API, so 
that BKJM can be formatted along with all other storage methods.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HDFS-3809) Make BKJM use protobufs for all serialization with ZK

2012-08-16 Thread Ivan Kelly (JIRA)
Ivan Kelly created HDFS-3809:


 Summary: Make BKJM use protobufs for all serialization with ZK
 Key: HDFS-3809
 URL: https://issues.apache.org/jira/browse/HDFS-3809
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Ivan Kelly
Assignee: Ivan Kelly
 Fix For: 3.0.0


HDFS uses protobufs for serialization in many places. Protobufs allow fields to 
be added without breaking bc or requiring new parsing code to be written. For 
this reason, we should use them in BKJM also.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3789) JournalManager#format() should be able to throw IOException

2012-08-13 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3789:
-

Status: Patch Available  (was: Open)

> JournalManager#format() should be able to throw IOException
> ---
>
> Key: HDFS-3789
> URL: https://issues.apache.org/jira/browse/HDFS-3789
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-3789.diff
>
>
> Currently JournalManager#format cannot throw any exception. As format can 
> fail, we should be able to propagate this failure upwards. Otherwise, format 
> will fail silently, and the admin will start using the cluster with a 
> failed/unusable journal manager.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3789) JournalManager#format() should be able to throw IOException

2012-08-13 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3789:
-

Attachment: HDFS-3789.diff

> JournalManager#format() should be able to throw IOException
> ---
>
> Key: HDFS-3789
> URL: https://issues.apache.org/jira/browse/HDFS-3789
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-3789.diff
>
>
> Currently JournalManager#format cannot throw any exception. As format can 
> fail, we should be able to propagate this failure upwards. Otherwise, format 
> will fail silently, and the admin will start using the cluster with a 
> failed/unusable journal manager.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HDFS-3789) JournalManager#format() should be able to throw IOException

2012-08-13 Thread Ivan Kelly (JIRA)
Ivan Kelly created HDFS-3789:


 Summary: JournalManager#format() should be able to throw 
IOException
 Key: HDFS-3789
 URL: https://issues.apache.org/jira/browse/HDFS-3789
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Affects Versions: 3.0.0
Reporter: Ivan Kelly
Assignee: Ivan Kelly


Currently JournalManager#format cannot throw any exception. As format can fail, 
we should be able to propagate this failure upwards. Otherwise, format will 
fail silently, and the admin will start using the cluster with a 
failed/unusable journal manager.
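
For illustration, the change amounts to something like the sketch below (the 
interface is abridged and the parameter list is assumed, so treat it as a 
sketch rather than the final signature):

{code}
// Abridged sketch of JournalManager; only format() is shown.
interface JournalManager {
  /**
   * Format the underlying storage. Declaring IOException lets an
   * implementation surface a failed format instead of failing silently.
   */
  void format(NamespaceInfo ns) throws IOException;
}
{code}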

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3695) Genericize format() and initializeSharedEdits() to non-file JournalManagers

2012-08-09 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431784#comment-13431784
 ] 

Ivan Kelly commented on HDFS-3695:
--

New patch looks good +1

> Genericize format() and initializeSharedEdits() to non-file JournalManagers
> ---
>
> Key: HDFS-3695
> URL: https://issues.apache.org/jira/browse/HDFS-3695
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: QuorumJournalManager (HDFS-3077)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Attachments: hdfs-3695.txt, hdfs-3695.txt, hdfs-3695.txt
>
>
> Currently, the "namenode -format" and "namenode -initializeSharedEdits" 
> commands do not understand how to do anything with non-file-based shared 
> storage. This affects both BookKeeperJournalManager and QuorumJournalManager.
> This JIRA is to plumb through the formatting of edits directories using 
> pluggable journal manager implementations so that no separate step needs to 
> be taken to format them -- the same commands will work for NFS-based storage 
> or one of the alternate implementations.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3695) Genericize format() and initializeSharedEdits() to non-file JournalManagers

2012-08-08 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430939#comment-13430939
 ] 

Ivan Kelly commented on HDFS-3695:
--

New patch looks good, though I think FormatConfirmation would be better if 
called FormatConfirmable. Why didn't you make JournalManager extend 
FormatConfirmation? Surely all implementations of JournalManager will need this 
functionality.

Since this applies cleanly to trunk, will you be submitting there as well as 
the QJM branch?

> Genericize format() and initializeSharedEdits() to non-file JournalManagers
> ---
>
> Key: HDFS-3695
> URL: https://issues.apache.org/jira/browse/HDFS-3695
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: QuorumJournalManager (HDFS-3077)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Attachments: hdfs-3695.txt, hdfs-3695.txt
>
>
> Currently, the "namenode -format" and "namenode -initializeSharedEdits" 
> commands do not understand how to do anything with non-file-based shared 
> storage. This affects both BookKeeperJournalManager and QuorumJournalManager.
> This JIRA is to plumb through the formatting of edits directories using 
> pluggable journal manager implementations so that no separate step needs to 
> be taken to format them -- the same commands will work for NFS-based storage 
> or one of the alternate implementations.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3695) Genericize format() and initializeSharedEdits() to non-file JournalManagers

2012-08-01 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426616#comment-13426616
 ] 

Ivan Kelly commented on HDFS-3695:
--

Generally looks good to me.

One issue though is that there is no way to confirm with the user that they 
do in fact want to clear the journal. Perhaps you could define a confirmation 
callback which could be passed down through. 
{code}
interface FormatConfirmationCallback {
  boolean confirmFormat(URI journal);
}
{code}

Are you planning on putting this in trunk and branch-2 at the same time as 
HDFS-3077 branch? 

> Genericize format() and initializeSharedEdits() to non-file JournalManagers
> ---
>
> Key: HDFS-3695
> URL: https://issues.apache.org/jira/browse/HDFS-3695
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: QuorumJournalManager (HDFS-3077)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Attachments: hdfs-3695.txt
>
>
> Currently, the "namenode -format" and "namenode -initializeSharedEdits" 
> commands do not understand how to do anything with non-file-based shared 
> storage. This affects both BookKeeperJournalManager and QuorumJournalManager.
> This JIRA is to plumb through the formatting of edits directories using 
> pluggable journal manager implementations so that no separate step needs to 
> be taken to format them -- the same commands will work for NFS-based storage 
> or one of the alternate implementations.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3464) BKJM: Deleting currentLedger and leaving 'inprogress_x' on exceptions can throw BKNoSuchLedgerExistsException later.

2012-06-15 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295697#comment-13295697
 ] 

Ivan Kelly commented on HDFS-3464:
--

Have you seen this exception with the latest code? From what I see it's not 
possible.

{code}
try {
  String znodePath = inprogressZNode(txId);
  EditLogLedgerMetadata l = new EditLogLedgerMetadata(znodePath,
  HdfsConstants.LAYOUT_VERSION,  currentLedger.getId(), txId);
  /* Write the ledger metadata out to the inprogress ledger znode
   * This can fail if for some reason our write lock has
   * expired (@see WriteLock) and another process has managed to
   * create the inprogress znode.
   * In this case, throw an exception. We don't want to continue
   * as this would lead to a split brain situation.
   */
  l.write(zkc, znodePath);

  maxTxId.store(txId);
  ci.update(znodePath);
  return new BookKeeperEditLogOutputStream(conf, currentLedger);
} catch (KeeperException ke) {
  cleanupLedger(currentLedger);
  throw new IOException("Error storing ledger metadata", ke);
}
{code}
Only l.write() can throw a KeeperException, and in this case, it can only be a 
NodeExistsException (we should perhaps tighten the catch). In the case of 
NodeExists, we do want to delete the ledger, but not the inprogress znode. 
Otherwise, only an IOException will be thrown, and the 
BKNoSuchLedgerExistsException shouldn't happen.
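
For illustration, tightening that catch might look like the sketch below; the 
surrounding try block is the one quoted above, and everything else is 
unchanged:

{code}
// Sketch: narrow the catch to the one KeeperException expected here.
} catch (KeeperException.NodeExistsException nee) {
  // Another writer won the race for the inprogress znode: delete our
  // ledger, leave their znode alone, and fail to avoid split brain.
  cleanupLedger(currentLedger);
  throw new IOException("Error storing ledger metadata", nee);
}
{code}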

> BKJM: Deleting currentLedger and leaving 'inprogress_x'  on exceptions can 
> throw BKNoSuchLedgerExistsException later.
> -
>
> Key: HDFS-3464
> URL: https://issues.apache.org/jira/browse/HDFS-3464
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 2.0.1-alpha, 3.0.0
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>
> HDFS-3058 will clean currentLedgers on exception.
> In BookKeeperJournalManager, startLogSegment() is deleting the corresponding 
> 'inprogress_ledger' ledger on exception, leaving the 'inprogress_x' ledger 
> metadata in ZooKeeper. When the other node becomes active, it will see the 
> 'inprogress_x' znode and try to recoverLastTxId(), which would throw an 
> exception, since the 'inprogress_ledger' no longer exists. 
> {noformat}
> Caused by: 
> org.apache.bookkeeper.client.BKException$BKNoSuchLedgerExistsException
>   at 
> org.apache.bookkeeper.client.BookKeeper.openLedger(BookKeeper.java:393)
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverLastTxId(BookKeeperJournalManager.java:493)
> {noformat}
> As per the discussion in HDFS-3058, we will handle the comment as part of this 
> JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3389) Document the BKJM usage in Namenode HA.

2012-06-05 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3389:
-

Attachment: HDFS-3389.diff

Documentation looks good. I made a few modifications which I think make things 
clearer.

> Document the BKJM usage in Namenode HA.
> ---
>
> Key: HDFS-3389
> URL: https://issues.apache.org/jira/browse/HDFS-3389
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: 2.0.0-alpha, 3.0.0
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-3389.diff, HDFS-3389.patch, 
> HDFSHighAvailability.html
>
>
> As per the discussion in HDFS-234, We need clear documentation for BKJM usage 
> in Namenode HA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3441) Race condition between rolling logs at active NN and purging at standby

2012-05-31 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286588#comment-13286588
 ] 

Ivan Kelly commented on HDFS-3441:
--

Latest patch is good. +1 from me.

> Race condition between rolling logs at active NN and purging at standby
> ---
>
> Key: HDFS-3441
> URL: https://issues.apache.org/jira/browse/HDFS-3441
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 2.0.0-alpha
>Reporter: suja s
>Assignee: Rakesh R
> Attachments: HDFS-3441.1.patch, HDFS-3441.2.patch, HDFS-3441.3.patch, 
> HDFS-3441.patch
>
>
> Standby NN has got the ledger list with the list of all files, including the 
> inprogress file (say inprogress_val1).
> Active NN has done finalization and created a new inprogress file.
> When the standby proceeds further, it finds that the inprogress file which it 
> had in the list is not present, and the NN gets shut down.
> NN Logs
> =
> 2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Image file of size 201 saved in 0 seconds.
> 2012-05-17 22:15:03,874 INFO 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll 
> on remote NameNode /xx.xx.xx.102:8020
> 2012-05-17 22:15:03,923 INFO 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to 
> retain 2 images with txid >= 111
> 2012-05-17 22:15:03,923 INFO 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old 
> image 
> FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_109,
>  cpktTxId=109)
> 2012-05-17 22:15:03,961 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: purgeLogsOlderThan 0 
> failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767,
>  stream=null))
> java.io.IOException: Exception reading ledger list from zk
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531)
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:216)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /nnedits/ledgers/inprogress_72
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1113)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1142)
>   at 
> org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.read(EditLogLedgerMetadata.java:113)
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:528)
>   ... 16 more
> 2012-05-17 22:15:03,963 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> ZK Data
> 
> [zk: xx.xx.xx.55:2182(CONNECTED) 9] get /nnedits/ledgers/inprogress_74
> -40;59;11

[jira] [Commented] (HDFS-3441) Race condition between rolling logs at active NN and purging at standby

2012-05-31 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286549#comment-13286549
 ] 

Ivan Kelly commented on HDFS-3441:
--

looks good. Just needs a small tweak in the test.

{code}
+List<String> ledgerNames = new ArrayList<String>(1);
+ledgerNames.add("editlog_120_121");
{code}

ledgerNames is never used now.

> Race condition between rolling logs at active NN and purging at standby
> ---
>
> Key: HDFS-3441
> URL: https://issues.apache.org/jira/browse/HDFS-3441
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 2.0.0-alpha
>Reporter: suja s
>Assignee: Rakesh R
> Attachments: HDFS-3441.1.patch, HDFS-3441.2.patch, HDFS-3441.patch
>
>
> Standby NN has got the ledger list with the list of all files, including the 
> inprogress file (say inprogress_val1).
> Active NN has done finalization and created a new inprogress file.
> When the standby proceeds further, it finds that the inprogress file which it 
> had in the list is not present, and the NN gets shut down.
> NN Logs
> =
> 2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Image file of size 201 saved in 0 seconds.
> 2012-05-17 22:15:03,874 INFO 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll 
> on remote NameNode /xx.xx.xx.102:8020
> 2012-05-17 22:15:03,923 INFO 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to 
> retain 2 images with txid >= 111
> 2012-05-17 22:15:03,923 INFO 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old 
> image 
> FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_109,
>  cpktTxId=109)
> 2012-05-17 22:15:03,961 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: purgeLogsOlderThan 0 
> failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767,
>  stream=null))
> java.io.IOException: Exception reading ledger list from zk
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531)
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:216)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /nnedits/ledgers/inprogress_72
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1113)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1142)
>   at 
> org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.read(EditLogLedgerMetadata.java:113)
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:528)
>   ... 16 more
> 2012-05-17 22:15:03,963 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> Z

[jira] [Commented] (HDFS-3423) BKJM: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes

2012-05-31 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286546#comment-13286546
 ] 

Ivan Kelly commented on HDFS-3423:
--

lgtm +1.

> BKJM: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad 
> inProgress_ ZNodes
> 
>
> Key: HDFS-3423
> URL: https://issues.apache.org/jira/browse/HDFS-3423
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Rakesh R
>Assignee: Ivan Kelly
> Attachments: HDFS-3423.diff, HDFS-3423.diff, HDFS-3423.diff, 
> HDFS-3423.patch, HDFS-3423.patch
>
>
> Say the InProgress_000X node is corrupted because the data (version, 
> ledgerId, firstTxId) was never written to the inProgress_000X znode. 
> Namenode startup has the logic to recover all the unfinalized segments; here 
> it will try to read the segment and gets shut down.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>   throws IOException, KeeperException.NoNodeException  {
>   byte[] data = zkc.getData(path, false, null);
>   String[] parts = new String(data).split(";");
>   if (parts.length == 3)
>  reading inprogress metadata
>   else if (parts.length == 4)
>  reading inprogress metadata
>   else
> throw new IOException("Invalid ledger entry, "
>   + new String(data));
>   }
> {noformat}
> Scenario: leaving a bad inProgress_000X node
> Assume BKJM has created the inProgress_000X zNode and ZK is not available 
> when trying to add the metadata. Now, inProgress_000X ends up with partial 
> information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3423) BKJM: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes

2012-05-31 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286434#comment-13286434
 ] 

Ivan Kelly commented on HDFS-3423:
--

Generally looked good to me. 
Why did you add UnsupportedEncodingException in MaxTxId?

> BKJM: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad 
> inProgress_ ZNodes
> 
>
> Key: HDFS-3423
> URL: https://issues.apache.org/jira/browse/HDFS-3423
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Rakesh R
>Assignee: Ivan Kelly
> Attachments: HDFS-3423.diff, HDFS-3423.diff, HDFS-3423.diff, 
> HDFS-3423.patch
>
>
> Say the InProgress_000X node is corrupted because the data (version, 
> ledgerId, firstTxId) was never written to the inProgress_000X znode. 
> Namenode startup has the logic to recover all the unfinalized segments; here 
> it will try to read the segment and gets shut down.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>   throws IOException, KeeperException.NoNodeException  {
>   byte[] data = zkc.getData(path, false, null);
>   String[] parts = new String(data).split(";");
>   if (parts.length == 3)
>  reading inprogress metadata
>   else if (parts.length == 4)
>  reading inprogress metadata
>   else
> throw new IOException("Invalid ledger entry, "
>   + new String(data));
>   }
> {noformat}
> Scenario: leaving a bad inProgress_000X node
> Assume BKJM has created the inProgress_000X zNode and ZK is not available 
> when trying to add the metadata. Now, inProgress_000X ends up with partial 
> information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3441) Race condition between rolling logs at active NN and purging at standby

2012-05-30 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285659#comment-13285659
 ] 

Ivan Kelly commented on HDFS-3441:
--

The code of the patch looks good. I think the test should be a little lower 
level and more explicit though. It's not clear that the new code path is 
actually getting exercised. Make getLedgerList() package private, and test that 
when getChildren() returns a list of ledgers, and some of them no longer exist 
when the read occurs, you still get the others back fine. For this you would 
have to use spy() instead of mock(), and just stub out the getData() method.
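
For illustration, the stubbing could look something like this sketch 
(missingPath is a hypothetical name for the znode that disappears between 
getChildren() and getData(); the wiring of the spied client into the journal 
manager is left out):

{code}
// Sketch: spy the real ZooKeeper client so one znode "disappears"
// between getChildren() and getData(), as happens in the race.
ZooKeeper spyZkc = Mockito.spy(zkc);
Mockito.doThrow(new KeeperException.NoNodeException())
    .when(spyZkc)
    .getData(Mockito.eq(missingPath), Mockito.anyBoolean(),
             Mockito.any(Stat.class));
// getLedgerList() should then skip the missing ledger and return the rest.
{code}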

> Race condition between rolling logs at active NN and purging at standby
> ---
>
> Key: HDFS-3441
> URL: https://issues.apache.org/jira/browse/HDFS-3441
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 2.0.0-alpha
>Reporter: suja s
>Assignee: Rakesh R
> Attachments: HDFS-3441.1.patch, HDFS-3441.patch
>
>
> Standby NN has got the ledger list with the list of all files, including the 
> inprogress file (say inprogress_val1).
> Active NN has done finalization and created a new inprogress file.
> When the standby proceeds further, it finds that the inprogress file which it 
> had in the list is not present, and the NN gets shut down.
> NN Logs
> =
> 2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Image file of size 201 saved in 0 seconds.
> 2012-05-17 22:15:03,874 INFO 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll 
> on remote NameNode /xx.xx.xx.102:8020
> 2012-05-17 22:15:03,923 INFO 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to 
> retain 2 images with txid >= 111
> 2012-05-17 22:15:03,923 INFO 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old 
> image 
> FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_109,
>  cpktTxId=109)
> 2012-05-17 22:15:03,961 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: purgeLogsOlderThan 0 
> failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767,
>  stream=null))
> java.io.IOException: Exception reading ledger list from zk
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531)
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:216)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /nnedits/ledgers/inprogress_72
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1113)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1142)
>   at 
> org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.rea

[jira] [Updated] (HDFS-3474) Cleanup Exception handling in BookKeeper journal manager

2012-05-30 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3474:
-

Attachment: HDFS-3474.diff

Fixed typo.

> Cleanup Exception handling in BookKeeper journal manager
> 
>
> Key: HDFS-3474
> URL: https://issues.apache.org/jira/browse/HDFS-3474
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-3474.diff, HDFS-3474.diff
>
>
> There are a couple of instances of:
> {code}
> try {
>   ...
> } catch (Exception e) {
>   ...
> }
> {code}
> We should be catching the specific exceptions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3470) [Log Improvement]: Error logs logged at the time of switch which is misleading

2012-05-30 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285567#comment-13285567
 ] 

Ivan Kelly commented on HDFS-3470:
--

This occurs because of the way we open ledgers in bookkeeper. The bookies keep 
a record of the last entry which has been confirmed at the client (the client 
piggybacks this on newer entries). A client coming to open a ledger which 
hasn't been closed will get the last confirmed entry, and then read forward, 
one entry at a time, to find the last entry.

Perhaps we should change the error to a warn, or move where the bad read is 
logged. In a normal read it's bad, but for recovery it's expected. Moving this 
to BookKeeper in any case, as it can't be changed on the HDFS side.
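
Schematically, the recovery read behaves like the sketch below. This is purely 
illustrative: the real logic lives inside the BookKeeper client, and readEntry 
here is a stand-in for a quorum read that returns null once no bookie holds 
the entry.

{code}
// Illustrative only, not BookKeeper internals.
static long findLastEntry(long lastAddConfirmed,
                          java.util.function.LongFunction<byte[]> readEntry) {
  long next = lastAddConfirmed + 1;
  while (readEntry.apply(next) != null) {
    next++;   // keep reading forward one entry at a time
  }
  // The read that just failed is what currently gets logged as an ERROR,
  // even though hitting the end of the ledger is expected during recovery.
  return next - 1;
}
{code}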

> [Log Improvement]: Error logs logged at the time of switch which is misleading
> --
>
> Key: HDFS-3470
> URL: https://issues.apache.org/jira/browse/HDFS-3470
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: suja s
>Priority: Minor
>
> For every switch the following error is logged at NN side
> 2012-05-17 11:19:46,021 ERROR org.apache.bookkeeper.client.PendingReadOp: 
> Error: No such entry while reading entry: 1 ledgerId: 16 from bookie: 
> /XX.XX.XX.55:3182
> 2012-05-17 11:19:46,022 ERROR org.apache.bookkeeper.client.PendingReadOp: 
> Error: No such entry while reading entry: 1 ledgerId: 16 from bookie: 
> /XX.XX.XX.55:3183
> 2012-05-17 11:19:46,022 ERROR org.apache.bookkeeper.client.PendingReadOp: 
> Error: No such entry while reading entry: 1 ledgerId: 16 from bookie: 
> /XX.XX.XX.55:3181
> {noformat}
> 2012-05-17 11:19:45,891 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services 
> started for standby state
> 2012-05-17 11:19:45,893 WARN 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer 
> interrupted
> java.lang.InterruptedException: sleep interrupted
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:329)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:274)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:291)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:287)
> 2012-05-17 11:19:45,895 INFO 
> org.apache.bookkeeper.proto.PerChannelBookieClient: Disconnected from bookie: 
> /XX.XX.XX.55:3182
> 2012-05-17 11:19:45,895 INFO 
> org.apache.bookkeeper.proto.PerChannelBookieClient: Disconnected from bookie: 
> /XX.XX.XX.55:3181
> 2012-05-17 11:19:45,913 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x1375944cf250027 closed
> 2012-05-17 11:19:45,913 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> 2012-05-17 11:19:45,913 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services 
> required for active state
> 2012-05-17 11:19:45,913 INFO org.apache.zookeeper.ZooKeeper: Initiating 
> client connection, connectString=XX.XX.XX.55:2182 sessionTimeout=3000 
> watcher=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager$ZkConnectionWatcher@734d246
> 2012-05-17 11:19:45,914 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server /XX.XX.XX.55:2182
> 2012-05-17 11:19:45,915 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to HOST-10-18-52-55/XX.XX.XX.55:2182, initiating 
> session
> 2012-05-17 11:19:45,947 INFO org.apache.zookeeper.ClientCnxn: Session 
> establishment complete on server HOST-10-18-52-55/XX.XX.XX.55:2182, sessionid 
> = 0x1375944cf250029, negotiated timeout = 4000
> 2012-05-17 11:19:45,994 INFO 
> org.apache.bookkeeper.proto.PerChannelBookieClient: Successfully connected to 
> bookie: /XX.XX.XX.55:3181
> 2012-05-17 11:19:45,996 INFO 
> org.apache.bookkeeper.proto.PerChannelBookieClient: Successfully connected to 
> bookie: /XX.XX.XX.55:3183
> 2012-05-17 11:19:46,001 INFO 
> org.apache.bookkeeper.proto.PerChannelBookieClient: Successfully connected to 
> bookie: /XX.XX.XX.55:3182
> 2012-05-17 11:19:46,021 ERROR org.apache.bookkeeper.client.PendingReadOp: 
> Error: No such entry while reading entry: 1 ledgerId: 16 from bookie: 
> /XX.XX.XX.55:3182
> 2012-05-17 11:19:46,022 ERROR org.apache.bookkeeper.client.PendingReadOp: 
> Error: No such entry while reading entry: 1 ledgerId: 16 from bookie: 
> /XX.XX.XX.55:3183
> 2012-05-17 11:19:46,022 ERROR org.apache.bookkeeper.client.PendingReadOp: 
> Error: No such entry while reading entry: 1 ledgerId: 16 from bookie: 
> /XX.XX

[jira] [Updated] (HDFS-3470) [Log Improvement]: Error logs logged at the time of switch which is misleading

2012-05-30 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3470:
-

Issue Type: Bug  (was: Sub-task)
Parent: (was: HDFS-3399)

> [Log Improvement]: Error logs logged at the time of switch which is misleading
> --
>
> Key: HDFS-3470
> URL: https://issues.apache.org/jira/browse/HDFS-3470
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: suja s
>Priority: Minor
>
> For every switch the following error is logged at NN side
> 2012-05-17 11:19:46,021 ERROR org.apache.bookkeeper.client.PendingReadOp: 
> Error: No such entry while reading entry: 1 ledgerId: 16 from bookie: 
> /XX.XX.XX.55:3182
> 2012-05-17 11:19:46,022 ERROR org.apache.bookkeeper.client.PendingReadOp: 
> Error: No such entry while reading entry: 1 ledgerId: 16 from bookie: 
> /XX.XX.XX.55:3183
> 2012-05-17 11:19:46,022 ERROR org.apache.bookkeeper.client.PendingReadOp: 
> Error: No such entry while reading entry: 1 ledgerId: 16 from bookie: 
> /XX.XX.XX.55:3181
> {noformat}
> 2012-05-17 11:19:45,891 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services 
> started for standby state
> 2012-05-17 11:19:45,893 WARN 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer 
> interrupted
> java.lang.InterruptedException: sleep interrupted
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:329)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:274)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:291)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:287)
> 2012-05-17 11:19:45,895 INFO 
> org.apache.bookkeeper.proto.PerChannelBookieClient: Disconnected from bookie: 
> /XX.XX.XX.55:3182
> 2012-05-17 11:19:45,895 INFO 
> org.apache.bookkeeper.proto.PerChannelBookieClient: Disconnected from bookie: 
> /XX.XX.XX.55:3181
> 2012-05-17 11:19:45,913 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x1375944cf250027 closed
> 2012-05-17 11:19:45,913 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> 2012-05-17 11:19:45,913 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services 
> required for active state
> 2012-05-17 11:19:45,913 INFO org.apache.zookeeper.ZooKeeper: Initiating 
> client connection, connectString=XX.XX.XX.55:2182 sessionTimeout=3000 
> watcher=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager$ZkConnectionWatcher@734d246
> 2012-05-17 11:19:45,914 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server /XX.XX.XX.55:2182
> 2012-05-17 11:19:45,915 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to HOST-10-18-52-55/XX.XX.XX.55:2182, initiating 
> session
> 2012-05-17 11:19:45,947 INFO org.apache.zookeeper.ClientCnxn: Session 
> establishment complete on server HOST-10-18-52-55/XX.XX.XX.55:2182, sessionid 
> = 0x1375944cf250029, negotiated timeout = 4000
> 2012-05-17 11:19:45,994 INFO 
> org.apache.bookkeeper.proto.PerChannelBookieClient: Successfully connected to 
> bookie: /XX.XX.XX.55:3181
> 2012-05-17 11:19:45,996 INFO 
> org.apache.bookkeeper.proto.PerChannelBookieClient: Successfully connected to 
> bookie: /XX.XX.XX.55:3183
> 2012-05-17 11:19:46,001 INFO 
> org.apache.bookkeeper.proto.PerChannelBookieClient: Successfully connected to 
> bookie: /XX.XX.XX.55:3182
> 2012-05-17 11:19:46,021 ERROR org.apache.bookkeeper.client.PendingReadOp: 
> Error: No such entry while reading entry: 1 ledgerId: 16 from bookie: 
> /XX.XX.XX.55:3182
> 2012-05-17 11:19:46,022 ERROR org.apache.bookkeeper.client.PendingReadOp: 
> Error: No such entry while reading entry: 1 ledgerId: 16 from bookie: 
> /XX.XX.XX.55:3183
> 2012-05-17 11:19:46,022 ERROR org.apache.bookkeeper.client.PendingReadOp: 
> Error: No such entry while reading entry: 1 ledgerId: 16 from bookie: 
> /XX.XX.XX.55:3181
> 2012-05-17 11:19:46,045 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
> unfinalized segments in 
> /home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current
> 2012-05-17 11:19:46,049 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits 
> file 
> /home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/edits_inprogress_036
>  -> 
> /home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/edits_036-036
> 2012-05-17 11:19:46,049 INFO 
> or

[jira] [Commented] (HDFS-3468) Make BKJM-ZK session timeout configurable.

2012-05-30 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285563#comment-13285563
 ] 

Ivan Kelly commented on HDFS-3468:
--

lgtm +1


> Make BKJM-ZK session timeout configurable.
> --
>
> Key: HDFS-3468
> URL: https://issues.apache.org/jira/browse/HDFS-3468
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 2.0.0-alpha
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-3468.patch
>
>
> Currently BKJM-ZK session timeout has been hardcoded to 3000ms.
> Considering that ZKFC timeouts are configurable, we should make this one 
> configurable as well.
> Unfortunately we cannot compare the two values at runtime, because they 
> belong to different processes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3423) BookKeeperJournalManager: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes

2012-05-30 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3423:
-

Attachment: HDFS-3423.diff

Rebased on trunk. This patch clashes with HDFS-3474, so whichever goes in 
first, the other will have to be rebased.

> BookKeeperJournalManager: NN startup is failing, when tries to 
> recoverUnfinalizedSegments() a bad inProgress_ ZNodes
> 
>
> Key: HDFS-3423
> URL: https://issues.apache.org/jira/browse/HDFS-3423
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Rakesh R
>Assignee: Ivan Kelly
> Attachments: HDFS-3423.diff, HDFS-3423.diff, HDFS-3423.diff
>
>
> Say the InProgress_000X node is corrupted because the data (version, 
> ledgerId, firstTxId) was never written to the inProgress_000X znode. 
> Namenode startup has the logic to recover all the unfinalized segments; here 
> it will try to read the segment and gets shut down.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>   throws IOException, KeeperException.NoNodeException  {
>   byte[] data = zkc.getData(path, false, null);
>   String[] parts = new String(data).split(";");
>   if (parts.length == 3)
>  reading inprogress metadata
>   else if (parts.length == 4)
>  reading inprogress metadata
>   else
> throw new IOException("Invalid ledger entry, "
>   + new String(data));
>   }
> {noformat}
> Scenario: leaving a bad inProgress_000X node
> Assume BKJM has created the inProgress_000X zNode and ZK is not available 
> when trying to add the metadata. Now, inProgress_000X ends up with partial 
> information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3474) Cleanup Exception handling in BookKeeper journal manager

2012-05-30 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3474:
-

Attachment: HDFS-3474.diff

Patch to replace "catch (Exception e)" with specific exceptions

> Cleanup Exception handling in BookKeeper journal manager
> 
>
> Key: HDFS-3474
> URL: https://issues.apache.org/jira/browse/HDFS-3474
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-3474.diff
>
>
> There are a couple of instances of:
> {code}
> try {
>   ...
> } catch (Exception e) {
>   ...
> }
> {code}
> We should be catching the specific exceptions.
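As a hedged illustration of the cleanup proposed here (the ZooKeeper call is 
just an example site, not taken from the patch):

{code}
// Before: catch (Exception e) swallows everything, including bugs.
// After: catch only what the block can actually throw.
try {
  zkc.getData(path, false, null);
} catch (KeeperException ke) {
  throw new IOException("Error accessing zookeeper", ke);
} catch (InterruptedException ie) {
  Thread.currentThread().interrupt();
  throw new IOException("Interrupted accessing zookeeper", ie);
}
{code}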

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3474) Cleanup Exception handling in BookKeeper journal manager

2012-05-30 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3474:
-

Status: Patch Available  (was: Open)

> Cleanup Exception handling in BookKeeper journal manager
> 
>
> Key: HDFS-3474
> URL: https://issues.apache.org/jira/browse/HDFS-3474
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-3474.diff
>
>
> There are a couple of instances of:
> {code}
> try {
>   ...
> } catch (Exception e) {
>   ...
> }
> {code}
> We should be catching the specific exceptions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HDFS-3474) Cleanup Exception handling in BookKeeper journal manager

2012-05-30 Thread Ivan Kelly (JIRA)
Ivan Kelly created HDFS-3474:


 Summary: Cleanup Exception handling in BookKeeper journal manager
 Key: HDFS-3474
 URL: https://issues.apache.org/jira/browse/HDFS-3474
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Ivan Kelly
Assignee: Ivan Kelly


There are a couple of instances of:

{code}
try {
  ...
} catch (Exception e) {
  ...
}
{code}

We should be catching the specific exceptions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3452) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-29 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284939#comment-13284939
 ] 

Ivan Kelly commented on HDFS-3452:
--

lgtm +1

> BKJM:Switch from standby to active fails and NN gets shut down due to delay 
> in clearing of lock
> ---
>
> Key: HDFS-3452
> URL: https://issues.apache.org/jira/browse/HDFS-3452
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 2.0.0-alpha
>Reporter: suja s
>Assignee: Uma Maheswara Rao G
>Priority: Blocker
> Attachments: BK-253-BKJM.patch, HDFS-3452-1.patch, HDFS-3452-2.patch, 
> HDFS-3452.patch, HDFS-3452.patch
>
>
> Normal switch fails. 
> (The BKJournalManager ZK session timeout is 3000ms and the ZKFC session 
> timeout is 5000ms. By the time control comes to acquire the lock, the 
> previous lock has not yet been released, so lock acquisition by the NN fails 
> and the NN gets shut down. Ideally the old lock should have been released.)
> =
> 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
> Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
> already has it
> 2012-05-09 20:15:29,732 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
> recoverUnfinalizedSegments failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
>  stream=null))
> java.io.IOException: Could not acquire lock
> at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
> at 
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at 
> org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
> 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
> Scenario:
> Start ZKFCS, NNs
> NN1 is active and NN2 is standby
> Stop NN1. NN2 tries to transition to active and gets shut down

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3452) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-29 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284805#comment-13284805
 ] 

Ivan Kelly commented on HDFS-3452:
--

For #1, what you have is good.



> BKJM:Switch from standby to active fails and NN gets shut down due to delay 
> in clearing of lock
> ---
>
> Key: HDFS-3452
> URL: https://issues.apache.org/jira/browse/HDFS-3452
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 2.0.0-alpha
>Reporter: suja s
>Assignee: Uma Maheswara Rao G
>Priority: Blocker
> Attachments: BK-253-BKJM.patch, HDFS-3452-1.patch, HDFS-3452.patch, 
> HDFS-3452.patch
>
>
> Normal switch fails. 
> (The BKJournalManager ZK session timeout is 3000ms and the ZKFC session 
> timeout is 5000ms. By the time control comes to acquire the lock, the 
> previous lock has not yet been released, so lock acquisition by the NN fails 
> and the NN gets shut down. Ideally the old lock should have been released.)
> =
> 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
> Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
> already has it
> 2012-05-09 20:15:29,732 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
> recoverUnfinalizedSegments failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
>  stream=null))
> java.io.IOException: Could not acquire lock
> at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
> at 
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at 
> org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
> 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
> Scenario:
> Start ZKFCS, NNs
> NN1 is active and NN2 is standby
> Stop NN1. NN2 tries to transition to active and gets shut down

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3452) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-29 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284727#comment-13284727
 ] 

Ivan Kelly commented on HDFS-3452:
--

The new patch looks almost ready to go. I have a couple of comments though.

# In finalizeLogSegment, ci.clear() is in a finally block. This means that the 
currentInprogress is cleared even if finalization fails. I think 
finalizeLogSegment should call ci.read() and only call clear if the inprogress 
znode it is finalizing matches (see the sketch below).
# clear() should use versionNumberForPermission in its setData.
# read() should check the layout version and fail if it's greater than the 
current layout version. Also, I think the currentInprogress layout should have 
its own layout version, rather than using the BKJM one.
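
A minimal sketch of the guarded clear suggested in point 1, using the 
ci.read()/ci.clear() accessors mentioned above (the inprogressZnodePath 
variable is illustrative):

{code}
// Only wipe the currentInprogress record if it still refers to the
// segment being finalized; otherwise another writer now owns it.
String current = ci.read();
if (current != null && current.equals(inprogressZnodePath)) {
  ci.clear();
}
{code}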

> BKJM:Switch from standby to active fails and NN gets shut down due to delay 
> in clearing of lock
> ---
>
> Key: HDFS-3452
> URL: https://issues.apache.org/jira/browse/HDFS-3452
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 2.0.0-alpha
>Reporter: suja s
>Assignee: Uma Maheswara Rao G
>Priority: Blocker
> Attachments: BK-253-BKJM.patch, HDFS-3452-1.patch, HDFS-3452.patch, 
> HDFS-3452.patch
>
>
> Normal switch fails. 
> (The BKJournalManager ZK session timeout is 3000ms and the ZKFC session 
> timeout is 5000ms. By the time control comes to acquire the lock, the 
> previous lock has not yet been released, so lock acquisition by the NN fails 
> and the NN gets shut down. Ideally the old lock should have been released.)
> =
> 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
> Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
> already has it
> 2012-05-09 20:15:29,732 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
> recoverUnfinalizedSegments failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
>  stream=null))
> java.io.IOException: Could not acquire lock
> at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
> at 
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at 
> org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
> 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
> Scenario:
> Start ZKFCS, NNs
> NN1 is active and NN2 is standby
> Stop NN1. NN2 tries to transition to active and gets shut down

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA

[jira] [Commented] (HDFS-3452) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-25 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283547#comment-13283547
 ] 

Ivan Kelly commented on HDFS-3452:
--

Patch looks good, Uma. A few comments.

# The version in the data should be a format version, not the znode version, 
in case we wish to change the data format in future.
# The creation of the inprogress znode should catch a NodeExists exception for 
the case of two nodes starting at once.
# In javadoc, @update should be #update.
# "Already inprogress node exists" -> "Inprogress node already exists"
# TestBookKeeperJournalManager#testAllBookieFailure: you need to add 
bkjm.recoverUnfinalizedSegments() before the failing startLogSegment.
# TestBookKeeperAsHASharedDir#testMultiplePrimariesStarted: this needs to be 
changed. The fix is simple though: now that the locking has changed, it's the 
NN that was previously working which dies, not the new one trying to start. 
Code below:

{code}
  @Test
  public void testMultiplePrimariesStarted() throws Exception {
Runtime mockRuntime1 = mock(Runtime.class);
Runtime mockRuntime2 = mock(Runtime.class);
Path p1 = new Path("/testBKJMMultiplePrimary");
Path p2 = new Path("/testBKJMMultiplePrimary2");

MiniDFSCluster cluster = null;
try {
  Configuration conf = new Configuration();
  conf.setInt(DFSConfigKeys.DFS_HA_TAILEDITS_PERIOD_KEY, 1);
  conf.set(DFSConfigKeys.DFS_NAMENODE_SHARED_EDITS_DIR_KEY,
   BKJMUtil.createJournalURI("/hotfailoverMultiple").toString());
  BKJMUtil.addJournalManagerDefinition(conf);

  cluster = new MiniDFSCluster.Builder(conf)
.nnTopology(MiniDFSNNTopology.simpleHATopology())
.numDataNodes(0)
.manageNameDfsSharedDirs(false)
.build();
  NameNode nn1 = cluster.getNameNode(0);
  NameNode nn2 = cluster.getNameNode(1);
  FSEditLogTestUtil.setRuntimeForEditLog(nn1, mockRuntime1);
  FSEditLogTestUtil.setRuntimeForEditLog(nn2, mockRuntime2);
  cluster.waitActive();
  cluster.transitionToActive(0);

  FileSystem fs = HATestUtil.configureFailoverFs(cluster, conf);
  fs.mkdirs(p1);
  nn1.getRpcServer().rollEditLog();
  // The second NN becomes active without the first stepping down, so
  // both NNs now believe they own the shared edit log.
  cluster.transitionToActive(1);

  verify(mockRuntime1, times(0)).exit(anyInt());
  fs.mkdirs(p2);

  // With the new locking it is the previously-active NN that gets
  // fenced and dies; the newly-active NN keeps running.
  verify(mockRuntime1, atLeastOnce()).exit(anyInt());
  verify(mockRuntime2, times(0)).exit(anyInt());

} finally {
  if (cluster != null) {
cluster.shutdown();
  }
}
  }
{code}

Other than that, I think this is ready to go. Good work :)

> BKJM:Switch from standby to active fails and NN gets shut down due to delay 
> in clearing of lock
> ---
>
> Key: HDFS-3452
> URL: https://issues.apache.org/jira/browse/HDFS-3452
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: suja s
>Assignee: Uma Maheswara Rao G
>Priority: Blocker
> Attachments: BK-253-BKJM.patch, HDFS-3452.patch, HDFS-3452.patch
>
>
> Normal switch fails. 
> (The BKJournalManager ZK session timeout is 3000ms and the ZKFC session 
> timeout is 5000ms. By the time control comes to acquire the lock, the 
> previous lock has not yet been released, so lock acquisition by the NN fails 
> and the NN gets shut down. Ideally the old lock should have been released.)
> =
> 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
> Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
> already has it
> 2012-05-09 20:15:29,732 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
> recoverUnfinalizedSegments failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
>  stream=null))
> java.io.IOException: Could not acquire lock
> at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:6

[jira] [Commented] (HDFS-3441) Race condition between rolling logs at active NN and purging at standby

2012-05-22 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280987#comment-13280987
 ] 

Ivan Kelly commented on HDFS-3441:
--

Patch looks good to me. It would be good to have some testing though.
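
As context for the fix being reviewed, a hedged sketch of one way to tolerate 
the race (EditLogLedgerMetadata.read is the accessor quoted in the description 
below; the surrounding loop is illustrative, not the actual patch):

{code}
List<EditLogLedgerMetadata> ledgers = new ArrayList<EditLogLedgerMetadata>();
for (String child : zkc.getChildren(ledgerPath, false)) {
  try {
    ledgers.add(EditLogLedgerMetadata.read(zkc, ledgerPath + "/" + child));
  } catch (KeeperException.NoNodeException nne) {
    // The active NN finalized and removed this inprogress znode between
    // the list and the read; skip it instead of shutting the standby down.
  }
}
{code}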

> Race condition between rolling logs at active NN and purging at standby
> ---
>
> Key: HDFS-3441
> URL: https://issues.apache.org/jira/browse/HDFS-3441
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: suja s
> Attachments: HDFS-3441.patch
>
>
> The standby NN has fetched the ledger list containing all files, including 
> the inprogress file (say inprogress_val1).
> The active NN has since finalized that segment and created a new inprogress 
> file.
> When the standby proceeds, it finds that the inprogress file it had in the 
> list no longer exists, and the NN gets shut down.
> NN Logs
> =
> 2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Image file of size 201 saved in 0 seconds.
> 2012-05-17 22:15:03,874 INFO 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll 
> on remote NameNode /xx.xx.xx.102:8020
> 2012-05-17 22:15:03,923 INFO 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to 
> retain 2 images with txid >= 111
> 2012-05-17 22:15:03,923 INFO 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old 
> image 
> FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_109,
>  cpktTxId=109)
> 2012-05-17 22:15:03,961 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: purgeLogsOlderThan 0 
> failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767,
>  stream=null))
> java.io.IOException: Exception reading ledger list from zk
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531)
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:216)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /nnedits/ledgers/inprogress_72
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1113)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1142)
>   at 
> org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.read(EditLogLedgerMetadata.java:113)
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:528)
>   ... 16 more
> 2012-05-17 22:15:03,963 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> ZK Data
> 
> [zk: xx.xx.xx.55:2182(CONNECTED) 9] get /nnedits/ledgers/inprogress_74
> -40;59;116
> cZxid = 0x2be
> ctime = Thu May 17 22:15:03 IST 2012
> mZxid = 0x2be
> mtime = Thu May 

[jira] [Commented] (HDFS-3423) BookKeeperJournalManager: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes

2012-05-22 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280967#comment-13280967
 ] 

Ivan Kelly commented on HDFS-3423:
--

How would you change it though? I don't see a clean way to avoid the store 
while keeping the same clarity.

> BookKeeperJournalManager: NN startup is failing, when tries to 
> recoverUnfinalizedSegments() a bad inProgress_ ZNodes
> 
>
> Key: HDFS-3423
> URL: https://issues.apache.org/jira/browse/HDFS-3423
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Rakesh R
>Assignee: Ivan Kelly
> Attachments: HDFS-3423.diff, HDFS-3423.diff
>
>
> Say the inProgress_000X node is corrupted because the data (version, 
> ledgerId, firstTxId) was never written to the inProgress_000X znode. NameNode 
> startup has logic to recover all unfinalized segments; here it will try to 
> read the segment and get shut down.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>   throws IOException, KeeperException.NoNodeException  {
>   byte[] data = zkc.getData(path, false, null);
>   String[] parts = new String(data).split(";");
>   if (parts.length == 3)
>  reading inprogress metadata
>   else if (parts.length == 4)
>  reading inprogress metadata
>   else
> throw new IOException("Invalid ledger entry, "
>   + new String(data));
>   }
> {noformat}
> Scenario: how is a bad inProgress_000X node left behind?
> Assume BKJM has created the inProgress_000X znode and ZK becomes unavailable 
> when it tries to add the metadata. inProgress_000X then ends up with partial 
> information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3386) BK JM : Namenode is not deleting his lock entry '/ledgers/lock/lock-0000X', when fails to acquire the lock

2012-05-22 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280964#comment-13280964
 ] 

Ivan Kelly commented on HDFS-3386:
--

HDFS-3452 will make this unnecessary. 

> BK JM : Namenode is not deleting his lock entry '/ledgers/lock/lock-X', 
> when fails to acquire the lock
> --
>
> Key: HDFS-3386
> URL: https://issues.apache.org/jira/browse/HDFS-3386
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Reporter: surendra singh lilhore
>Assignee: Ivan Kelly
>Priority: Minor
> Fix For: 2.0.0, 3.0.0
>
> Attachments: HDFS-3386.diff
>
>
> When a Standby NN becomes Active, it will first create its sequential lock 
> entry lock-000X in ZK and then try to acquire the lock as shown 
> below:
> {quote}
> myznode = zkc.create(lockpath + "/lock-", new byte[] {'0'},
>  Ids.OPEN_ACL_UNSAFE,
>  CreateMode.EPHEMERAL_SEQUENTIAL);
> if ((lockpath + "/" + nodes.get(0)).equals(myznode)) {
>   if (LOG.isTraceEnabled()) {
>     LOG.trace("Lock acquired - " + myznode);
>   }
>   lockCount.set(1);
>   zkc.exists(myznode, this);
>   return;
> } else {
>   LOG.error("Failed to acquire lock with " + myznode
>       + ", " + nodes.get(0) + " already has it");
>   throw new IOException("Could not acquire lock");
> }
> {quote}
> Say the transition to active fails to acquire the lock: it will throw the 
> exception and the NN gets shut down. The problem is that the lock entry 
> lock-000X will exist in ZK until session expiry, so a subsequent start-up 
> will not be able to acquire the lock.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3452) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-22 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280956#comment-13280956
 ] 

Ivan Kelly commented on HDFS-3452:
--

Moved to HDFS.

I've thought about this a bit more. Your patch is a good start, but it actually 
does more than we need in some parts. Really, the purpose of the locking is to 
ensure that we do not add new entries without having read all previous entries. 
Locking on the creation of inprogress znodes should be enough to ensure this. 
Fencing should take care of any other cases.

startLogSegment should work as follows (see the sketch at the end of this 
comment):
 # Get the version (V) and content (C) of the writePermissions znode. C is the 
path to an inprogress znode (Z1), or null.
 # If Z1 exists, throw an exception. Otherwise proceed.
 # Create the inprogress znode (Z2) and the ledger.
 # Write the writePermissions znode with Z2 and V.

finalizeLogSegment should read writePermissions znode and null it if content 
matches the inprogress znode it is finalizing.

So, 
a) I think WritePermission should be called something more like 
CurrentInprogress.
b) The interface should be something like 
{code}
public class CurrentInprogress {
String readCurrent(); // returns current znode or null
void updateCurrent(String path) throws Exception;
void clearCurrent();
}
{code}
c) This only ever needs to be used in startLogSegment. #clearCurrent is really 
optional, but there for completeness.
d) #checkPermission is unnecessary. If something else has opened another 
inprogress znode while we are writing, it should have closed the ledger we were 
writing to, thereby fencing it, thereby stopping any further writes.
e) The actual data stored in the znode should include a version number, a 
hostname and then the path. This will make debugging easier.
f) You have some tabs.
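
Pulling steps 1-4 together with the interface in (b), a hedged sketch of the 
startLogSegment flow (createInprogressZnodeAndLedger is an illustrative 
helper, not a real method):

{code}
// Step 1: read the current record; null means no segment is in progress.
String current = ci.readCurrent();
// Step 2: if the recorded inprogress znode still exists, refuse to start.
if (current != null && zkc.exists(current, false) != null) {
  throw new IOException("Inprogress node already exists: " + current);
}
// Step 3: create the new inprogress znode and the BookKeeper ledger.
String inprogressZnode = createInprogressZnodeAndLedger(txId);
// Step 4: record the new znode; a versioned write fails if another NN raced us.
ci.updateCurrent(inprogressZnode);
{code}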

> BKJM:Switch from standby to active fails and NN gets shut down due to delay 
> in clearing of lock
> ---
>
> Key: HDFS-3452
> URL: https://issues.apache.org/jira/browse/HDFS-3452
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: suja s
>Assignee: Uma Maheswara Rao G
>Priority: Blocker
> Attachments: BK-253-BKJM.patch
>
>
> Normal switch fails. 
> (The BKJournalManager ZK session timeout is 3000ms and the ZKFC session 
> timeout is 5000ms. By the time control comes to acquire the lock, the 
> previous lock has not yet been released, so lock acquisition by the NN fails 
> and the NN gets shut down. Ideally the old lock should have been released.)
> =
> 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
> Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
> already has it
> 2012-05-09 20:15:29,732 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
> recoverUnfinalizedSegments failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
>  stream=null))
> java.io.IOException: Could not acquire lock
> at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
> at 
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at 
> org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
> at org.apache.ha

[jira] [Updated] (HDFS-3452) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-22 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3452:
-

Issue Type: Sub-task  (was: Bug)
Parent: HDFS-3399

> BKJM:Switch from standby to active fails and NN gets shut down due to delay 
> in clearing of lock
> ---
>
> Key: HDFS-3452
> URL: https://issues.apache.org/jira/browse/HDFS-3452
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: suja s
>Assignee: Uma Maheswara Rao G
>Priority: Blocker
> Attachments: BK-253-BKJM.patch
>
>
> Normal switch fails. 
> (The BKJournalManager ZK session timeout is 3000ms and the ZKFC session 
> timeout is 5000ms. By the time control comes to acquire the lock, the 
> previous lock has not yet been released, so lock acquisition by the NN fails 
> and the NN gets shut down. Ideally the old lock should have been released.)
> =
> 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
> Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
> already has it
> 2012-05-09 20:15:29,732 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
> recoverUnfinalizedSegments failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
>  stream=null))
> java.io.IOException: Could not acquire lock
> at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
> at 
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at 
> org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
> 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
> Scenario:
> Start ZKFCS, NNs
> NN1 is active and NN2 is standby
> Stop NN1. NN2 tries to transition to active and gets shut down

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Moved] (HDFS-3452) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

2012-05-22 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly moved BOOKKEEPER-253 to HDFS-3452:
-

 Component/s: (was: bookkeeper-client)
Target Version/s: 2.0.0, 3.0.0
 Key: HDFS-3452  (was: BOOKKEEPER-253)
 Project: Hadoop HDFS  (was: Bookkeeper)

> BKJM:Switch from standby to active fails and NN gets shut down due to delay 
> in clearing of lock
> ---
>
> Key: HDFS-3452
> URL: https://issues.apache.org/jira/browse/HDFS-3452
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: suja s
>Assignee: Uma Maheswara Rao G
>Priority: Blocker
> Attachments: BK-253-BKJM.patch
>
>
> Normal switch fails. 
> (The BKJournalManager ZK session timeout is 3000ms and the ZKFC session 
> timeout is 5000ms. By the time control comes to acquire the lock, the 
> previous lock has not yet been released, so lock acquisition by the NN fails 
> and the NN gets shut down. Ideally the old lock should have been released.)
> =
> 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
> Failed to acquire lock with /ledgers/lock/lock-07, lock-06 
> already has it
> 2012-05-09 20:15:29,732 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
> recoverUnfinalizedSegments failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
>  stream=null))
> java.io.IOException: Could not acquire lock
> at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
> at 
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at 
> org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
> 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
> Scenario:
> Start ZKFCS, NNs
> NN1 is active and NN2 is standby
> Stop NN1. NN2 tries to transition to active and gets shut down

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3423) BookKeeperJournalManager: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes

2012-05-18 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278917#comment-13278917
 ] 

Ivan Kelly commented on HDFS-3423:
--

@Rakesh
In the case of a primary NN crashing, the maxTxId.store() is necessary on 
recovery, to ensure that the value is correct. The case where the #store is 
unnecessary (i.e. a crash during finalization) is much rarer than the case 
where it is necessary (i.e. a crash while writing).

> BookKeeperJournalManager: NN startup is failing, when tries to 
> recoverUnfinalizedSegments() a bad inProgress_ ZNodes
> 
>
> Key: HDFS-3423
> URL: https://issues.apache.org/jira/browse/HDFS-3423
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Rakesh R
>Assignee: Ivan Kelly
> Attachments: HDFS-3423.diff, HDFS-3423.diff
>
>
> Say the inProgress_000X node is corrupted because the data (version, 
> ledgerId, firstTxId) was never written to the inProgress_000X znode. NameNode 
> startup has logic to recover all unfinalized segments; here it will try to 
> read the segment and get shut down.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>   throws IOException, KeeperException.NoNodeException  {
>   byte[] data = zkc.getData(path, false, null);
>   String[] parts = new String(data).split(";");
>   if (parts.length == 3)
>  reading inprogress metadata
>   else if (parts.length == 4)
>  reading inprogress metadata
>   else
> throw new IOException("Invalid ledger entry, "
>   + new String(data));
>   }
> {noformat}
> Scenario: how is a bad inProgress_000X node left behind?
> Assume BKJM has created the inProgress_000X znode and ZK becomes unavailable 
> when it tries to add the metadata. inProgress_000X then ends up with partial 
> information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3423) BookKeeperJournalManager: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes

2012-05-18 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3423:
-

Attachment: HDFS-3423.diff

New patch fixes the findbugs warnings.

> BookKeeperJournalManager: NN startup is failing, when tries to 
> recoverUnfinalizedSegments() a bad inProgress_ ZNodes
> 
>
> Key: HDFS-3423
> URL: https://issues.apache.org/jira/browse/HDFS-3423
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Rakesh R
>Assignee: Ivan Kelly
> Attachments: HDFS-3423.diff, HDFS-3423.diff
>
>
> Say the inProgress_000X node is corrupted because the data (version, 
> ledgerId, firstTxId) was never written to the inProgress_000X znode. NameNode 
> startup has logic to recover all unfinalized segments; here it will try to 
> read the segment and get shut down.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>   throws IOException, KeeperException.NoNodeException  {
>   byte[] data = zkc.getData(path, false, null);
>   String[] parts = new String(data).split(";");
>   if (parts.length == 3)
>  reading inprogress metadata
>   else if (parts.length == 4)
>  reading inprogress metadata
>   else
> throw new IOException("Invalid ledger entry, "
>   + new String(data));
>   }
> {noformat}
> Scenario: how is a bad inProgress_000X node left behind?
> Assume BKJM has created the inProgress_000X znode and ZK becomes unavailable 
> when it tries to add the metadata. inProgress_000X then ends up with partial 
> information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3058) HA: Bring BookKeeperJournalManager up to date with HA changes

2012-05-18 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3058:
-

Attachment: HDFS-3058.diff

Rebased on trunk. Some fixups were required in the tests, as the way the 
NameNode exits on bad edit logs has changed.

> HA: Bring BookKeeperJournalManager up to date with HA changes
> -
>
> Key: HDFS-3058
> URL: https://issues.apache.org/jira/browse/HDFS-3058
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 0.24.0
>
> Attachments: HDFS-3058.diff, HDFS-3058.diff, HDFS-3058.diff
>
>
> There's a couple of TODO(HA) comments in the BookKeeperJournalManager code. 
> This JIRA is to address those.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3423) BookKeeperJournalManager: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes

2012-05-18 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278802#comment-13278802
 ] 

Ivan Kelly commented on HDFS-3423:
--

If the store here is for a txid older than what we currently have, nothing 
will happen. Take a look at the implementation of #store.
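
The forward-only behaviour of #store, as a hedged sketch (the real MaxTxId 
implementation may differ in detail; get() and reset() are the accessors 
referred to elsewhere on this issue):

{code}
// A store of an older txid is silently ignored, which is why the extra
// store on recovery is harmless in the finalization-crash case.
synchronized void store(long maxTxId) throws IOException {
  if (maxTxId > get()) {   // get() reads the currently recorded max txid
    reset(maxTxId);        // reset() writes the new value to the znode
  }
}
{code}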

> BookKeeperJournalManager: NN startup is failing, when tries to 
> recoverUnfinalizedSegments() a bad inProgress_ ZNodes
> 
>
> Key: HDFS-3423
> URL: https://issues.apache.org/jira/browse/HDFS-3423
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Rakesh R
>Assignee: Ivan Kelly
> Attachments: HDFS-3423.diff
>
>
> Say the inProgress_000X node is corrupted because the data (version, 
> ledgerId, firstTxId) was never written to the inProgress_000X znode. NameNode 
> startup has logic to recover all unfinalized segments; here it will try to 
> read the segment and get shut down.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>   throws IOException, KeeperException.NoNodeException  {
>   byte[] data = zkc.getData(path, false, null);
>   String[] parts = new String(data).split(";");
>   if (parts.length == 3)
>  reading inprogress metadata
>   else if (parts.length == 4)
>  reading inprogress metadata
>   else
> throw new IOException("Invalid ledger entry, "
>   + new String(data));
>   }
> {noformat}
> Scenario: how is a bad inProgress_000X node left behind?
> Assume BKJM has created the inProgress_000X znode and ZK becomes unavailable 
> when it tries to add the metadata. inProgress_000X then ends up with partial 
> information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3441) Race condition between rolling logs at active NN and purging at standby

2012-05-18 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278717#comment-13278717
 ] 

Ivan Kelly commented on HDFS-3441:
--

This solution looks good to me. Will you generate a patch?

> Race condition between rolling logs at active NN and purging at standby
> ---
>
> Key: HDFS-3441
> URL: https://issues.apache.org/jira/browse/HDFS-3441
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: suja s
>
> The standby NN has fetched the ledger list containing all files, including 
> the inprogress file (say inprogress_val1).
> The active NN has since finalized that segment and created a new inprogress 
> file.
> When the standby proceeds, it finds that the inprogress file it had in the 
> list no longer exists, and the NN gets shut down.
> NN Logs
> =
> 2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Image file of size 201 saved in 0 seconds.
> 2012-05-17 22:15:03,874 INFO 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll 
> on remote NameNode /xx.xx.xx.102:8020
> 2012-05-17 22:15:03,923 INFO 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to 
> retain 2 images with txid >= 111
> 2012-05-17 22:15:03,923 INFO 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old 
> image 
> FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_109,
>  cpktTxId=109)
> 2012-05-17 22:15:03,961 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: purgeLogsOlderThan 0 
> failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767,
>  stream=null))
> java.io.IOException: Exception reading ledger list from zk
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531)
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:216)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /nnedits/ledgers/inprogress_72
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1113)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1142)
>   at 
> org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.read(EditLogLedgerMetadata.java:113)
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:528)
>   ... 16 more
> 2012-05-17 22:15:03,963 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> ZK Data
> 
> [zk: xx.xx.xx.55:2182(CONNECTED) 9] get /nnedits/ledgers/inprogress_74
> -40;59;116
> cZxid = 0x2be
> ctime = Thu May 17 22:15:03 IST 2012
> mZxid = 0x2be
> mtime = Thu May 17 22:15:03 IST 2012
> pZxid = 0x2be
> cversion = 0

[jira] [Commented] (HDFS-3423) BookKeeperJournalManager: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes

2012-05-17 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277703#comment-13277703
 ] 

Ivan Kelly commented on HDFS-3423:
--

This can't happen, as maxid always grows unless it is reset(), and reset() is 
only ever called with maxid.get()-1. That situation will only occur when the 
last inprogress znode points to a ledger entry.

> BookKeeperJournalManager: NN startup is failing, when tries to 
> recoverUnfinalizedSegments() a bad inProgress_ ZNodes
> 
>
> Key: HDFS-3423
> URL: https://issues.apache.org/jira/browse/HDFS-3423
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Rakesh R
>Assignee: Ivan Kelly
> Attachments: HDFS-3423.diff
>
>
> Say the inProgress_000X node is corrupted because the data (version, 
> ledgerId, firstTxId) was never written to the inProgress_000X znode. NameNode 
> startup has logic to recover all unfinalized segments; here it will try to 
> read the segment and get shut down.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>   throws IOException, KeeperException.NoNodeException  {
>   byte[] data = zkc.getData(path, false, null);
>   String[] parts = new String(data).split(";");
>   if (parts.length == 3)
>  reading inprogress metadata
>   else if (parts.length == 4)
>  reading inprogress metadata
>   else
> throw new IOException("Invalid ledger entry, "
>   + new String(data));
>   }
> {noformat}
> Scenario:- Leaving bad inProgress_000X node ?
> Assume BKJM has created the inProgress_000X zNode and ZK is not available 
> when trying to add the metadata. Now, inProgress_000X ends up with partial 
> information.
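
For illustration, a defensive variant of the read above could look like the 
following sketch. This is not the attached patch; the class and exception 
names are hypothetical.
{noformat}
import java.io.IOException;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

class LedgerMetadataReader {
  // Hypothetical marker for a partially written znode, letting recovery
  // skip or clean the segment instead of shutting the NN down.
  static class CorruptMetadataException extends IOException {
    CorruptMetadataException(String msg) { super(msg); }
  }

  // 3 fields = inprogress (version;ledgerId;firstTxId),
  // 4 fields = finalized (version;ledgerId;firstTxId;lastTxId).
  static String[] readFields(ZooKeeper zkc, String path)
      throws IOException, KeeperException, InterruptedException {
    byte[] data = zkc.getData(path, false, null);
    String[] parts = new String(data).split(";");
    if (parts.length != 3 && parts.length != 4) {
      throw new CorruptMetadataException(
          "Invalid ledger entry at " + path + ": " + new String(data));
    }
    return parts;
  }
}
{noformat}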

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3423) BookKeeperJournalManager: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes

2012-05-15 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3423:
-

Status: Patch Available  (was: Open)

This patch requires HDFS-3058 to apply

> BookKeeperJournalManager: NN startup is failing, when tries to 
> recoverUnfinalizedSegments() a bad inProgress_ ZNodes
> 
>
> Key: HDFS-3423
> URL: https://issues.apache.org/jira/browse/HDFS-3423
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Rakesh R
>Assignee: Ivan Kelly
> Attachments: HDFS-3423.diff
>
>
> Say the inProgress_000X znode is corrupted because the data (version, 
> ledgerId, firstTxId) was never written to it. NameNode startup has logic to 
> recover all unfinalized segments; when it tries to read such a segment, the 
> NN shuts down.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>   throws IOException, KeeperException.NoNodeException  {
>   byte[] data = zkc.getData(path, false, null);
>   String[] parts = new String(data).split(";");
>   if (parts.length == 3)
>  reading inprogress metadata
>   else if (parts.length == 4)
>  reading finalized metadata
>   else
> throw new IOException("Invalid ledger entry, "
>   + new String(data));
>   }
> {noformat}
> Scenario: how is a bad inProgress_000X znode left behind?
> Assume BKJM has created the inProgress_000X znode and ZK becomes unavailable 
> while adding the metadata. inProgress_000X then ends up with partial 
> information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3423) BookKeeperJournalManager: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes

2012-05-15 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3423:
-

Attachment: HDFS-3423.diff

> BookKeeperJournalManager: NN startup is failing, when tries to 
> recoverUnfinalizedSegments() a bad inProgress_ ZNodes
> 
>
> Key: HDFS-3423
> URL: https://issues.apache.org/jira/browse/HDFS-3423
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Rakesh R
>Assignee: Ivan Kelly
> Attachments: HDFS-3423.diff
>
>
> Say the inProgress_000X znode is corrupted because the data (version, 
> ledgerId, firstTxId) was never written to it. NameNode startup has logic to 
> recover all unfinalized segments; when it tries to read such a segment, the 
> NN shuts down.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>   throws IOException, KeeperException.NoNodeException  {
>   byte[] data = zkc.getData(path, false, null);
>   String[] parts = new String(data).split(";");
>   if (parts.length == 3)
>  reading inprogress metadata
>   else if (parts.length == 4)
>  reading finalized metadata
>   else
> throw new IOException("Invalid ledger entry, "
>   + new String(data));
>   }
> {noformat}
> Scenario: how is a bad inProgress_000X znode left behind?
> Assume BKJM has created the inProgress_000X znode and ZK becomes unavailable 
> while adding the metadata. inProgress_000X then ends up with partial 
> information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Moved] (HDFS-3423) BookKeeperJournalManager: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes

2012-05-15 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly moved BOOKKEEPER-244 to HDFS-3423:
-

Target Version/s: 2.0.0, 3.0.0
 Key: HDFS-3423  (was: BOOKKEEPER-244)
 Project: Hadoop HDFS  (was: Bookkeeper)

> BookKeeperJournalManager: NN startup is failing, when tries to 
> recoverUnfinalizedSegments() a bad inProgress_ ZNodes
> 
>
> Key: HDFS-3423
> URL: https://issues.apache.org/jira/browse/HDFS-3423
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Rakesh R
>Assignee: Ivan Kelly
>
> Say the inProgress_000X znode is corrupted because the data (version, 
> ledgerId, firstTxId) was never written to it. NameNode startup has logic to 
> recover all unfinalized segments; when it tries to read such a segment, the 
> NN shuts down.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>   throws IOException, KeeperException.NoNodeException  {
>   byte[] data = zkc.getData(path, false, null);
>   String[] parts = new String(data).split(";");
>   if (parts.length == 3)
>  reading inprogress metadata
>   else if (parts.length == 4)
>  reading finalized metadata
>   else
> throw new IOException("Invalid ledger entry, "
>   + new String(data));
>   }
> {noformat}
> Scenario: how is a bad inProgress_000X znode left behind?
> Assume BKJM has created the inProgress_000X znode and ZK becomes unavailable 
> while adding the metadata. inProgress_000X then ends up with partial 
> information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-2717) BookKeeper Journal output stream doesn't check addComplete rc

2012-05-14 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-2717:
-

Fix Version/s: (was: 0.24.0)
   3.0.0
   2.0.0

> BookKeeper Journal output stream doesn't check addComplete rc
> -
>
> Key: HDFS-2717
> URL: https://issues.apache.org/jira/browse/HDFS-2717
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 2.0.0, 3.0.0
>
> Attachments: HDFS-2717.2.diff, HDFS-2717.diff
>
>
> As the summary says, we're not checking the addComplete return code, so 
> there's a chance of losing an update.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3058) HA: Bring BookKeeperJournalManager up to date with HA changes

2012-05-14 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274530#comment-13274530
 ] 

Ivan Kelly commented on HDFS-3058:
--

Regarding the patch applying, it depends on HDFS-2717, which fixes error 
handling on the bk output stream. This is necessary for the tests in HDFS-3058 
to work. 

Regarding 2), it's not quite possible to do it that way, as we also need to 
tell the minicluster setup to allow the test to manage the shared storage dir. 
We'd need an override to build the configuration and an override to build the 
minicluster, which I think is messier than what's there now; a rough sketch of 
that alternative is below.
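
For concreteness, the rejected alternative would look roughly like the sketch 
below. manageNameDfsSharedDirs(false) is the builder flag that lets a test 
manage the shared edits dir itself; the base class and method names are 
hypothetical.
{noformat}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.MiniDFSNNTopology;

public abstract class HAJournalTestBase {
  // Override point 1: build the configuration for the journal under test.
  protected Configuration createConf(String sharedEditsUri) {
    Configuration conf = new Configuration();
    conf.set(DFSConfigKeys.DFS_NAMENODE_SHARED_EDITS_DIR_KEY, sharedEditsUri);
    return conf;
  }

  // Override point 2: build the minicluster, telling it that the test
  // manages the shared storage dir itself.
  protected MiniDFSCluster createCluster(Configuration conf)
      throws IOException {
    return new MiniDFSCluster.Builder(conf)
        .nnTopology(MiniDFSNNTopology.simpleHATopology())
        .manageNameDfsSharedDirs(false)
        .numDataNodes(0)
        .build();
  }
}
{noformat}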

> HA: Bring BookKeeperJournalManager up to date with HA changes
> -
>
> Key: HDFS-3058
> URL: https://issues.apache.org/jira/browse/HDFS-3058
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 0.24.0
>
> Attachments: HDFS-3058.diff, HDFS-3058.diff
>
>
> There's a couple of TODO(HA) comments in the BookKeeperJournalManager code. 
> This JIRA is to address those.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3392) BookKeeper Journal Manager is not retrying to connect to BK when BookKeeper is not available for write.

2012-05-10 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272411#comment-13272411
 ] 

Ivan Kelly commented on HDFS-3392:
--

I don't understand the description here. If the BK cluster is down, then 
there's no way to connect to it. What do you think the behaviour should be?

> BookKeeper Journal Manager is not retrying to connect to BK when BookKeeper 
> is not available for write.
> ---
>
> Key: HDFS-3392
> URL: https://issues.apache.org/jira/browse/HDFS-3392
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: surendra singh lilhore
>
> Scenario:
> 1. Start 3 BookKeepers and 3 ZooKeepers.
> 2. Start one NN as active & the second NN as standby.
> 3. Write some files.
> 4. Stop all BookKeepers.
> Issue:
> BookKeeper Journal Manager does not retry connecting to BK when BookKeeper 
> is not available for writes, and the active NameNode shuts down.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3386) BK JM : Namenode is not deleting his lock entry '/ledgers/lock/lock-0000X', when fails to acquire the lock

2012-05-10 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3386:
-

Fix Version/s: (was: 0.23.0)
   3.0.0
   2.0.0
 Assignee: Ivan Kelly
   Status: Patch Available  (was: Open)

> BK JM : Namenode is not deleting his lock entry '/ledgers/lock/lock-X', 
> when fails to acquire the lock
> --
>
> Key: HDFS-3386
> URL: https://issues.apache.org/jira/browse/HDFS-3386
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Reporter: surendra singh lilhore
>Assignee: Ivan Kelly
>Priority: Minor
> Fix For: 2.0.0, 3.0.0
>
> Attachments: HDFS-3386.diff
>
>
> When a standby NN becomes active, it first creates its sequential lock 
> entry lock-000X in ZK and then tries to acquire the lock, as shown 
> below:
> {quote}
> myznode = zkc.create(lockpath + "/lock-", new byte[] {'0'},
>  Ids.OPEN_ACL_UNSAFE,
>  CreateMode.EPHEMERAL_SEQUENTIAL);
> if ((lockpath + "/" + nodes.get(0)).equals(myznode)) {
> if (LOG.isTraceEnabled()) {
> LOG.trace("Lock acquired - " + myznode);
> }
> lockCount.set(1);
> zkc.exists(myznode, this);
> return;
> } else {
> LOG.error("Failed to acquire lock with " + myznode
> + ", " + nodes.get(0) + " already has it");
> throw new IOException("Could not acquire lock");
> }  
> {quote}
> Say the transition to active fails to acquire the lock: it throws an 
> exception and the NN shuts down. The problem is that the lock entry 
> lock-000X will exist in ZK until session expiry, so subsequent start-ups 
> will not be able to acquire the lock.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3386) BK JM : Namenode is not deleting his lock entry '/ledgers/lock/lock-0000X', when fails to acquire the lock

2012-05-10 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3386:
-

Attachment: HDFS-3386.diff

Patch applies on top of HDFS-3058
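
The direction of the fix, roughly sketched (illustrative shape only, not a 
copy of the diff): when we lose the race for the lock, delete our own 
sequential znode before throwing, so a restart is not blocked until ZK 
session expiry.
{noformat}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

class LockCleanup {
  private static final Log LOG = LogFactory.getLog(LockCleanup.class);

  // Called after we created myznode but found another node holds the lock.
  static void deleteStaleLockNode(ZooKeeper zkc, String myznode) {
    try {
      zkc.delete(myznode, -1); // -1 matches any version
    } catch (KeeperException.NoNodeException e) {
      // already gone, nothing to clean up
    } catch (KeeperException e) {
      LOG.warn("Could not clean up lock node " + myznode, e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{noformat}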

> BK JM : Namenode is not deleting his lock entry '/ledgers/lock/lock-X', 
> when fails to acquire the lock
> --
>
> Key: HDFS-3386
> URL: https://issues.apache.org/jira/browse/HDFS-3386
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Reporter: surendra singh lilhore
>Priority: Minor
> Fix For: 2.0.0, 3.0.0
>
> Attachments: HDFS-3386.diff
>
>
> When a standby NN becomes active, it first creates its sequential lock 
> entry lock-000X in ZK and then tries to acquire the lock, as shown 
> below:
> {quote}
> myznode = zkc.create(lockpath + "/lock-", new byte[] {'0'},
>  Ids.OPEN_ACL_UNSAFE,
>  CreateMode.EPHEMERAL_SEQUENTIAL);
> if ((lockpath + "/" + nodes.get(0)).equals(myznode)) {
> if (LOG.isTraceEnabled()) {
> LOG.trace("Lock acquired - " + myznode);
> }
> lockCount.set(1);
> zkc.exists(myznode, this);
> return;
> } else {
> LOG.error("Failed to acquire lock with " + myznode
> + ", " + nodes.get(0) + " already has it");
> throw new IOException("Could not acquire lock");
> }  
> {quote}
> Say the transition to active fails to acquire the lock: it throws an 
> exception and the NN shuts down. The problem is that the lock entry 
> lock-000X will exist in ZK until session expiry, so subsequent start-ups 
> will not be able to acquire the lock.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3058) HA: Bring BookKeeperJournalManager up to date with HA changes

2012-05-10 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-3058:
-

Attachment: HDFS-3058.diff

Rebased onto trunk

> HA: Bring BookKeeperJournalManager up to date with HA changes
> -
>
> Key: HDFS-3058
> URL: https://issues.apache.org/jira/browse/HDFS-3058
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 0.24.0
>
> Attachments: HDFS-3058.diff, HDFS-3058.diff
>
>
> There's a couple of TODO(HA) comments in the BookKeeperJournalManager code. 
> This JIRA is to address those.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-234) Integration with BookKeeper logging system

2012-05-08 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-234:


Attachment: HDFS-234-branch-2.patch

This patch is a direct copy of the bkjournal code from trunk on the 8th May 
2012. To be applied on branch-2.

> Integration with BookKeeper logging system
> --
>
> Key: HDFS-234
> URL: https://issues.apache.org/jira/browse/HDFS-234
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Luca Telloli
>Assignee: Ivan Kelly
> Fix For: 3.0.0
>
> Attachments: HADOOP-5189-trunk-preview.patch, 
> HADOOP-5189-trunk-preview.patch, HADOOP-5189-trunk-preview.patch, 
> HADOOP-5189-v.19.patch, HADOOP-5189.patch, HDFS-234-branch-2.patch, 
> HDFS-234.diff, HDFS-234.diff, HDFS-234.diff, HDFS-234.diff, HDFS-234.diff, 
> HDFS-234.patch, create.png, hdfs_tpt_lat.pdf, zookeeper-dev-bookkeeper.jar, 
> zookeeper-dev.jar
>
>
> BookKeeper is a system to reliably log streams of records 
> (https://issues.apache.org/jira/browse/ZOOKEEPER-276). The NameNode is a 
> natural target for such a system for being the metadata repository of the 
> entire file system for HDFS. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2743) Streamline usage of bookkeeper journal manager

2012-04-26 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262722#comment-13262722
 ] 

Ivan Kelly commented on HDFS-2743:
--

{quote}
I think you already handled them as part of other issues, like HDFS-2717, 
HDFS-3058.
{quote}
Yup, HA won't work without HDFS-3058

> Streamline usage of bookkeeper journal manager
> --
>
> Key: HDFS-2743
> URL: https://issues.apache.org/jira/browse/HDFS-2743
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 0.24.0
>
> Attachments: HDFS-2743.diff, HDFS-2743.diff
>
>
> The current method of installing bkjournal manager involves generating a 
> tarball and extracting it with special flags over the hdfs distribution. 
> This is cumbersome and prone to being broken by other changes (see 
> https://svn.apache.org/repos/asf/hadoop/common/trunk@1220940). I think a 
> cleaner way of doing this is to generate a single jar that can be placed in 
> the lib dir of hdfs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-2188) HDFS-1580: Make FSEditLog create its journals from a list of URIs rather than NNStorage

2011-09-26 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated HDFS-2188:
-

Attachment: HDFS-2188.diff

Removed storage of conf for the moment. I've left conf as a parameter to 
FSEditLog, as generic journals will need it, and I don't want to change all the 
constructors in TestEditLog again. Also, I've added javadoc for FSEditLog().

> HDFS-1580: Make FSEditLog create its journals from a list of URIs rather than 
> NNStorage
> ---
>
> Key: HDFS-2188
> URL: https://issues.apache.org/jira/browse/HDFS-2188
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 0.23.0
>
> Attachments: HDFS-2188.diff, HDFS-2188.diff, HDFS-2188.diff, 
> HDFS-2188.diff
>
>
> Currently, FSEditLog retrieves the list of Journals to create from NNStorage. 
> Obviously this is file specific. This JIRA aims to remove this restriction to 
> make it possible to create journals of custom types.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2158) Add JournalSet to manage the set of journals.

2011-09-26 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114563#comment-13114563
 ] 

Ivan Kelly commented on HDFS-2158:
--

Why not get rid of journal stream completely? Perhaps we can do this in a later 
JIRA.

Still have a findbugs warning because of the custom synchronization in logSync:

<Match>
  <Class name="org.apache.hadoop.hdfs.server.namenode.FSEditLog" />
  <Method name="logSync" />
  <Bug pattern="UL_UNRELEASED_LOCK" />
</Match>

You've added a new log statement in EditLogFileOutputStream#flushAndSync. 
Shouldn't it be a LOG.warn()? There shouldn't be any instance of calling flush 
without data having been written to the stream first. The only case where it 
could happen is when the stream has errored on a previous write, but it should 
have been removed from JournalSetOutputStream at that point. In fact, I'm not 
sure how this worked in the old code either.

In FSEditLog#close you removed the assert for !journals.isEmpty. Instead you 
could put assert editLogStream != null;

In FSEditLog#logEdit, you catch the IOException, log and continue as normal. I 
can't see where the error is handled later. I don't think 
editLogStream.setReadyToFlush() will throw an exception if no active journals 
are found.

JournalSet#getSyncTimes doesn't return anything, so it's not really a getter. I 
think this would be cleaner if getSyncTimes() just returned a String rather 
than taking in a StringBuilder.
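
Concretely, something like the following sketch, assuming the JournalAndStream 
type from the patch and a getTotalSyncTime() on the stream:
{noformat}
// Returns the per-journal sync times as a space-separated string instead
// of mutating a caller-supplied StringBuilder.
String getSyncTimes() {
  StringBuilder buf = new StringBuilder();
  for (JournalAndStream jas : journals) {
    if (jas.isActive()) {
      buf.append(jas.getCurrentStream().getTotalSyncTime());
      buf.append(" ");
    }
  }
  return buf.toString();
}
{noformat}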

{quote}
The finalizeLogSegments takes parameters which are not available in close 
method. 
{quote}
The last txid should be available to the stream, as it must have passed 
through write(). Same for the first txid. In any case, it's not that important 
to me, just something I thought would make things nicer.


> Add JournalSet to manage the set of journals.
> -
>
> Key: HDFS-2158
> URL: https://issues.apache.org/jira/browse/HDFS-2158
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jitendra Nath Pandey
>Assignee: Jitendra Nath Pandey
> Attachments: HDFS-2158.1.patch, HDFS-2158.3.patch, HDFS-2158.4.patch, 
> HDFS-2158.8.patch, HDFS-2158.9.patch
>
>
> The management of the collection of journals can be encapsulated in a 
> JournalSet. This will clean up the FSEditLog code significantly.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2158) Add JournalSet to manage the set of journals.

2011-09-26 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114564#comment-13114564
 ] 

Ivan Kelly commented on HDFS-2158:
--

Forgot to say, the xml above should be added to 
hadoop-hdfs/dev-support/findbugsExcludeFile.xml


> Add JournalSet to manage the set of journals.
> -
>
> Key: HDFS-2158
> URL: https://issues.apache.org/jira/browse/HDFS-2158
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jitendra Nath Pandey
>Assignee: Jitendra Nath Pandey
> Attachments: HDFS-2158.1.patch, HDFS-2158.3.patch, HDFS-2158.4.patch, 
> HDFS-2158.8.patch, HDFS-2158.9.patch
>
>
> The management of the collection of journals can be encapsulated in a 
> JournalSet. This will clean up the FSEditLog code significantly.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2158) Add JournalSet to manage the set of journals.

2011-09-21 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109656#comment-13109656
 ] 

Ivan Kelly commented on HDFS-2158:
--

Looks like a pretty straightforward refactor. A few comments.

L518 & L276
I see that you create a single editLogStream at the start and have 
startLogSegment change it in the background. This is very side-effecty. How 
about having startLogSegment return an EditLogOutputStream and assigning it to 
editLogStream? Likewise, on line L536, just call editLogStream.close() and have 
your implementation of the stream interface call finalize for the stream. That 
would match better with L556 too; see the sketch below.
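
In other words, something like this (illustrative shape only):
{noformat}
// startLogSegment returns the stream; the caller makes the assignment
// explicit instead of relying on a side effect.
editLogStream = journalSet.startLogSegment(segmentTxId);
// ... and later, closing the stream finalizes the segment internally:
editLogStream.close();
editLogStream = null;
{noformat}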

L409 
Are we losing information here? Previously we'd get numbers for each journal; 
now we only get one.

L1099 & L1086
These look like debugging statements. 


> Add JournalSet to manage the set of journals.
> -
>
> Key: HDFS-2158
> URL: https://issues.apache.org/jira/browse/HDFS-2158
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jitendra Nath Pandey
>Assignee: Jitendra Nath Pandey
> Attachments: HDFS-2158.1.patch, HDFS-2158.3.patch, HDFS-2158.4.patch
>
>
> The management of the collection of journals can be encapsulated in a 
> JournalSet. This will clean up the FSEditLog code significantly.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2188) HDFS-1580: Make FSEditLog create its journals from a list of URIs rather than NNStorage

2011-09-21 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109626#comment-13109626
 ] 

Ivan Kelly commented on HDFS-2188:
--

The mechanism for loading the journal from the type of the URL is the next 
change I have lined up after this; the stored conf is also used for that (see 
the sketch below). I put it in now because I had to change all the tests that 
construct FSEditLog for the URI list, and would have had to do it again when I 
added the conf. It'll make the next patch smaller. The next patch depends on 
this patch and on HDFS-2334, so I wanted to get at least one of those in 
before generating it.
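
For readers following the thread, the scheme-based loading lined up next looks 
roughly like the sketch below; the conf key prefix follows the 
dfs.namenode.edits.journal-plugin.<scheme> convention, but treat the rest as 
illustrative.
{noformat}
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.server.namenode.JournalManager;

class JournalFactory {
  // One conf key per URI scheme, e.g.
  // dfs.namenode.edits.journal-plugin.bookkeeper = ...BookKeeperJournalManager
  static JournalManager createJournal(Configuration conf, URI uri)
      throws IOException {
    String key = "dfs.namenode.edits.journal-plugin." + uri.getScheme();
    Class<? extends JournalManager> clazz =
        conf.getClass(key, null, JournalManager.class);
    if (clazz == null) {
      throw new IOException("No journal plugin for scheme "
          + uri.getScheme());
    }
    try {
      return clazz.getConstructor(Configuration.class, URI.class)
          .newInstance(conf, uri);
    } catch (Exception e) {
      throw new IOException("Unable to construct journal for " + uri, e);
    }
  }
}
{noformat}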

> HDFS-1580: Make FSEditLog create its journals from a list of URIs rather than 
> NNStorage
> ---
>
> Key: HDFS-2188
> URL: https://issues.apache.org/jira/browse/HDFS-2188
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 0.23.0
>
> Attachments: HDFS-2188.diff, HDFS-2188.diff, HDFS-2188.diff
>
>
> Currently, FSEditLog retrieves the list of Journals to create from NNStorage. 
> Obviously this is file specific. This JIRA aims to remove this restriction to 
> make it possible to create journals of custom types.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2333) HDFS-2284 introduced 2 findbugs warnings on trunk

2011-09-15 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105221#comment-13105221
 ] 

Ivan Kelly commented on HDFS-2333:
--

Looks good to me. +1 
Perhaps add to the DFSClient#append javadoc that null is acceptable for the 
3rd and 4th parameters.
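
For example (parameter names here are illustrative, not the committed 
signature):
{noformat}
/**
 * Append to an existing HDFS file.
 *
 * @param src the file to append to
 * @param buffersize the buffer size to use
 * @param progress progress reporting; null is acceptable
 * @param statistics filesystem statistics to update; null is acceptable
 */
{noformat}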

> HDFS-2284 introduced 2 findbugs warnings on trunk
> -
>
> Key: HDFS-2333
> URL: https://issues.apache.org/jira/browse/HDFS-2333
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-2333.diff, h2333_20110914.patch, 
> h2333_20110914b.patch
>
>
> When HDFS-2284 was submitted it made DFSOutputStream public which triggered 
> two SC_START_IN_CTOR findbug warnings.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2334) Add Closeable to JournalManager

2011-09-14 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104639#comment-13104639
 ] 

Ivan Kelly commented on HDFS-2334:
--

findbugs are unrelated, see HDFS-2333

Failed tests have been failing for about a week at this stage.

This patch adds no new tests, as the close methods on the current 
JournalManagers are empty at the moment. HDFS-234 will add a JournalManager 
which makes use of it, along with tests for that new JournalManager. The shape 
of the change is sketched below.
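
For context, the change amounts to the following shape (method list 
abbreviated and illustrative):
{noformat}
import java.io.Closeable;
import java.io.IOException;

// JournalManager extends Closeable so FSEditLog.close() can release any
// resources a journal holds (ZK/BK handles, directory locks, ...).
public interface JournalManager extends Closeable {
  EditLogOutputStream startLogSegment(long txId) throws IOException;
  void finalizeLogSegment(long firstTxId, long lastTxId) throws IOException;

  @Override
  void close() throws IOException; // release held resources
}
{noformat}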

> Add Closeable to JournalManager
> ---
>
> Key: HDFS-2334
> URL: https://issues.apache.org/jira/browse/HDFS-2334
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Fix For: 0.23.0
>
> Attachments: HDFS-2334.diff
>
>
> A JournalManager may take hold of resources for the duration of its 
> lifetime. This isn't the case at the moment for FileJournalManager, but 
> BookKeeperJournalManager will, and it's conceivable that FileJournalManager 
> could take a lock on a directory, etc. 
> This JIRA is to add Closeable to JournalManager so that these resources can 
> be cleaned up when FSEditLog is closed.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



