[jira] [Updated] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-27 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-3860:


Attachment: HDFS-heartbeat-testcase.patch

 HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
 -

 Key: HDFS-3860
 URL: https://issues.apache.org/jira/browse/HDFS-3860
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-heartbeat-testcase.patch


 In HeartbeatManager#heartbeatCheck, if a dead datanode is found, the 
 monitor thread will acquire the write lock of the namesystem and recheck 
 safemode. If the namesystem is in safemode, the monitor thread will return 
 from the heartbeatCheck function without releasing the write lock. This may 
 cause the monitor thread to wrongly hold the write lock forever.
 The attached test case tries to simulate this bad scenario.
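The lock-leak pattern described above, and the usual fix, can be sketched in a stand-alone miniature (invented names, not the actual HDFS code): wrapping the critical section in try/finally guarantees the write lock is released even on the early safemode return.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class HeartbeatCheckSketch {
    private static final ReentrantReadWriteLock namesystemLock =
        new ReentrantReadWriteLock();
    private static volatile boolean inSafeMode = true;

    // Fixed shape: the finally block releases the write lock on every exit
    // path, including the early safemode return the original code missed.
    public static void heartbeatCheckFixed() {
        namesystemLock.writeLock().lock();
        try {
            if (inSafeMode) {
                return; // early exit is now safe
            }
            // ... remove the dead datanode while holding the lock ...
        } finally {
            namesystemLock.writeLock().unlock();
        }
    }

    public static boolean writeLockHeld() {
        return namesystemLock.isWriteLocked();
    }

    public static void main(String[] args) {
        heartbeatCheckFixed();
        System.out.println(writeLockHeld()); // false: lock was released
    }
}
```

In the buggy shape, the `return` sits before the unlock call with no try/finally, so the monitor thread exits still holding the lock.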

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-27 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-3860:


Attachment: HDFS-3860.patch

 HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
 -

 Key: HDFS-3860
 URL: https://issues.apache.org/jira/browse/HDFS-3860
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch


 In HeartbeatManager#heartbeatCheck, if a dead datanode is found, the 
 monitor thread will acquire the write lock of the namesystem and recheck 
 safemode. If the namesystem is in safemode, the monitor thread will return 
 from the heartbeatCheck function without releasing the write lock. This may 
 cause the monitor thread to wrongly hold the write lock forever.
 The attached test case tries to simulate this bad scenario.



[jira] [Updated] (HDFS-3703) Decrease the datanode failure detection time

2012-08-27 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-3703:


Attachment: HDFS-3703.patch

This patch handles stale nodes for reading in a straightforward way. It adds 
two configuration parameters: one indicating whether to detect stale nodes, 
and one giving the time interval after which a node is treated as stale. The 
DatanodeManager#sortLocatedBlocks method then checks whether the datanodes are 
stale and moves possibly stale nodes to the end of the list.
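The sorting idea can be sketched as follows, with stated assumptions: the real patch operates on the datanode lists inside located blocks, whereas this toy version sorts raw heartbeat timestamps, and the 30-second interval stands in for the new configuration parameter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class StaleNodeSortSketch {
    // Placeholder for the configurable stale interval added by the patch.
    public static final long STALE_INTERVAL_MS = 30_000L;

    public static boolean isStale(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs > STALE_INTERVAL_MS;
    }

    // Stable sort keyed only on staleness: stale nodes (key = true) move to
    // the tail while the relative order of fresh nodes -- e.g. an existing
    // network-topology order -- is preserved.
    public static void pushStaleToEnd(List<Long> lastHeartbeats, long nowMs) {
        lastHeartbeats.sort(Comparator.comparing((Long hb) -> isStale(hb, nowMs)));
    }

    public static void main(String[] args) {
        List<Long> heartbeats = new ArrayList<>(List.of(100_000L, 10_000L, 99_000L));
        pushStaleToEnd(heartbeats, 100_000L);
        System.out.println(heartbeats); // [100000, 99000, 10000]
    }
}
```

Using a stable sort rather than filtering keeps stale nodes available as a last resort, which matches the described behavior of demoting rather than dropping them.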

 Decrease the datanode failure detection time
 

 Key: HDFS-3703
 URL: https://issues.apache.org/jira/browse/HDFS-3703
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: data-node, name-node
Affects Versions: 1.0.3, 2.0.0-alpha
Reporter: nkeywal
Assignee: Suresh Srinivas
 Attachments: HDFS-3703.patch


 By default, if a box dies, the datanode will be marked as dead by the 
 namenode after 10:30 minutes. In the meantime, this datanode will still be 
 proposed by the namenode to write blocks or to read replicas. This happens 
 as well if the datanode crashes: there are no shutdown hooks to tell the 
 namenode we're not there anymore.
 It is especially an issue with HBase. The HBase regionserver timeout for 
 production is often 30s. So with these configs, when a box dies HBase starts 
 to recover after 30s while, for 10 minutes, the namenode will still consider 
 the blocks on the same box as available. Beyond the write errors, this will 
 trigger a lot of missed reads:
 - during the recovery, HBase needs to read the blocks used on the dead box 
 (the ones in the 'HBase Write-Ahead-Log')
 - after the recovery, reading these data blocks (the 'HBase region') will 
 fail 33% of the time with the default number of replicas, slowing down data 
 access, especially when the errors are socket timeouts (i.e. around 60s most 
 of the time). 
 Globally, it would be ideal if HDFS settings could be kept under the HBase 
 settings. As a side note, HBase relies on ZooKeeper to detect regionserver 
 issues.



[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-28 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443292#comment-13443292
 ] 

Jing Zhao commented on HDFS-3860:
-

I just checked all the invocations of namesystem#writelock / writeunlock and 
did not find similar problems. I will check other similar code too.

 HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
 -

 Key: HDFS-3860
 URL: https://issues.apache.org/jira/browse/HDFS-3860
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch


 In HeartbeatManager#heartbeatCheck, if a dead datanode is found, the 
 monitor thread will acquire the write lock of the namesystem and recheck 
 safemode. If the namesystem is in safemode, the monitor thread will return 
 from the heartbeatCheck function without releasing the write lock. This may 
 cause the monitor thread to wrongly hold the write lock forever.
 The attached test case tries to simulate this bad scenario.



[jira] [Created] (HDFS-3887) Remove redundant chooseTarget methods in BlockPlacementPolicy.java

2012-09-03 Thread Jing Zhao (JIRA)
Jing Zhao created HDFS-3887:
---

 Summary: Remove redundant chooseTarget methods in 
BlockPlacementPolicy.java
 Key: HDFS-3887
 URL: https://issues.apache.org/jira/browse/HDFS-3887
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Trivial


BlockPlacementPolicy.java contains multiple chooseTarget() methods with 
different parameter lists. It is difficult to follow and understand the code 
since some chooseTarget methods only have minor differences and some of them 
are only invoked by testing code. 

In this patch, I try to remove some of the chooseTarget methods and only keep 
three of them: two abstract methods and the third one using BlockCollection as 
its parameter.



[jira] [Updated] (HDFS-3887) Remove redundant chooseTarget methods in BlockPlacementPolicy.java

2012-09-03 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-3887:


Attachment: HDFS-3887.patch

 Remove redundant chooseTarget methods in BlockPlacementPolicy.java
 --

 Key: HDFS-3887
 URL: https://issues.apache.org/jira/browse/HDFS-3887
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Trivial
 Attachments: HDFS-3887.patch


 BlockPlacementPolicy.java contains multiple chooseTarget() methods with 
 different parameter lists. It is difficult to follow and understand the code 
 since some chooseTarget methods only have minor differences and some of them 
 are only invoked by testing code. 
 In this patch, I try to remove some of the chooseTarget methods and only keep 
 three of them: two abstract methods and the third one using BlockCollection 
 as its parameter.



[jira] [Commented] (HDFS-3887) Remove redundant chooseTarget methods in BlockPlacementPolicy.java

2012-09-04 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13447521#comment-13447521
 ] 

Jing Zhao commented on HDFS-3887:
-

I have rerun the two failed tests on my local machine and both passed.

 Remove redundant chooseTarget methods in BlockPlacementPolicy.java
 --

 Key: HDFS-3887
 URL: https://issues.apache.org/jira/browse/HDFS-3887
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Trivial
 Attachments: HDFS-3887.patch


 BlockPlacementPolicy.java contains multiple chooseTarget() methods with 
 different parameter lists. It is difficult to follow and understand the code 
 since some chooseTarget methods only have minor differences and some of them 
 are only invoked by testing code. 
 In this patch, I try to remove some of the chooseTarget methods and only keep 
 three of them: two abstract methods and the third one using BlockCollection 
 as its parameter.



[jira] [Created] (HDFS-3888) BlockPlacementPolicyDefault#LOG should be removed

2012-09-04 Thread Jing Zhao (JIRA)
Jing Zhao created HDFS-3888:
---

 Summary: BlockPlacementPolicyDefault#LOG should be removed
 Key: HDFS-3888
 URL: https://issues.apache.org/jira/browse/HDFS-3888
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor


BlockPlacementPolicyDefault#LOG should be removed as it hides LOG from the base 
class BlockPlacementPolicy. Also, in BlockPlacementPolicyDefault#chooseTarget 
method, the logic computing the maxTargetPerLoc can be made a separate method.
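The field-hiding problem can be seen in a miniature example (java.util.logging stands in here for the Commons Logging used by Hadoop, and the class names are shortened stand-ins):

```java
import java.util.logging.Logger;

public class FieldHidingSketch {
    // Stand-in for BlockPlacementPolicy.
    static class Base {
        protected static final Logger LOG = Logger.getLogger("BlockPlacementPolicy");
    }

    // Stand-in for BlockPlacementPolicyDefault. Redeclaring LOG hides the
    // inherited field: code in this class logs under a different logger
    // name than code in the base class, which is why the redundant field
    // should be removed.
    static class Derived extends Base {
        protected static final Logger LOG = Logger.getLogger("BlockPlacementPolicyDefault");
    }

    public static void main(String[] args) {
        // The two LOG fields resolve to distinct loggers, not one shared logger.
        System.out.println(Base.LOG.getName().equals(Derived.LOG.getName())); // false
    }
}
```

Deleting the field in the derived class makes all placement-policy code share the single inherited logger.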



[jira] [Updated] (HDFS-3888) BlockPlacementPolicyDefault#LOG should be removed

2012-09-04 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-3888:


Attachment: HDFS-3888.patch

 BlockPlacementPolicyDefault#LOG should be removed
 -

 Key: HDFS-3888
 URL: https://issues.apache.org/jira/browse/HDFS-3888
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor
 Attachments: HDFS-3888.patch


 BlockPlacementPolicyDefault#LOG should be removed as it hides LOG from the 
 base class BlockPlacementPolicy. Also, in 
 BlockPlacementPolicyDefault#chooseTarget method, the logic computing the 
 maxTargetPerLoc can be made a separate method.



[jira] [Updated] (HDFS-3888) BlockPlacementPolicyDefault#LOG should be removed

2012-09-04 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-3888:


Status: Patch Available  (was: Open)

 BlockPlacementPolicyDefault#LOG should be removed
 -

 Key: HDFS-3888
 URL: https://issues.apache.org/jira/browse/HDFS-3888
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor
 Attachments: HDFS-3888.patch


 BlockPlacementPolicyDefault#LOG should be removed as it hides LOG from the 
 base class BlockPlacementPolicy. Also, in 
 BlockPlacementPolicyDefault#chooseTarget method, the logic computing the 
 maxTargetPerLoc can be made a separate method.



[jira] [Updated] (HDFS-3888) BlockPlacementPolicyDefault code cleanup

2012-09-04 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-3888:


Status: Open  (was: Patch Available)

 BlockPlacementPolicyDefault code cleanup
 

 Key: HDFS-3888
 URL: https://issues.apache.org/jira/browse/HDFS-3888
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor
 Attachments: HDFS-3888.patch


 BlockPlacementPolicyDefault#LOG should be removed as it hides LOG from the 
 base class BlockPlacementPolicy. Also, in 
 BlockPlacementPolicyDefault#chooseTarget method, the logic computing the 
 maxTargetPerLoc can be made a separate method.



[jira] [Updated] (HDFS-3888) BlockPlacementPolicyDefault code cleanup

2012-09-04 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-3888:


Attachment: HDFS-3888.patch

The code for computing maxNodePerRack in the chooseTarget() method is put 
back because it may also change the value of numOfReplicas.

 BlockPlacementPolicyDefault code cleanup
 

 Key: HDFS-3888
 URL: https://issues.apache.org/jira/browse/HDFS-3888
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor
 Attachments: HDFS-3888.patch, HDFS-3888.patch


 BlockPlacementPolicyDefault#LOG should be removed as it hides LOG from the 
 base class BlockPlacementPolicy. Also, in 
 BlockPlacementPolicyDefault#chooseTarget method, the logic computing the 
 maxTargetPerLoc can be made a separate method.



[jira] [Assigned] (HDFS-2656) Implement a pure c client based on webhdfs

2012-09-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao reassigned HDFS-2656:
---

Assignee: Jing Zhao

 Implement a pure c client based on webhdfs
 --

 Key: HDFS-2656
 URL: https://issues.apache.org/jira/browse/HDFS-2656
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: webhdfs
Reporter: Zhanwei.Wang
Assignee: Jing Zhao
 Attachments: HDFS-2656.patch, HDFS-2656.patch, 
 HDFS-2656.unfinished.patch, teragen_terasort_teravalidate_performance.png


 Currently, the implementation of libhdfs is based on JNI. The JVM overhead 
 seems a little big, and libhdfs also cannot be used in environments without 
 HDFS.
 It seems a good idea to implement a pure C client by wrapping webhdfs. It 
 could also be used to access different versions of HDFS.



[jira] [Updated] (HDFS-2656) Implement a pure c client based on webhdfs

2012-09-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-2656:


Affects Version/s: 3.0.0
   Status: Patch Available  (was: Open)

 Implement a pure c client based on webhdfs
 --

 Key: HDFS-2656
 URL: https://issues.apache.org/jira/browse/HDFS-2656
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: webhdfs
Affects Versions: 3.0.0
Reporter: Zhanwei.Wang
Assignee: Jing Zhao
 Attachments: HDFS-2656.patch, HDFS-2656.patch, 
 HDFS-2656.unfinished.patch, teragen_terasort_teravalidate_performance.png


 Currently, the implementation of libhdfs is based on JNI. The JVM overhead 
 seems a little big, and libhdfs also cannot be used in environments without 
 HDFS.
 It seems a good idea to implement a pure C client by wrapping webhdfs. It 
 could also be used to access different versions of HDFS.



[jira] [Updated] (HDFS-6041) Downgrade/Finalize should rename the rollback image instead of purging it

2014-03-03 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6041:


Attachment: HDFS-6041.000.patch

 Downgrade/Finalize should rename the rollback image instead of purging it
 -

 Key: HDFS-6041
 URL: https://issues.apache.org/jira/browse/HDFS-6041
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-6041.000.patch


 After we do a rolling upgrade downgrade/finalize, instead of purging the 
 rollback image, we'd better rename it back to a normal image, since the 
 rollback image can be the most recent checkpoint.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-03 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Attachment: HDFS-6038.001.patch

Update bkjournal to fix compilation errors.

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch


 In HA setup, the JNs receive edit logs (blob) from the NN and write into edit 
 log files. In order to write well-formed edit log files, the JNs prepend a 
 header for each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, and therefore it generates 
 incorrect edit logs when a newer release bumps the {{NameNodeLayoutVersion}} 
 during rolling upgrade.





[jira] [Commented] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-03 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918912#comment-13918912
 ] 

Jing Zhao commented on HDFS-6038:
-

One issue with the current patch is that the JN will also check the 
layoutversion locally while serving read requests. Let me see if we can bypass 
this check in JN.

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch


 In HA setup, the JNs receive edit logs (blob) from the NN and write into edit 
 log files. In order to write well-formed edit log files, the JNs prepend a 
 header for each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, and therefore it generates 
 incorrect edit logs when a newer release bumps the {{NameNodeLayoutVersion}} 
 during rolling upgrade.





[jira] [Updated] (HDFS-6041) Downgrade/Finalize should rename the rollback image instead of purging it

2014-03-03 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6041:


Attachment: HDFS-6041.001.patch

Thanks for the review, Nicholas! Update the patch to address your comments.

 Downgrade/Finalize should rename the rollback image instead of purging it
 -

 Key: HDFS-6041
 URL: https://issues.apache.org/jira/browse/HDFS-6041
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-6041.000.patch, HDFS-6041.001.patch


 After we do a rolling upgrade downgrade/finalize, instead of purging the 
 rollback image, we'd better rename it back to a normal image, since the 
 rollback image can be the most recent checkpoint.





[jira] [Resolved] (HDFS-6041) Downgrade/Finalize should rename the rollback image instead of purging it

2014-03-03 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao resolved HDFS-6041.
-

Resolution: Fixed

Thanks Nicholas! I've committed this.

 Downgrade/Finalize should rename the rollback image instead of purging it
 -

 Key: HDFS-6041
 URL: https://issues.apache.org/jira/browse/HDFS-6041
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-6041.000.patch, HDFS-6041.001.patch


 After we do a rolling upgrade downgrade/finalize, instead of purging the 
 rollback image, we'd better rename it back to a normal image, since the 
 rollback image can be the most recent checkpoint.





[jira] [Created] (HDFS-6053) Fix TestDecommissioningStatus and TestDecommission in branch-2

2014-03-04 Thread Jing Zhao (JIRA)
Jing Zhao created HDFS-6053:
---

 Summary: Fix TestDecommissioningStatus and TestDecommission in 
branch-2
 Key: HDFS-6053
 URL: https://issues.apache.org/jira/browse/HDFS-6053
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Jing Zhao
Assignee: Jing Zhao


The failure is caused by the backport of HDFS-5285. In 
BlockManager#isReplicationInProgress, {{if (bc instanceof 
MutableBlockCollection)}} should be replaced by {{if (bc.isUnderConstruction())}}.
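A hypothetical miniature of why the state-based check is the robust form (invented classes, not the real BlockManager code): once under-construction became a flag on the collection rather than a separate subclass, an `instanceof` test against the old subclass silently stops matching.

```java
public class UnderConstructionSketch {
    // Stand-in for BlockCollection after HDFS-5285: under-construction is a
    // flag on the object, not a separate MutableBlockCollection subclass.
    static class BlockCollection {
        private final boolean underConstruction;
        BlockCollection(boolean uc) { this.underConstruction = uc; }
        boolean isUnderConstruction() { return underConstruction; }
    }

    // The fixed check queries the state directly instead of testing for a
    // subclass that no longer exists in the hierarchy.
    public static boolean skipReplicationCheck(BlockCollection bc) {
        return bc.isUnderConstruction();
    }

    public static void main(String[] args) {
        System.out.println(skipReplicationCheck(new BlockCollection(true)));  // true
        System.out.println(skipReplicationCheck(new BlockCollection(false))); // false
    }
}
```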





[jira] [Updated] (HDFS-6053) Fix TestDecommissioningStatus and TestDecommission in branch-2

2014-03-04 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6053:


Attachment: HDFS-6053.000.patch

A simple patch to fix.

 Fix TestDecommissioningStatus and TestDecommission in branch-2
 --

 Key: HDFS-6053
 URL: https://issues.apache.org/jira/browse/HDFS-6053
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-6053.000.patch


 The failure is caused by the backport of HDFS-5285. In 
 BlockManager#isReplicationInProgress, {{if (bc instanceof 
 MutableBlockCollection)}} should be replaced by {{if 
 (bc.isUnderConstruction())}}.





[jira] [Commented] (HDFS-6053) Fix TestDecommissioningStatus and TestDecommission in branch-2

2014-03-04 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919809#comment-13919809
 ] 

Jing Zhao commented on HDFS-6053:
-

In my local test the patch fixes the two failed unit tests, and with the 
change the code is consistent with trunk.

 Fix TestDecommissioningStatus and TestDecommission in branch-2
 --

 Key: HDFS-6053
 URL: https://issues.apache.org/jira/browse/HDFS-6053
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-6053.000.patch


 The failure is caused by the backport of HDFS-5285. In 
 BlockManager#isReplicationInProgress, {{if (bc instanceof 
 MutableBlockCollection)}} should be replaced by {{if 
 (bc.isUnderConstruction())}}.





[jira] [Resolved] (HDFS-6053) Fix TestDecommissioningStatus and TestDecommission in branch-2

2014-03-04 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao resolved HDFS-6053.
-

   Resolution: Fixed
Fix Version/s: 2.4.0

Thanks for the review, Nicholas! I've committed this to branch-2 and 
branch-2.4.0.

 Fix TestDecommissioningStatus and TestDecommission in branch-2
 --

 Key: HDFS-6053
 URL: https://issues.apache.org/jira/browse/HDFS-6053
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Fix For: 2.4.0

 Attachments: HDFS-6053.000.patch


 The failure is caused by the backport of HDFS-5285. In 
 BlockManager#isReplicationInProgress, {{if (bc instanceof 
 MutableBlockCollection)}} should be replaced by {{if 
 (bc.isUnderConstruction())}}.





[jira] [Commented] (HDFS-6043) Give HDFS daemons NFS3 and Portmap their own OPTS

2014-03-04 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920148#comment-13920148
 ] 

Jing Zhao commented on HDFS-6043:
-

nit: in {{export HADOOP_NFS3_OPTS= $HADOOP_NFS3_OPTS}} there is an extra 
space before the $. Other than that, +1.

 Give HDFS daemons NFS3 and Portmap their own OPTS
 -

 Key: HDFS-6043
 URL: https://issues.apache.org/jira/browse/HDFS-6043
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: nfs
Reporter: Brandon Li
Assignee: Brandon Li
 Attachments: HDFS-6043.001.patch


 Like some other HDFS services, the OPTS makes it easier for the users to 
 update resource related settings for the NFS gateway. 





[jira] [Commented] (HDFS-6044) Add property for setting the NFS look up time for users

2014-03-04 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920163#comment-13920163
 ] 

Jing Zhao commented on HDFS-6044:
-

The patch looks good to me. Some minor comments:
# We may want to define the constant string somewhere in the following code?
{code}
+timeout = conf.getLong("hadoop.nfs3.userupdate.ms", TIMEOUT_DEFAULT);
{code}
# Can we declare timeout as final and initialize it in the two constructors?

Other than that +1.
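The first suggestion above can be illustrated with a sketch: hoist the raw key into a named constant so the string is defined in one place. The constant name, the Map-based stand-in for Hadoop's Configuration, and the 15-minute default (taken from the issue description's refresh interval) are all assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class NfsConfSketch {
    // Named constant for the config key quoted in the patch, so the raw
    // string literal appears in exactly one place.
    public static final String NFS_USERUPDATE_MILLIS_KEY = "hadoop.nfs3.userupdate.ms";
    // Assumed default: the issue says the account refresh runs every 15 minutes.
    public static final long TIMEOUT_DEFAULT = 15 * 60 * 1000L;

    // Minimal stand-in for Hadoop's Configuration#getLong.
    public static long getLong(Map<String, String> conf, String key, long def) {
        String v = conf.get(key);
        return v == null ? def : Long.parseLong(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        long timeout = getLong(conf, NFS_USERUPDATE_MILLIS_KEY, TIMEOUT_DEFAULT);
        System.out.println(timeout); // 900000 (the assumed 15-minute default)

        conf.put(NFS_USERUPDATE_MILLIS_KEY, "60000");
        System.out.println(getLong(conf, NFS_USERUPDATE_MILLIS_KEY, TIMEOUT_DEFAULT)); // 60000
    }
}
```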

 Add property for setting the NFS look up time for users
 ---

 Key: HDFS-6044
 URL: https://issues.apache.org/jira/browse/HDFS-6044
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: nfs
Reporter: Brandon Li
Assignee: Brandon Li
Priority: Minor
 Attachments: HDFS-6044.001.patch, HDFS-6044.002.patch


 Currently NFS gateway refresh the user account every 15 minutes. Add a 
 property to make it tunable in different environments. 





[jira] [Commented] (HDFS-5167) Add metrics about the NameNode retry cache

2014-03-05 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921142#comment-13921142
 ] 

Jing Zhao commented on HDFS-5167:
-

+1 for the latest patch. I will commit it soon.

 Add metrics about the NameNode retry cache
 --

 Key: HDFS-5167
 URL: https://issues.apache.org/jira/browse/HDFS-5167
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, namenode
Affects Versions: 3.0.0, 2.3.0, 2.4.0
Reporter: Jing Zhao
Assignee: Tsuyoshi OZAWA
Priority: Minor
 Attachments: HDFS-5167.1.patch, HDFS-5167.10.patch, 
 HDFS-5167.11.patch, HDFS-5167.12.patch, HDFS-5167.2.patch, HDFS-5167.3.patch, 
 HDFS-5167.4.patch, HDFS-5167.5.patch, HDFS-5167.6.patch, HDFS-5167.6.patch, 
 HDFS-5167.7.patch, HDFS-5167.8.patch, HDFS-5167.9-2.patch, HDFS-5167.9.patch


 It will be helpful to have metrics in NameNode about the retry cache, such as 
 the retry count etc.





[jira] [Updated] (HDFS-5167) Add metrics about the NameNode retry cache

2014-03-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-5167:


   Resolution: Fixed
Fix Version/s: 2.4.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Thanks for the great work, [~ozawa]! I've committed this to trunk, branch-2 and 
branch-2.4.0.

 Add metrics about the NameNode retry cache
 --

 Key: HDFS-5167
 URL: https://issues.apache.org/jira/browse/HDFS-5167
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, namenode
Affects Versions: 3.0.0, 2.3.0, 2.4.0
Reporter: Jing Zhao
Assignee: Tsuyoshi OZAWA
Priority: Minor
 Fix For: 2.4.0

 Attachments: HDFS-5167.1.patch, HDFS-5167.10.patch, 
 HDFS-5167.11.patch, HDFS-5167.12.patch, HDFS-5167.2.patch, HDFS-5167.3.patch, 
 HDFS-5167.4.patch, HDFS-5167.5.patch, HDFS-5167.6.patch, HDFS-5167.6.patch, 
 HDFS-5167.7.patch, HDFS-5167.8.patch, HDFS-5167.9-2.patch, HDFS-5167.9.patch


 It will be helpful to have metrics in NameNode about the retry cache, such as 
 the retry count etc.





[jira] [Updated] (HDFS-6058) Fix TestHDFSCLI failures after HADOOP-8691 change

2014-03-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6058:


Affects Version/s: 2.4.0

 Fix TestHDFSCLI failures after HADOOP-8691 change
 -

 Key: HDFS-6058
 URL: https://issues.apache.org/jira/browse/HDFS-6058
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Vinayakumar B
Assignee: Akira AJISAKA
 Attachments: HDFS-6058.000.patch, HDFS-6058.patch


 HADOOP-8691 changed the ls command output.
 TestHDFSCLI needs to be updated after this change.
 Latest precommit builds are failing because of this.
 https://builds.apache.org/job/PreCommit-HDFS-Build/6305//testReport/





[jira] [Updated] (HDFS-6058) Fix TestHDFSCLI failures after HADOOP-8691 change

2014-03-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6058:


Status: Patch Available  (was: Open)

 Fix TestHDFSCLI failures after HADOOP-8691 change
 -

 Key: HDFS-6058
 URL: https://issues.apache.org/jira/browse/HDFS-6058
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Vinayakumar B
Assignee: Akira AJISAKA
 Attachments: HDFS-6058.000.patch, HDFS-6058.patch


 HADOOP-8691 changed the ls command output.
 TestHDFSCLI needs to be updated after this change.
 The latest precommit builds are failing because of this:
 https://builds.apache.org/job/PreCommit-HDFS-Build/6305//testReport/





[jira] [Commented] (HDFS-5653) Log namenode hostname in various exceptions being thrown in a HA setup

2014-03-05 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921565#comment-13921565
 ] 

Jing Zhao commented on HDFS-5653:
-

+1, the new patch looks good to me.

 Log namenode hostname in various exceptions being thrown in a HA setup
 --

 Key: HDFS-5653
 URL: https://issues.apache.org/jira/browse/HDFS-5653
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ha
Affects Versions: 2.2.0
Reporter: Arpit Gupta
Assignee: Haohui Mai
Priority: Minor
 Attachments: HDFS-5653.000.patch, HDFS-5653.001.patch, 
 HDFS-5653.002.patch, HDFS-5653.003.patch, HDFS-5653.004.patch, 
 HDFS-5653.005.patch, HDFS-5653.006.patch


 In an HA setup, any time we see an exception such as safemode or namenode in 
 standby, we don't know which namenode it came from. The user has to go to 
 the logs of the namenodes and determine which one was active and/or standby 
 around that time.
 I think it would help with debugging if any such exceptions included the 
 namenode hostname, so the user would know exactly which namenode served the 
 request.





[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Attachment: HDFS-6038.002.patch

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Commented] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-05 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921602#comment-13921602
 ] 

Jing Zhao commented on HDFS-6038:
-

After offline discussion with [~szetszwo], the 002 patch makes the following 
changes:
1. Persist a length field for each editlog op to indicate the total length of 
the op.
2. In the JournalNode, instead of calling EditLogFileInputStream#validateEditLog 
to get the last txid of an in-progress editlog segment, we add a new method 
scanEditLog which does not decode each editlog op. Instead, the new method 
reads the length and txid of each op, and uses the length to quickly jump to 
the next op.

The 002 patch is just a preliminary patch to demonstrate the idea. We still 
need to fix the unit tests and run system tests.
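The scan idea described above can be sketched with a simplified, self-contained model (the record layout and class names here are invented for illustration; the real patch works on EditLogFileInputStream and FSEditLogOp, which are not shown): each op is written as [length][txid][payload], and the scanner reads only the length and txid, skipping the payload of each op.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class EditLogScan {
    // Append one op as [length][txid][payload]; length covers txid + payload.
    static void writeOp(DataOutputStream out, long txid, byte[] payload)
            throws IOException {
        out.writeInt(8 + payload.length);
        out.writeLong(txid);
        out.write(payload);
    }

    // Scan without decoding payloads: read length and txid, then jump ahead.
    // Returns the txid of the last complete op, or -1 for an empty log.
    static long lastTxId(byte[] log) {
        ByteBuffer buf = ByteBuffer.wrap(log);
        long last = -1;
        while (buf.remaining() >= 4) {
            int len = buf.getInt();
            if (len < 8 || len > buf.remaining()) {
                break; // truncated tail of an in-progress segment
            }
            last = buf.getLong();
            buf.position(buf.position() + (len - 8)); // skip the payload
        }
        return last;
    }

    // Build a tiny three-op segment for demonstration.
    static byte[] sampleLog() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            writeOp(out, 1, new byte[]{1, 2, 3});
            writeOp(out, 2, new byte[]{4});
            writeOp(out, 3, new byte[0]);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The point of the length prefix is that finding the last txid needs no knowledge of how each op body is encoded, which is exactly what lets the JN stay agnostic to the NN layout version.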

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Status: Patch Available  (was: Open)

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Status: Open  (was: Patch Available)

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Status: Patch Available  (was: Open)

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, editsStored


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Attachment: HDFS-6038.003.patch
editsStored

Fix some unit tests.

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, editsStored


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Updated] (HDFS-6062) TestRetryCacheWithHA#testConcat is flaky

2014-03-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6062:


Attachment: HDFS-6062.000.patch

The patch checks the length of the target file to make sure the concat has 
been processed on NN0.
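Why the length check is reliable while the exists() check is not can be modeled with a toy in-memory namespace (purely illustrative; ConcatCheck and its methods are invented for this sketch, not HDFS APIs): the target file exists before the concat ever runs, so exists() is an ambiguous signal, while the target length only changes once the concat is applied.

```java
import java.util.HashMap;
import java.util.Map;

public class ConcatCheck {
    private final Map<String, Integer> lengths = new HashMap<>();

    void create(String path) { lengths.put(path, 0); }

    // Concat folds the sources' bytes into the target.
    void concat(String target, int srcBytes) {
        lengths.merge(target, srcBytes, Integer::sum);
    }

    boolean exists(String path) { return lengths.containsKey(path); }
    int length(String path) { return lengths.getOrDefault(path, -1); }

    // exists() is already true before concat (ambiguous signal);
    // length() only reaches the expected size after concat runs.
    static boolean[] demo() {
        ConcatCheck fs = new ConcatCheck();
        fs.create("/target");
        boolean existsBeforeConcat = fs.exists("/target");
        boolean emptyBeforeConcat = fs.length("/target") == 0;
        fs.concat("/target", 1024);
        return new boolean[]{existsBeforeConcat, emptyBeforeConcat,
                             fs.length("/target") == 1024};
    }
}
```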

 TestRetryCacheWithHA#testConcat is flaky
 

 Key: HDFS-6062
 URL: https://issues.apache.org/jira/browse/HDFS-6062
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor
 Attachments: HDFS-6062.000.patch


 After adding the retry cache metrics check, TestRetryCacheWithHA#testConcat 
 can fail (https://builds.apache.org/job/PreCommit-HDFS-Build/6313//testReport/).
 The reason is that the test uses dfs.exists(targetPath) to check whether the 
 concat has been done on the original active NN. However, since we create the 
 target file at the beginning, the check always returns true. Thus it is 
 possible that the concat is processed by the new active NN for the first 
 time, in which case the retry cache will not be hit.





[jira] [Updated] (HDFS-6062) TestRetryCacheWithHA#testConcat is flaky

2014-03-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6062:


Affects Version/s: 2.4.0
   Status: Patch Available  (was: Open)

 TestRetryCacheWithHA#testConcat is flaky
 

 Key: HDFS-6062
 URL: https://issues.apache.org/jira/browse/HDFS-6062
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor
 Attachments: HDFS-6062.000.patch


 After adding the retry cache metrics check, TestRetryCacheWithHA#testConcat 
 can fail (https://builds.apache.org/job/PreCommit-HDFS-Build/6313//testReport/).
 The reason is that the test uses dfs.exists(targetPath) to check whether the 
 concat has been done on the original active NN. However, since we create the 
 target file at the beginning, the check always returns true. Thus it is 
 possible that the concat is processed by the new active NN for the first 
 time, in which case the retry cache will not be hit.





[jira] [Created] (HDFS-6062) TestRetryCacheWithHA#testConcat is flaky

2014-03-05 Thread Jing Zhao (JIRA)
Jing Zhao created HDFS-6062:
---

 Summary: TestRetryCacheWithHA#testConcat is flaky
 Key: HDFS-6062
 URL: https://issues.apache.org/jira/browse/HDFS-6062
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor
 Attachments: HDFS-6062.000.patch

After adding the retry cache metrics check, TestRetryCacheWithHA#testConcat can 
fail (https://builds.apache.org/job/PreCommit-HDFS-Build/6313//testReport/).

The reason is that the test uses dfs.exists(targetPath) to check whether the 
concat has been done on the original active NN. However, since we create the 
target file at the beginning, the check always returns true. Thus it is 
possible that the concat is processed by the new active NN for the first time, 
in which case the retry cache will not be hit.





[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-05 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Attachment: HDFS-6038.004.patch

Thanks for the review, Nicholas! Update the patch to address your comments.

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, editsStored


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Commented] (HDFS-6058) Fix TestHDFSCLI failures after HADOOP-8691 change

2014-03-05 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922122#comment-13922122
 ] 

Jing Zhao commented on HDFS-6058:
-

The failed test should be unrelated. +1 for the 000 patch.

 Fix TestHDFSCLI failures after HADOOP-8691 change
 -

 Key: HDFS-6058
 URL: https://issues.apache.org/jira/browse/HDFS-6058
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Vinayakumar B
Assignee: Akira AJISAKA
 Attachments: HDFS-6058.000.patch, HDFS-6058.patch


 HADOOP-8691 changed the ls command output.
 TestHDFSCLI needs to be updated after this change,
 Latest precommit builds are failing because of this.
 https://builds.apache.org/job/PreCommit-HDFS-Build/6305//testReport/





[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-06 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Attachment: HDFS-6038.005.patch

Update the patch to fix some unit tests and address Nicholas's comments.

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-06 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Attachment: (was: editsStored)

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Commented] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-06 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923479#comment-13923479
 ] 

Jing Zhao commented on HDFS-6038:
-

The editsStored binary file needs to be updated again. I will do it at the end.

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Commented] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-07 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923725#comment-13923725
 ] 

Jing Zhao commented on HDFS-6038:
-

bq. DataOutputBuffer.writeInt can be simplified as below.

Here we cannot call DataOutputBuffer#writeInt. The issue is that 
DataOutputStream#writeInt increases the total number of written bytes by 4 
(DataOutputStream#written). The total number of written bytes is later 
retrieved by EditsDoubleBuffer#countReadyBytes and used by 
QuorumOutputStream#flushAndSync to determine the size of the data flushed to 
the JNs. Since our writeInt(int, int) method actually modifies some previously 
written data, the total number of bytes written should stay unchanged. 
Directly calling DataOutputBuffer#writeInt there would append an extra 4 bytes 
for each editlog transaction recorded in the JNs.
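The distinction between appending and patching can be illustrated with a minimal buffer model (PatchableBuffer is invented for this sketch; Hadoop's actual DataOutputBuffer differs): the one-argument writeInt appends and advances the ready-byte count, while the two-argument variant rewrites previously written bytes in place and leaves the count untouched, so the flush size reported downstream stays correct.

```java
import java.util.Arrays;

public class PatchableBuffer {
    private byte[] buf = new byte[64];
    private int written = 0; // bytes "ready" to be flushed, e.g. to the JNs

    // Normal append: advances the written count by 4.
    void writeInt(int v) {
        if (written + 4 > buf.length) {
            buf = Arrays.copyOf(buf, buf.length * 2);
        }
        putInt(written, v);
        written += 4;
    }

    // Patch a previously written int in place; the written count is
    // unchanged, so the flush size stays correct.
    void writeInt(int v, int offset) { putInt(offset, v); }

    int countReadyBytes() { return written; }

    int getInt(int offset) {
        return ((buf[offset] & 0xff) << 24) | ((buf[offset + 1] & 0xff) << 16)
                | ((buf[offset + 2] & 0xff) << 8) | (buf[offset + 3] & 0xff);
    }

    private void putInt(int off, int v) {
        buf[off] = (byte) (v >>> 24);
        buf[off + 1] = (byte) (v >>> 16);
        buf[off + 2] = (byte) (v >>> 8);
        buf[off + 3] = (byte) v;
    }

    // Write a placeholder length, an op body, then patch the length:
    // returns {ready bytes, patched length field, op body}.
    static int[] demo() {
        PatchableBuffer b = new PatchableBuffer();
        b.writeInt(0);    // length placeholder
        b.writeInt(42);   // op body
        b.writeInt(8, 0); // patch the placeholder without growing the buffer
        return new int[]{b.countReadyBytes(), b.getInt(0), b.getInt(4)};
    }
}
```

If the patch step instead called the appending writeInt, the ready-byte count would grow by 4 per transaction even though no new op data was added, which is exactly the mismatch the comment describes.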

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-07 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Attachment: HDFS-6038.006.patch

Update the patch to fix the javadoc warning and unit test failure.

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch, HDFS-6038.006.patch


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-07 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Attachment: HDFS-6038.007.patch

Added an extra check for the written length field.

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, ha, hdfs-client, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Commented] (HDFS-6077) running slive with webhdfs on secure HA cluster fails with un kown host exception

2014-03-07 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924442#comment-13924442
 ] 

Jing Zhao commented on HDFS-6077:
-

This is actually a similar issue to HDFS-5339: SecurityUtil#buildTokenService 
tries to resolve the name service id as a host name. Since HDFS-5339 already 
figures out the token service name for the webhdfs filesystem during 
initialization, we can simply override the getCanonicalServiceName method and 
return the tokenServiceName.
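The fix idea can be sketched abstractly (the interface and method names below are invented for illustration; the real change lives in the webhdfs filesystem class): instead of deriving the service name by resolving the filesystem authority as a host, which throws for a logical HA nameservice id like "ha-2-secure", return the token service name that was already computed at initialization.

```java
public class CanonicalServiceName {
    interface Fs {
        String getCanonicalServiceName();
    }

    // Default path: build the service name by resolving the authority as a
    // hostname -- this is what fails for a logical HA nameservice id.
    static Fs resolving(String authority) {
        return () -> {
            throw new IllegalArgumentException(
                "java.net.UnknownHostException: " + authority);
        };
    }

    // Fix idea: return the token service name computed during initialization,
    // with no DNS resolution of the nameservice id.
    static Fs withTokenService(String tokenServiceName) {
        return () -> tokenServiceName;
    }

    static String demo() {
        return withTokenService("ha-hdfs:ha-2-secure").getCanonicalServiceName();
    }
}
```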

 running slive with webhdfs on secure HA cluster fails with un kown host 
 exception
 -

 Key: HDFS-6077
 URL: https://issues.apache.org/jira/browse/HDFS-6077
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: Arpit Gupta
Assignee: Jing Zhao







[jira] [Updated] (HDFS-6077) running slive with webhdfs on secure HA cluster fails with un kown host exception

2014-03-07 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6077:


Status: Patch Available  (was: Open)

 running slive with webhdfs on secure HA cluster fails with un kown host 
 exception
 -

 Key: HDFS-6077
 URL: https://issues.apache.org/jira/browse/HDFS-6077
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: Arpit Gupta
Assignee: Jing Zhao
 Attachments: HDFS-6077.000.patch








[jira] [Updated] (HDFS-6077) running slive with webhdfs on secure HA cluster fails with un kown host exception

2014-03-07 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6077:


Attachment: HDFS-6077.000.patch

A simple patch to fix the issue.

 running slive with webhdfs on secure HA cluster fails with un kown host 
 exception
 -

 Key: HDFS-6077
 URL: https://issues.apache.org/jira/browse/HDFS-6077
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: Arpit Gupta
Assignee: Jing Zhao
 Attachments: HDFS-6077.000.patch








[jira] [Commented] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-07 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924578#comment-13924578
 ] 

Jing Zhao commented on HDFS-6038:
-

[~tlipcon], do you also want to take a look at the patch?

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: journal-node, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-10 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Status: Open  (was: Patch Available)

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: journal-node, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, editsStored


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-10 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Attachment: editsStored

Upload the editsStored binary file.

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: journal-node, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, editsStored


 In an HA setup, the JNs receive edit logs (as a blob) from the NN and write 
 them into edit log files. In order to write well-formed edit log files, the 
 JNs prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, so it generates incorrect edit 
 logs when a newer release bumps the {{NameNodeLayoutVersion}} during a 
 rolling upgrade.





[jira] [Commented] (HDFS-6072) Clean up dead code of FSImage

2014-03-10 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13926059#comment-13926059
 ] 

Jing Zhao commented on HDFS-6072:
-

The patch looks good to me. Besides the comments from [~ajisakaa], we may also 
want to clean up the imports of FSNamesystem.java and remove 
AbstractINodeDiff#wirteSnapshot.

 Clean up dead code of FSImage
 -

 Key: HDFS-6072
 URL: https://issues.apache.org/jira/browse/HDFS-6072
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-6072.000.patch


 After HDFS-5698, HDFS stores the FSImage in protobuf format. The old code for 
 saving the FSImage is now dead and should be removed.





[jira] [Updated] (HDFS-6077) running slive with webhdfs on secure HA cluster fails with unkown host exception

2014-03-10 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6077:


Summary: running slive with webhdfs on secure HA cluster fails with unkown 
host exception  (was: running slive with webhdfs on secure HA cluster fails 
with un kown host exception)

 running slive with webhdfs on secure HA cluster fails with unkown host 
 exception
 

 Key: HDFS-6077
 URL: https://issues.apache.org/jira/browse/HDFS-6077
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: Arpit Gupta
Assignee: Jing Zhao
 Attachments: HDFS-6077.000.patch


 SliveTest fails with following.  See the comment for more logs.
 {noformat}
 SliveTest: Unable to run job due to error:
 java.lang.IllegalArgumentException: java.net.UnknownHostException: ha-2-secure
 at 
 org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
 at 
 org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:258)
 at 
 org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:299)
 ...
 {noformat}





[jira] [Updated] (HDFS-6077) running slive with webhdfs on secure HA cluster fails with unkown host exception

2014-03-10 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6077:


   Resolution: Fixed
Fix Version/s: 2.4.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

I've committed this to trunk, branch-2 and branch-2.4.0.

 running slive with webhdfs on secure HA cluster fails with unkown host 
 exception
 

 Key: HDFS-6077
 URL: https://issues.apache.org/jira/browse/HDFS-6077
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: Arpit Gupta
Assignee: Jing Zhao
 Fix For: 2.4.0

 Attachments: HDFS-6077.000.patch


 SliveTest fails with the following. See the comment for more logs.
 {noformat}
 SliveTest: Unable to run job due to error:
 java.lang.IllegalArgumentException: java.net.UnknownHostException: ha-2-secure
 at 
 org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
 at 
 org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:258)
 at 
 org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:299)
 ...
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6072) Clean up dead code of FSImage

2014-03-11 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13930973#comment-13930973
 ] 

Jing Zhao commented on HDFS-6072:
-

+1 for the new patch. Thanks for the cleaning, Haohui!

 Clean up dead code of FSImage
 -

 Key: HDFS-6072
 URL: https://issues.apache.org/jira/browse/HDFS-6072
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-6072.000.patch, HDFS-6072.001.patch, 
 HDFS-6072.002.patch


 After HDFS-5698, HDFS stores the FSImage in protobuf format. The old code for 
 saving the FSImage is now dead and should be removed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended

2014-03-11 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13931045#comment-13931045
 ] 

Jing Zhao commented on HDFS-6089:
-

Checked the log with Arpit. Looks like the issue is the following:
1. After NN1 was suspended, NN2 started the transition. It first tried to stop 
the editlog tailer thread.
2. The editlog tailer thread happened to trigger NN1 to roll its editlog right 
before the transition, and this RPC call got stuck since NN1 was suspended.
3. It took a relatively long time (~1 min) for the rollEditLog RPC call to 
receive the connection reset exception.
4. During this time, NN2 waited for the tailer thread to die, and the 
fsnamesystem lock was held by the stopStandbyServices call.
5. haadmin's getServiceState request could not get a response (since the lock 
was held by the transition thread in NN2) and timed out (its default socket 
timeout is 20s).

In summary, it is possible that the rollEditLog RPC call from the standby NN to 
the active NN in the editlog tailer thread delays the NN failover.
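The sequence above can be sketched with a toy program (hypothetical; the class 
below is invented for illustration and is not FSNamesystem code): a thread 
blocked in a slow call delays whoever joins it, just as the hung rollEditLog 
RPC delayed the transition.

```java
// Hypothetical sketch: the "tailer" thread is stuck in a slow call (standing
// in for the rollEditLog RPC to the suspended NN), and the "transition" path
// blocks in join() until that call returns, delaying the failover.
public class TailerStallSketch {
  public static void main(String[] args) throws InterruptedException {
    Thread tailer = new Thread(() -> {
      try {
        Thread.sleep(2000);  // stand-in for an RPC hung on a suspended NN
      } catch (InterruptedException ignored) {
        // a real tailer would exit here
      }
    });
    tailer.start();

    long start = System.nanoTime();
    tailer.join();           // the transition thread waits here, holding locks
    long waitedMs = (System.nanoTime() - start) / 1_000_000;

    if (waitedMs < 1500) {
      throw new AssertionError("expected the join to block, waited "
          + waitedMs + " ms");
    }
    System.out.println("transition was blocked for ~" + waitedMs + " ms");
  }
}
```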


 Standby NN while transitioning to active throws a connection refused error 
 when the prior active NN process is suspended
 

 Key: HDFS-6089
 URL: https://issues.apache.org/jira/browse/HDFS-6089
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Jing Zhao

 The following scenario was tested:
 * Determine Active NN and suspend the process (kill -19)
 * Wait about 60s to let the standby transition to active
 * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
 active.
 What was noticed is that sometimes the call to get the service state of nn2 
 got a socket timeout exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended

2014-03-11 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13931048#comment-13931048
 ] 

Jing Zhao commented on HDFS-6089:
-

Since the active NN already has a NameNodeEditLogRoller thread triggering the 
editlog roll, I guess the standby NN doesn't need to trigger the active 
namenode to roll the editlog.

 Standby NN while transitioning to active throws a connection refused error 
 when the prior active NN process is suspended
 

 Key: HDFS-6089
 URL: https://issues.apache.org/jira/browse/HDFS-6089
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Jing Zhao

 The following scenario was tested:
 * Determine Active NN and suspend the process (kill -19)
 * Wait about 60s to let the standby transition to active
 * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
 active.
 What was noticed is that sometimes the call to get the service state of nn2 
 got a socket timeout exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended

2014-03-11 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6089:


Attachment: HDFS-6089.000.patch

Simple patch to remove the editlog roll from SBN.

 Standby NN while transitioning to active throws a connection refused error 
 when the prior active NN process is suspended
 

 Key: HDFS-6089
 URL: https://issues.apache.org/jira/browse/HDFS-6089
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Jing Zhao
 Attachments: HDFS-6089.000.patch


 The following scenario was tested:
 * Determine Active NN and suspend the process (kill -19)
 * Wait about 60s to let the standby transition to active
 * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
 active.
 What was noticed is that sometimes the call to get the service state of nn2 
 got a socket timeout exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended

2014-03-11 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6089:


Status: Patch Available  (was: Open)

 Standby NN while transitioning to active throws a connection refused error 
 when the prior active NN process is suspended
 

 Key: HDFS-6089
 URL: https://issues.apache.org/jira/browse/HDFS-6089
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Jing Zhao
 Attachments: HDFS-6089.000.patch


 The following scenario was tested:
 * Determine Active NN and suspend the process (kill -19)
 * Wait about 60s to let the standby transition to active
 * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
 active.
 What was noticed is that sometimes the call to get the service state of nn2 
 got a socket timeout exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended

2014-03-12 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6089:


Attachment: HDFS-6089.001.patch

Fix unit tests.

 Standby NN while transitioning to active throws a connection refused error 
 when the prior active NN process is suspended
 

 Key: HDFS-6089
 URL: https://issues.apache.org/jira/browse/HDFS-6089
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Jing Zhao
 Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch


 The following scenario was tested:
 * Determine Active NN and suspend the process (kill -19)
 * Wait about 60s to let the standby transition to active
 * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
 active.
 What was noticed is that sometimes the call to get the service state of nn2 
 got a socket timeout exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-13 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13933577#comment-13933577
 ] 

Jing Zhao commented on HDFS-6038:
-

Thanks for the review, Todd!

bq. just worried that other contributors may want to review this patch as it's 
actually making an edit log format change, not just a protocol change for the 
JNs.
I will update the jira title and description to make the changes clearer.

bq. it might be nice to add a QJM test which writes fake ops to a JournalNode
Yeah, will update the patch to add the unit test.

 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: journal-node, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, editsStored


 In HA setup, the JNs receive edit logs (blobs) from the NN and write them 
 into edit log files. In order to write well-formed edit log files, the JNs 
 prepend a header to each edit log file.
 The problem is that the JN hard-codes the version (i.e., 
 {{NameNodeLayoutVersion}}) in the edit log, and therefore it generates 
 incorrect edit logs when a newer release bumps the {{NameNodeLayoutVersion}} 
 during a rolling upgrade.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6038) JournalNode hardcodes NameNodeLayoutVersion in the edit log file

2014-03-13 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Description: 
In HA setup, the JNs receive edit logs (blobs) from the NN and write them into 
edit log files. In order to write well-formed edit log files, the JNs prepend a 
header to each edit log file. The problem is that the JN hard-codes the version 
(i.e., {{NameNodeLayoutVersion}}) in the edit log, and therefore it generates 
incorrect edit logs when a newer release bumps the {{NameNodeLayoutVersion}} 
during a rolling upgrade.

Meanwhile, the JN currently tries to decode the in-progress editlog segment in 
order to know the last txid in the segment. In the rolling upgrade scenario, a 
JN with the old software may not be able to correctly decode the editlog 
generated by the new software.

This jira makes the following changes to allow the JN to handle editlog 
produced by software with a future layoutversion:
1. Change the NN-to-JN startLogSegment RPC signature and let the NN specify the 
layoutversion for the new editlog segment.
2. Persist a length field for each editlog op to indicate the total length of 
the op. Instead of calling EditLogFileInputStream#validateEditLog to get the 
last txid of an in-progress editlog segment, a new method scanEditLog is added 
and used by the JN, which does not decode each editlog op but uses the length 
to quickly jump to the next op.

  was:
In HA setup, the JNs receive edit logs (blob) from the NN and write into edit 
log files. In order to write well-formed edit log files, the JNs prepend a 
header for each edit log file.

The problem is that the JN hard-codes the version (i.e., 
{{NameNodeLayoutVersion}} in the edit log, therefore it generates incorrect 
edit logs when the newer release bumps the {{NameNodeLayoutVersion}} during 
rolling upgrade.


 JournalNode hardcodes NameNodeLayoutVersion in the edit log file
 

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: journal-node, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, editsStored


 In HA setup, the JNs receive edit logs (blobs) from the NN and write them 
 into edit log files. In order to write well-formed edit log files, the JNs 
 prepend a header to each edit log file. The problem is that the JN hard-codes 
 the version (i.e., {{NameNodeLayoutVersion}}) in the edit log, and therefore 
 it generates incorrect edit logs when a newer release bumps the 
 {{NameNodeLayoutVersion}} during a rolling upgrade.
 Meanwhile, the JN currently tries to decode the in-progress editlog segment 
 in order to know the last txid in the segment. In the rolling upgrade 
 scenario, a JN with the old software may not be able to correctly decode the 
 editlog generated by the new software.
 This jira makes the following changes to allow the JN to handle editlog 
 produced by software with a future layoutversion:
 1. Change the NN-to-JN startLogSegment RPC signature and let the NN specify 
 the layoutversion for the new editlog segment.
 2. Persist a length field for each editlog op to indicate the total length of 
 the op. Instead of calling EditLogFileInputStream#validateEditLog to get the 
 last txid of an in-progress editlog segment, a new method scanEditLog is 
 added and used by the JN, which does not decode each editlog op but uses the 
 length to quickly jump to the next op.
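Point 2 above can be illustrated with a small sketch. This is a hypothetical 
simplification (the record layout, class, and method names below are invented; 
the real editlog format and scanEditLog implementation differ): each op is 
stored as a length field followed by opaque bytes, so a scanner can find the 
last txid without decoding op bodies it may not understand.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical record layout: [int length][long txid][opaque body].
// "length" covers the txid plus the body, so a scanner can skip the body
// without being able to decode it.
public class LengthPrefixedScan {

  static byte[] writeOps(long[] txids, int bodyLen) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    for (long txid : txids) {
      out.writeInt(8 + bodyLen);      // total length of txid + body
      out.writeLong(txid);
      out.write(new byte[bodyLen]);   // body a JN with old software cannot decode
    }
    return bos.toByteArray();
  }

  // Find the last txid by jumping from length field to length field.
  static long scanLastTxid(byte[] segment) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(segment));
    long last = -1;
    while (in.available() >= 12) {    // at least a length field and a txid left
      int len = in.readInt();
      last = in.readLong();           // only the txid is decoded
      in.skipBytes(len - 8);          // skip the opaque body
    }
    return last;
  }

  public static void main(String[] args) throws IOException {
    byte[] segment = writeOps(new long[] {1, 2, 3}, 16);
    long last = scanLastTxid(segment);
    if (last != 3) {
      throw new AssertionError("expected last txid 3, got " + last);
    }
    System.out.println("last txid = " + last);
  }
}
```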



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6038) Allow JournalNode to handle editlog produced by new release with future layoutversion

2014-03-13 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Summary: Allow JournalNode to handle editlog produced by new release with 
future layoutversion  (was: JournalNode hardcodes NameNodeLayoutVersion in the 
edit log file)

 Allow JournalNode to handle editlog produced by new release with future 
 layoutversion
 -

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: journal-node, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, editsStored


 In HA setup, the JNs receive edit logs (blobs) from the NN and write them 
 into edit log files. In order to write well-formed edit log files, the JNs 
 prepend a header to each edit log file. The problem is that the JN hard-codes 
 the version (i.e., {{NameNodeLayoutVersion}}) in the edit log, and therefore 
 it generates incorrect edit logs when a newer release bumps the 
 {{NameNodeLayoutVersion}} during a rolling upgrade.
 Meanwhile, the JN currently tries to decode the in-progress editlog segment 
 in order to know the last txid in the segment. In the rolling upgrade 
 scenario, a JN with the old software may not be able to correctly decode the 
 editlog generated by the new software.
 This jira makes the following changes to allow the JN to handle editlog 
 produced by software with a future layoutversion:
 1. Change the NN-to-JN startLogSegment RPC signature and let the NN specify 
 the layoutversion for the new editlog segment.
 2. Persist a length field for each editlog op to indicate the total length of 
 the op. Instead of calling EditLogFileInputStream#validateEditLog to get the 
 last txid of an in-progress editlog segment, a new method scanEditLog is 
 added and used by the JN, which does not decode each editlog op but uses the 
 length to quickly jump to the next op.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6038) Allow JournalNode to handle editlog produced by new release with future layoutversion

2014-03-13 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Attachment: HDFS-6038.008.patch

Update the patch to address Todd's comments. The main change is a new unit 
test in TestJournal: the test writes some editlog ops that the JN cannot 
decode and verifies that the JN can use the length field to scan the segment.

 Allow JournalNode to handle editlog produced by new release with future 
 layoutversion
 -

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: journal-node, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, 
 HDFS-6038.008.patch, editsStored


 In HA setup, the JNs receive edit logs (blobs) from the NN and write them 
 into edit log files. In order to write well-formed edit log files, the JNs 
 prepend a header to each edit log file. The problem is that the JN hard-codes 
 the version (i.e., {{NameNodeLayoutVersion}}) in the edit log, and therefore 
 it generates incorrect edit logs when a newer release bumps the 
 {{NameNodeLayoutVersion}} during a rolling upgrade.
 Meanwhile, the JN currently tries to decode the in-progress editlog segment 
 in order to know the last txid in the segment. In the rolling upgrade 
 scenario, a JN with the old software may not be able to correctly decode the 
 editlog generated by the new software.
 This jira makes the following changes to allow the JN to handle editlog 
 produced by software with a future layoutversion:
 1. Change the NN-to-JN startLogSegment RPC signature and let the NN specify 
 the layoutversion for the new editlog segment.
 2. Persist a length field for each editlog op to indicate the total length of 
 the op. Instead of calling EditLogFileInputStream#validateEditLog to get the 
 last txid of an in-progress editlog segment, a new method scanEditLog is 
 added and used by the JN, which does not decode each editlog op but uses the 
 length to quickly jump to the next op.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6038) Allow JournalNode to handle editlog produced by new release with future layoutversion

2014-03-13 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


Status: Patch Available  (was: Open)

 Allow JournalNode to handle editlog produced by new release with future 
 layoutversion
 -

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: journal-node, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, 
 HDFS-6038.008.patch, editsStored


 In HA setup, the JNs receive edit logs (blobs) from the NN and write them 
 into edit log files. In order to write well-formed edit log files, the JNs 
 prepend a header to each edit log file. The problem is that the JN hard-codes 
 the version (i.e., {{NameNodeLayoutVersion}}) in the edit log, and therefore 
 it generates incorrect edit logs when a newer release bumps the 
 {{NameNodeLayoutVersion}} during a rolling upgrade.
 Meanwhile, the JN currently tries to decode the in-progress editlog segment 
 in order to know the last txid in the segment. In the rolling upgrade 
 scenario, a JN with the old software may not be able to correctly decode the 
 editlog generated by the new software.
 This jira makes the following changes to allow the JN to handle editlog 
 produced by software with a future layoutversion:
 1. Change the NN-to-JN startLogSegment RPC signature and let the NN specify 
 the layoutversion for the new editlog segment.
 2. Persist a length field for each editlog op to indicate the total length of 
 the op. Instead of calling EditLogFileInputStream#validateEditLog to get the 
 last txid of an in-progress editlog segment, a new method scanEditLog is 
 added and used by the JN, which does not decode each editlog op but uses the 
 length to quickly jump to the next op.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6094) The same block can be counted twice towards safe mode threshold

2014-03-13 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13934384#comment-13934384
 ] 

Jing Zhao commented on HDFS-6094:
-

I can also reproduce the issue on my local machine. Looks like the issue is:
1. After the standby NN restarts, DN1 sends first the incremental block report 
and then the complete block report to the SBN.
2. DN2 sends its incremental block report to the SBN. This block report does 
not change the replica number in the SBN, because the corresponding storage ID 
has not been added in the SBN yet (the storage ID is only added during full 
block report processing). However, the SBN still checks the current live 
replica number (which is 1, because the SBN already received the full block 
report from DN1) and uses that number to update the safe block count.

So maybe a simple fix can be:
{code}
@@ -2277,7 +2277,7 @@ private Block addStoredBlock(final BlockInfo block,
 if(storedBlock.getBlockUCState() == BlockUCState.COMMITTED 
 numLiveReplicas = minReplication) {
   storedBlock = completeBlock(bc, storedBlock, false);
-} else if (storedBlock.isComplete()) {
+} else if (storedBlock.isComplete()  added) {
   // check whether safe replication is reached for the block
   // only complete blocks are counted towards that
   // Is no-op if not in safe mode.
{code}
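A toy illustration of the effect of the added guard (hypothetical sketch; the 
names are invented and this is not the actual BlockManager logic): without the 
guard, a report that adds no new replica still bumps the safe block count.

```java
// Hypothetical sketch: compare counting on every report of a complete block
// (old behavior) with counting only when the report actually added a replica
// (the patched behavior with the "added" flag).
public class SafeModeCountSketch {
  int safeBlocks = 0;

  // Old logic: a complete block is counted on every report, even when the
  // report did not add a replica (e.g. an IBR for an unregistered storage).
  void processReportOld(boolean replicaAdded) {
    safeBlocks++;
  }

  // Fixed logic: count only when this report actually added a replica.
  void processReportFixed(boolean replicaAdded) {
    if (replicaAdded) {
      safeBlocks++;
    }
  }

  public static void main(String[] args) {
    // DN1's full block report adds the replica; DN2's incremental report for
    // a not-yet-registered storage does not (replicaAdded == false).
    SafeModeCountSketch old = new SafeModeCountSketch();
    old.processReportOld(true);
    old.processReportOld(false);

    SafeModeCountSketch fixed = new SafeModeCountSketch();
    fixed.processReportFixed(true);
    fixed.processReportFixed(false);

    if (old.safeBlocks != 2 || fixed.safeBlocks != 1) {
      throw new AssertionError("unexpected counts");
    }
    System.out.println("old count = " + old.safeBlocks
        + ", fixed count = " + fixed.safeBlocks);
  }
}
```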

 The same block can be counted twice towards safe mode threshold
 ---

 Key: HDFS-6094
 URL: https://issues.apache.org/jira/browse/HDFS-6094
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal

 {{BlockManager#addStoredBlock}} can cause the same block to be counted twice 
 towards the safe mode threshold. We see this manifest via 
 {{TestHASafeMode#testBlocksAddedWhileStandbyIsDown}} failures on Ubuntu. More 
 details to follow in a comment.
 Exception details:
 {code}
   Time elapsed: 12.874 sec   FAILURE!
 java.lang.AssertionError: Bad safemode status: 'Safe mode is ON. The reported 
 blocks 7 has reached the threshold 0.9990 of total blocks 6. The number of 
 live datanodes 3 has reached the minimum number 0. Safe mode will be turned 
 off automatically in 28 seconds.'
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.assertTrue(Assert.java:43)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.assertSafeMode(TestHASafeMode.java:493)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.testBlocksAddedWhileStandbyIsDown(TestHASafeMode.java:660)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6094) The same block can be counted twice towards safe mode threshold

2014-03-13 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13934407#comment-13934407
 ] 

Jing Zhao commented on HDFS-6094:
-

Another option is to add the new storage ID even for incremental block 
reports. [~arpitagarwal], what do you think?

 The same block can be counted twice towards safe mode threshold
 ---

 Key: HDFS-6094
 URL: https://issues.apache.org/jira/browse/HDFS-6094
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal

 {{BlockManager#addStoredBlock}} can cause the same block to be counted twice 
 towards the safe mode threshold. We see this manifest via 
 {{TestHASafeMode#testBlocksAddedWhileStandbyIsDown}} failures on Ubuntu. More 
 details to follow in a comment.
 Exception details:
 {code}
   Time elapsed: 12.874 sec   FAILURE!
 java.lang.AssertionError: Bad safemode status: 'Safe mode is ON. The reported 
 blocks 7 has reached the threshold 0.9990 of total blocks 6. The number of 
 live datanodes 3 has reached the minimum number 0. Safe mode will be turned 
 off automatically in 28 seconds.'
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.assertTrue(Assert.java:43)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.assertSafeMode(TestHASafeMode.java:493)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.testBlocksAddedWhileStandbyIsDown(TestHASafeMode.java:660)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6094) The same block can be counted twice towards safe mode threshold

2014-03-13 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6094:


Attachment: TestHASafeMode-output.txt

Attaching the log of the test run that reproduced the failure. I injected an 
exception at each increment of the safe block count.

 The same block can be counted twice towards safe mode threshold
 ---

 Key: HDFS-6094
 URL: https://issues.apache.org/jira/browse/HDFS-6094
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal
 Attachments: TestHASafeMode-output.txt


 {{BlockManager#addStoredBlock}} can cause the same block to be counted twice 
 towards the safe mode threshold. We see this manifest via 
 {{TestHASafeMode#testBlocksAddedWhileStandbyIsDown}} failures on Ubuntu. More 
 details to follow in a comment.
 Exception details:
 {code}
   Time elapsed: 12.874 sec   FAILURE!
 java.lang.AssertionError: Bad safemode status: 'Safe mode is ON. The reported 
 blocks 7 has reached the threshold 0.9990 of total blocks 6. The number of 
 live datanodes 3 has reached the minimum number 0. Safe mode will be turned 
 off automatically in 28 seconds.'
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.assertTrue(Assert.java:43)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.assertSafeMode(TestHASafeMode.java:493)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.testBlocksAddedWhileStandbyIsDown(TestHASafeMode.java:660)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6094) The same block can be counted twice towards safe mode threshold

2014-03-13 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13934472#comment-13934472
 ] 

Jing Zhao commented on HDFS-6094:
-

Maybe another issue with the current code is that when an incremental block 
report comes before the full block report, if the stored block state is 
COMMITTED, we may increase the safemode total block number without increasing 
the safe block count. In that case I'm not sure whether the NN can get stuck 
in safemode.

 The same block can be counted twice towards safe mode threshold
 ---

 Key: HDFS-6094
 URL: https://issues.apache.org/jira/browse/HDFS-6094
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal
 Attachments: TestHASafeMode-output.txt


 {{BlockManager#addStoredBlock}} can cause the same block to be counted twice 
 towards the safe mode threshold. We see this manifest via 
 {{TestHASafeMode#testBlocksAddedWhileStandbyIsDown}} failures on Ubuntu. More 
 details to follow in a comment.
 Exception details:
 {code}
   Time elapsed: 12.874 sec   FAILURE!
 java.lang.AssertionError: Bad safemode status: 'Safe mode is ON. The reported 
 blocks 7 has reached the threshold 0.9990 of total blocks 6. The number of 
 live datanodes 3 has reached the minimum number 0. Safe mode will be turned 
 off automatically in 28 seconds.'
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.assertTrue(Assert.java:43)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.assertSafeMode(TestHASafeMode.java:493)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.testBlocksAddedWhileStandbyIsDown(TestHASafeMode.java:660)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6094) The same block can be counted twice towards safe mode threshold

2014-03-14 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935296#comment-13935296
 ] 

Jing Zhao commented on HDFS-6094:
-

The patch looks good to me. One question: currently the NN adds info about a 
new datanode storage only when processing a complete block report. Can we also 
do this for IBRs?

 The same block can be counted twice towards safe mode threshold
 ---

 Key: HDFS-6094
 URL: https://issues.apache.org/jira/browse/HDFS-6094
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal
 Attachments: HDFS-6904.01.patch, TestHASafeMode-output.txt


 {{BlockManager#addStoredBlock}} can cause the same block to be counted twice 
 towards the safe mode threshold. We see this manifest via 
 {{TestHASafeMode#testBlocksAddedWhileStandbyIsDown}} failures on Ubuntu. More 
 details to follow in a comment.
 Exception details:
 {code}
   Time elapsed: 12.874 sec   FAILURE!
 java.lang.AssertionError: Bad safemode status: 'Safe mode is ON. The reported 
 blocks 7 has reached the threshold 0.9990 of total blocks 6. The number of 
 live datanodes 3 has reached the minimum number 0. Safe mode will be turned 
 off automatically in 28 seconds.'
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.assertTrue(Assert.java:43)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.assertSafeMode(TestHASafeMode.java:493)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.testBlocksAddedWhileStandbyIsDown(TestHASafeMode.java:660)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6100) DataNodeWebHdfsMethods does not failover in HA mode

2014-03-14 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935411#comment-13935411
 ] 

Jing Zhao commented on HDFS-6100:
-

The patch looks pretty good to me. Some minor comments:
# In DatanodeWebHdfsMethods, the current patch has some inconsistent field name 
for the NamenodeAddressParam parameter (nnId, namenodeId, and 
namenodeRpcAddress). How about just calling them namenode since it can be 
either NameService ID or NameNode RPC address?
# Nit: the following code needs some reformat:
{code}
tokenServiceName = HAUtil.isHAEnabled(conf,
nsId) ? nsId : NetUtils.getHostPortString
(rpcServer.getRpcAddress());
{code}
# In the new unit test, we can add some extra checks on the content of the 
newly created file. Also, maybe we can transition the second NN to active 
first so that the first create call also hits a failover.
# Looks like the patch also fixes the token service name for webhdfs in an HA 
setup. Please update the description of the jira accordingly.
# Could you also post your system test results (HA, non-HA, secure, and 
insecure setups, etc.)?
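The naming suggestion in comment #1 hinges on one value serving two roles. A minimal, self-contained sketch of that choice (the class and method names here are illustrative, not the actual patch code):

```java
// Sketch of the idea behind the patch under review: the same URL
// parameter carries either a NameService ID (HA setup) or a NameNode
// RPC address (non-HA setup), which is why a neutral name such as
// "namenode" fits both cases.
class NamenodeParam {
    static String valueFor(boolean haEnabled, String nsId, String rpcHostPort) {
        return haEnabled ? nsId : rpcHostPort;
    }
}
```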

 DataNodeWebHdfsMethods does not failover in HA mode
 ---

 Key: HDFS-6100
 URL: https://issues.apache.org/jira/browse/HDFS-6100
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Haohui Mai
 Attachments: HDFS-6100.000.patch


 In {{DataNodeWebHdfsMethods}}, the code creates a {{DFSClient}} to connect to 
 the NN, so that it can access the files in the cluster. 
 {{DataNodeWebHdfsMethods}} relies on the address passed in the URL to locate 
 the NN. Currently the parameter is set by the NN and it is a host-ip pair, 
 which does not support HA.





[jira] [Commented] (HDFS-6094) The same block can be counted twice towards safe mode threshold

2014-03-17 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13938081#comment-13938081
 ] 

Jing Zhao commented on HDFS-6094:
-

The latest patch looks good to me. +1.

 The same block can be counted twice towards safe mode threshold
 ---

 Key: HDFS-6094
 URL: https://issues.apache.org/jira/browse/HDFS-6094
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal
 Attachments: HDFS-6094.03.patch, HDFS-6904.01.patch, 
 TestHASafeMode-output.txt


 {{BlockManager#addStoredBlock}} can cause the same block to be counted twice 
 towards the safe mode threshold. We see this manifest via 
 {{TestHASafeMode#testBlocksAddedWhileStandbyIsDown}} failures on Ubuntu. More 
 details to follow in a comment.
 Exception details:
 {code}
   Time elapsed: 12.874 sec   FAILURE!
 java.lang.AssertionError: Bad safemode status: 'Safe mode is ON. The reported 
 blocks 7 has reached the threshold 0.9990 of total blocks 6. The number of 
 live datanodes 3 has reached the minimum number 0. Safe mode will be turned 
 off automatically in 28 seconds.'
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.assertTrue(Assert.java:43)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.assertSafeMode(TestHASafeMode.java:493)
 at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.testBlocksAddedWhileStandbyIsDown(TestHASafeMode.java:660)
 {code}





[jira] [Commented] (HDFS-6090) Use MiniDFSCluster.Builder instead of deprecated constructors

2014-03-17 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13938182#comment-13938182
 ] 

Jing Zhao commented on HDFS-6090:
-

Thanks for the cleanup, Akira! The patch looks good to me. +1 pending Jenkins.

 Use MiniDFSCluster.Builder instead of deprecated constructors
 -

 Key: HDFS-6090
 URL: https://issues.apache.org/jira/browse/HDFS-6090
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: test
Affects Versions: 2.3.0
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Minor
  Labels: newbie
 Attachments: HDFS-6090.patch


 Some test classes are using deprecated constructors such as 
 {{MiniDFSCluster(Configuration, int, boolean, String[], String[])}} for 
 building a MiniDFSCluster.
 These classes should use {{MiniDFSCluster.Builder}} to reduce javac warnings 
 and improve code readability.
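For illustration, the cleanup looks roughly like the following (requires the hadoop-hdfs test artifact on the classpath; shown as a sketch, not a line from the patch):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

Configuration conf = new HdfsConfiguration();

// Deprecated constructor style (positional arguments, hard to read):
//   MiniDFSCluster cluster = new MiniDFSCluster(conf, 2, true, null, null);

// Builder style: each option is named explicitly.
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
    .numDataNodes(2)
    .format(true)
    .build();
// ... run the test, then cluster.shutdown() in a finally block.
```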





[jira] [Updated] (HDFS-6090) Use MiniDFSCluster.Builder instead of deprecated constructors

2014-03-17 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6090:


   Resolution: Fixed
Fix Version/s: 2.4.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

I've committed the patch to trunk, branch-2, and branch-2.4. Thanks [~ajisakaa] 
for the contribution.

 Use MiniDFSCluster.Builder instead of deprecated constructors
 -

 Key: HDFS-6090
 URL: https://issues.apache.org/jira/browse/HDFS-6090
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: test
Affects Versions: 2.3.0
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Minor
  Labels: newbie
 Fix For: 2.4.0

 Attachments: HDFS-6090.patch


 Some test classes are using deprecated constructors such as 
 {{MiniDFSCluster(Configuration, int, boolean, String[], String[])}} for 
 building a MiniDFSCluster.
 These classes should use {{MiniDFSCluster.Builder}} to reduce javac warnings 
 and improve code readability.





[jira] [Commented] (HDFS-6113) Rolling upgrade exception

2014-03-17 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13938811#comment-13938811
 ] 

Jing Zhao commented on HDFS-6113:
-

Hi Fengdong, thanks for testing. But hadoop 2.3 does not support rolling 
upgrade... And HA upgrade support also starts only from 2.4. Also, please check 
the document for rolling upgrade detailed steps.

 Rolling upgrade exception
 

 Key: HDFS-6113
 URL: https://issues.apache.org/jira/browse/HDFS-6113
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Fengdong Yu

 I have a hadoop-2.3 cluster running in non-secure mode. Then I built a trunk 
 instance, also non-secure.
 NN1 - active
 NN2 - standby
 DN1 - datanode 
 DN2 - datanode
 JN1,JN2,JN3 - Journal and ZK
 then on the NN2:
 {code}
 hadoop-daemon.sh stop namenode
 hadoop-daemon.sh stop zkfc
 {code}
 then:
 change the environment variables to point to the new hadoop (trunk version)
 then:
 {code}
 hadoop-daemon.sh start namenode
 {code}
 NN2 throws exception:
 {code}
 org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not journal 
 CTime for one more JournalNodes. 1 exceptions thrown:
 10.100.91.33:8485: Failed on local exception: java.io.EOFException; Host 
 Details : local host is: 10-204-8-136/10.204.8.136; destination host is: 
 jn33.com:8485;
 at 
 org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
 at 
 org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
 at 
 org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.getJournalCTime(QuorumJournalManager.java:631)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.getSharedLogCTime(FSEditLog.java:1383)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.initEditLog(FSImage.java:738)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:600)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:360)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:258)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:894)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:653)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:444)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:500)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:656)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:641)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1294)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1360)
 {code}
 JN throws Exception:
 {code}
 2014-03-18 12:19:01,960 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 8485: readAndProcess threw exception java.io.IOException: Unable 
 to read authentication method from client 10.204.8.136. Count of bytes read: 0
 java.io.IOException: Unable to read authentication method
   at 
 org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1344)
   at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:761)
   at 
 org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:560)
   at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:535)
 2014-03-18 12:19:01,960 DEBUG org.apache.hadoop.ipc.Server: IPC Server 
 listener on 8485: disconnecting client 10.204.8.136:39063. Number of active 
 connections: 1
 {code}





[jira] [Comment Edited] (HDFS-6113) Rolling upgrade exception

2014-03-17 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13938811#comment-13938811
 ] 

Jing Zhao edited comment on HDFS-6113 at 3/18/14 4:49 AM:
--

Hi Fengdong, thanks for testing. But hadoop 2.3 does not support rolling 
upgrade... And HA upgrade support only starts from 2.4. Also, please check the 
document for rolling upgrade detailed steps.


was (Author: jingzhao):
Hi Fengdong, thanks for testing. But hadoop 2.3 does not support rolling 
upgrade... And HA upgrade support also starts only from 2.4. Also, please check 
the document for rolling upgrade detailed steps.

 Rolling upgrade exception
 

 Key: HDFS-6113
 URL: https://issues.apache.org/jira/browse/HDFS-6113
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Fengdong Yu

 I have a hadoop-2.3 cluster running in non-secure mode. Then I built a trunk 
 instance, also non-secure.
 NN1 - active
 NN2 - standby
 DN1 - datanode 
 DN2 - datanode
 JN1,JN2,JN3 - Journal and ZK
 then on the NN2:
 {code}
 hadoop-daemon.sh stop namenode
 hadoop-daemon.sh stop zkfc
 {code}
 then:
 change the environment variables to point to the new hadoop (trunk version)
 then:
 {code}
 hadoop-daemon.sh start namenode
 {code}
 NN2 throws exception:
 {code}
 org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not journal 
 CTime for one more JournalNodes. 1 exceptions thrown:
 10.100.91.33:8485: Failed on local exception: java.io.EOFException; Host 
 Details : local host is: 10-204-8-136/10.204.8.136; destination host is: 
 jn33.com:8485;
 at 
 org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
 at 
 org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
 at 
 org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.getJournalCTime(QuorumJournalManager.java:631)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.getSharedLogCTime(FSEditLog.java:1383)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.initEditLog(FSImage.java:738)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:600)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:360)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:258)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:894)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:653)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:444)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:500)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:656)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:641)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1294)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1360)
 {code}
 JN throws Exception:
 {code}
 2014-03-18 12:19:01,960 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 8485: readAndProcess threw exception java.io.IOException: Unable 
 to read authentication method from client 10.204.8.136. Count of bytes read: 0
 java.io.IOException: Unable to read authentication method
   at 
 org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1344)
   at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:761)
   at 
 org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:560)
   at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:535)
 2014-03-18 12:19:01,960 DEBUG org.apache.hadoop.ipc.Server: IPC Server 
 listener on 8485: disconnecting client 10.204.8.136:39063. Number of active 
 connections: 1
 {code}





[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended

2014-03-18 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13938931#comment-13938931
 ] 

Jing Zhao commented on HDFS-6089:
-

Thanks for the comments, Andrew and Todd!

bq. In EditLogTailer#doTailEdits, I believe that rolling the edit log right 
before is intended to freshen up the edit log for consumption by the SbNN.
But in the current code, the auto trigger still runs periodically, 
which means we cannot guarantee that we roll the edit log before we call 
doTailEdits. During the failover, we call editLog.recoverUnclosedStreams() and 
EditLogTailer#catchupDuringFailover in FSNamesystem#startActiveServices to 
guarantee the SBN can tail all the edit log. But before failover, if we make 
the auto roller on the active NN more aggressive (as you suggested), we can 
still guarantee that the SBN will not do a lot of replay on a failover. What do 
you think?

bq. we'll need to update its check period and thresholds to be more aggressive.
Yes, agree. We should assign a smaller value to the sleep interval (maybe 2min 
just like the SBN).

bq. Maybe we should just have a shorter timeout on the rollEditLog call. Or 
somehow..
We can also do this. But having two auto rollers working in the two NNs at the 
same time still does not seem that necessary to me.

 Standby NN while transitioning to active throws a connection refused error 
 when the prior active NN process is suspended
 

 Key: HDFS-6089
 URL: https://issues.apache.org/jira/browse/HDFS-6089
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Jing Zhao
 Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch


 The following scenario was tested:
 * Determine Active NN and suspend the process (kill -19)
 * Wait about 60s to let the standby transition to active
 * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
 active.
 What was noticed was that sometimes the call to get the service state of nn2 
 got a socket timeout exception.





[jira] [Resolved] (HDFS-6113) Rolling upgrade exception

2014-03-18 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao resolved HDFS-6113.
-

Resolution: Invalid

Based on Kihwal's and Nicholas's comments, let's close this jira for now. 
Fengdong, thanks for the testing, and please feel free to open new jiras if 
you think there are other issues.

 Rolling upgrade exception
 

 Key: HDFS-6113
 URL: https://issues.apache.org/jira/browse/HDFS-6113
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Fengdong Yu

 I have a hadoop-2.3 cluster running in non-secure mode. Then I built a trunk 
 instance, also non-secure.
 NN1 - active
 NN2 - standby
 DN1 - datanode 
 DN2 - datanode
 JN1,JN2,JN3 - Journal and ZK
 then on the NN2:
 {code}
 hadoop-daemon.sh stop namenode
 hadoop-daemon.sh stop zkfc
 {code}
 then:
 change the environment variables to point to the new hadoop (trunk version)
 then:
 {code}
 hadoop-daemon.sh start namenode
 {code}
 NN2 throws exception:
 {code}
 org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not journal 
 CTime for one more JournalNodes. 1 exceptions thrown:
 10.100.91.33:8485: Failed on local exception: java.io.EOFException; Host 
 Details : local host is: 10-204-8-136/10.204.8.136; destination host is: 
 jn33.com:8485;
 at 
 org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
 at 
 org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
 at 
 org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.getJournalCTime(QuorumJournalManager.java:631)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.getSharedLogCTime(FSEditLog.java:1383)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.initEditLog(FSImage.java:738)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:600)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:360)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:258)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:894)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:653)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:444)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:500)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:656)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:641)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1294)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1360)
 {code}
 JN throws Exception:
 {code}
 2014-03-18 12:19:01,960 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 8485: readAndProcess threw exception java.io.IOException: Unable 
 to read authentication method from client 10.204.8.136. Count of bytes read: 0
 java.io.IOException: Unable to read authentication method
   at 
 org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1344)
   at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:761)
   at 
 org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:560)
   at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:535)
 2014-03-18 12:19:01,960 DEBUG org.apache.hadoop.ipc.Server: IPC Server 
 listener on 8485: disconnecting client 10.204.8.136:39063. Number of active 
 connections: 1
 {code}





[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended

2014-03-18 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13939848#comment-13939848
 ] 

Jing Zhao commented on HDFS-6089:
-

Thanks for the response, Andrew. 

bq. If we add a time threshold (like the tailer), we want to avoid the reverse 
problem: a lot of small segments accumulating in the absence of a standby.
Could you please explain how we avoid this issue with the current strategy?
For the auto roller in the ANN, I guess it should still decide whether to roll 
based on the # of edits; however, we should change its sleep interval from 5min 
to a smaller number (e.g., 2min), which means it will check the # of edits 
every 2min and roll if necessary. Can this address your concern? Or am I 
missing something here?

 Standby NN while transitioning to active throws a connection refused error 
 when the prior active NN process is suspended
 

 Key: HDFS-6089
 URL: https://issues.apache.org/jira/browse/HDFS-6089
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Jing Zhao
 Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch


 The following scenario was tested:
 * Determine Active NN and suspend the process (kill -19)
 * Wait about 60s to let the standby transition to active
 * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
 active.
 What was noticed was that sometimes the call to get the service state of nn2 
 got a socket timeout exception.





[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended

2014-03-19 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940229#comment-13940229
 ] 

Jing Zhao commented on HDFS-6089:
-

Hi Andrew, thanks for the explanation. I guess I understand your concern now: 
rolling on the ANN based only on the # of edits may cause issues in some 
scenarios, because if there are no further operations, the SBN may wait a long 
time to tail the edits that sit in an in-progress segment.

bq. Checkpointing combines the edit log with the fsimage, and we purge 
unnecessary log segments afterwards.
But I'm still a little confused about this part. I fail to see the difference 
between time-based rolling triggered from the SBN and from the ANN. In the 
current code, the SBN still triggers rolling through an RPC to the ANN. Also, 
this does not affect checkpointing and purging: when the SBN does a checkpoint, 
both the SBN and the ANN purge old edits in their own storage (the SBN purges 
before uploading the checkpoint, and the ANN does so after getting the new 
fsimage).

So I guess a possible solution may be: just let the ANN roll every 2min. I 
think this can achieve almost the same effect as the current mechanism, without 
delaying the failover. Or do you see any counterexamples with this change?

Back to the solution of changing the RPC timeout. It looks like we have not 
set a timeout for this NN-to-NN RPC right now (correct me if I'm wrong). 
Setting a timeout (e.g., 20s, just like the default timeout from client to NN) 
can of course improve the failover time in our test case, but I still prefer 
the above solution because it makes the rolling behavior simpler and more 
predictable (especially since it removes the RPC call from the SBN to the ANN).
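The trade-off being debated can be captured in a tiny model (hypothetical, not the actual EditLogTailer or roller code): rolling purely on a timer would produce empty segments on an idle cluster, so the decision stays edits-count based and only the check interval shrinks.

```java
// Roll only when enough transactions have accumulated since the last
// roll; checking more frequently (e.g. every 2min instead of 5min)
// bounds how stale the SBN can get, without producing many tiny
// segments when the cluster is idle.
class AutoRollPolicy {
    private final long txnThreshold;

    AutoRollPolicy(long txnThreshold) {
        this.txnThreshold = txnThreshold;
    }

    // Called on each periodic check; time alone never triggers a roll.
    boolean shouldRoll(long txnsSinceLastRoll) {
        return txnsSinceLastRoll >= txnThreshold;
    }
}
```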

 Standby NN while transitioning to active throws a connection refused error 
 when the prior active NN process is suspended
 

 Key: HDFS-6089
 URL: https://issues.apache.org/jira/browse/HDFS-6089
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Jing Zhao
 Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch


 The following scenario was tested:
 * Determine Active NN and suspend the process (kill -19)
 * Wait about 60s to let the standby transition to active
 * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
 active.
 What was noticed was that sometimes the call to get the service state of nn2 
 got a socket timeout exception.





[jira] [Updated] (HDFS-6100) DataNodeWebHdfsMethods does not failover in HA mode

2014-03-19 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6100:


   Resolution: Fixed
Fix Version/s: 2.4.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

I've committed the patch to trunk, branch-2, and branch-2.4. Thanks [~wheat9] 
for the contribution.

 DataNodeWebHdfsMethods does not failover in HA mode
 ---

 Key: HDFS-6100
 URL: https://issues.apache.org/jira/browse/HDFS-6100
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Haohui Mai
 Fix For: 2.4.0

 Attachments: HDFS-6100.000.patch, HDFS-6100.001.patch


 In {{DataNodeWebHdfsMethods}}, the code creates a {{DFSClient}} to connect to 
 the NN, so that it can access the files in the cluster. 
 {{DataNodeWebHdfsMethods}} relies on the address passed in the URL to locate 
 the NN. This implementation has two problems:
 # The {{DFSClient}} only knows about the current active NN, thus it does not 
 support failover.
 # The delegation token is based on the active NN, therefore the {{DFSClient}} 
 will fail to authenticate with the standby NN in a secure HA setup.
 Currently the parameter {{namenoderpcaddress}} in the URL stores the host-ip 
 pair that corresponds to the active NN. To fix this bug, this jira proposes 
 to store the name service id in the parameter in HA setup (yet the parameter 
 stays the same in non-HA setup).





[jira] [Commented] (HDFS-6100) DataNodeWebHdfsMethods does not failover in HA mode

2014-03-19 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940716#comment-13940716
 ] 

Jing Zhao commented on HDFS-6100:
-

+1 for the latest patch.

 DataNodeWebHdfsMethods does not failover in HA mode
 ---

 Key: HDFS-6100
 URL: https://issues.apache.org/jira/browse/HDFS-6100
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Haohui Mai
 Attachments: HDFS-6100.000.patch, HDFS-6100.001.patch


 In {{DataNodeWebHdfsMethods}}, the code creates a {{DFSClient}} to connect to 
 the NN, so that it can access the files in the cluster. 
 {{DataNodeWebHdfsMethods}} relies on the address passed in the URL to locate 
 the NN. This implementation has two problems:
 # The {{DFSClient}} only knows about the current active NN, thus it does not 
 support failover.
 # The delegation token is based on the active NN, therefore the {{DFSClient}} 
 will fail to authenticate with the standby NN in a secure HA setup.
 Currently the parameter {{namenoderpcaddress}} in the URL stores the host-ip 
 pair that corresponds to the active NN. To fix this bug, this jira proposes 
 to store the name service id in the parameter in HA setup (yet the parameter 
 stays the same in non-HA setup).





[jira] [Commented] (HDFS-6105) NN web UI for DN list loads the same jmx page multiple times.

2014-03-19 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941088#comment-13941088
 ] 

Jing Zhao commented on HDFS-6105:
-

+1

 NN web UI for DN list loads the same jmx page multiple times.
 -

 Key: HDFS-6105
 URL: https://issues.apache.org/jira/browse/HDFS-6105
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: Kihwal Lee
Assignee: Haohui Mai
 Attachments: HDFS-6105.000.patch, datanodes-tab.png


 When loading Datanodes page of the NN web UI, the same jmx query is made 
 multiple times. For a big cluster, that's a lot of data and overhead.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended

2014-03-19 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941117#comment-13941117
 ] 

Jing Zhao commented on HDFS-6089:
-

bq. This is because if we don't have further operations it is possible that SBN 
will wait a long time to tail that part of edits which is in an in-progress 
segment.
bq. In this scenario, the ANN will keep rolling every 2mins, generating a lot 
of edit log segments that aren't being cleared out.
Hmm, actually my thought yesterday was not correct. Yes, we cannot do auto 
rolling based simply on time, and the reason is just as [~andrew.wang] 
pointed out.

Hopefully this is my last question, just to make sure: the current SBN auto 
roller can cause the same issue (a lot of edit log segments not being cleared 
out) in case the checkpoints are broken (but the SBN is not down), right?

Anyway, I will post a patch to add the RPC timeout later.

 Standby NN while transitioning to active throws a connection refused error 
 when the prior active NN process is suspended
 

 Key: HDFS-6089
 URL: https://issues.apache.org/jira/browse/HDFS-6089
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Jing Zhao
 Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch


 The following scenario was tested:
 * Determine Active NN and suspend the process (kill -19)
 * Wait about 60s to let the standby transition to active
 * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
 active.
 What was noticed was that sometimes the call to get the service state of nn2 
 got a socket timeout exception.





[jira] [Commented] (HDFS-6127) WebHDFS tokens cannot be renewed in HA setup

2014-03-19 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941149#comment-13941149
 ] 

Jing Zhao commented on HDFS-6127:
-

The patch looks good to me. Some comments:
# Nit: HAUtil#getServiceUriFromToken and TokenManager#getInstance need 
reformatting.
# The javadoc of HAUtil#getServiceUriFromToken needs to be updated after the 
change: the method can now support URIs of other filesystems, not just HDFS.
# The change in TestWebhdfsForHA actually weakens our unit test, I think. We 
still need the fs.renew and fs.cancel calls for regression of HDFS-5339. A 
separate unit test with some copied code should be fine here, I guess.
{code}
-  fs.renewDelegationToken(token);
-  fs.cancelDelegationToken(token);
+  token.renew(conf);
+  token.cancel(conf);
{code}

+1 after addressing the comments.

 WebHDFS tokens cannot be renewed in HA setup
 

 Key: HDFS-6127
 URL: https://issues.apache.org/jira/browse/HDFS-6127
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Haohui Mai
 Attachments: HDFS-6127.000.patch


 {{TokenAspect}} class assumes that the service name of the token is always a 
 host-ip pair. In an HA setup, however, the service name becomes the name 
 service id, which breaks the assumption. As a result, WebHDFS tokens cannot 
 be renewed in an HA setup.
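The broken assumption can be illustrated with a small check (a hypothetical helper, not the actual TokenAspect code): a host-ip service name ends in a numeric port, while a logical nameservice id does not.

```java
class TokenServiceNames {
    // True when the token's service name looks like "host:port";
    // false for a logical (HA) nameservice id such as "mycluster",
    // which has no numeric port suffix.
    static boolean isHostPortPair(String service) {
        int idx = service.lastIndexOf(':');
        if (idx < 0 || idx == service.length() - 1) {
            return false; // no colon or nothing after it: logical name
        }
        try {
            Integer.parseInt(service.substring(idx + 1));
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```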





[jira] [Commented] (HDFS-6127) WebHDFS tokens cannot be renewed in HA setup

2014-03-19 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941171#comment-13941171
 ] 

Jing Zhao commented on HDFS-6127:
-

bq. The change in TestWebhdfsForHA actually weakens our unit test, I think
Actually it will not, since the fs.renew and fs.cancel will still be called in 
the end. So please just ignore this comment.

 WebHDFS tokens cannot be renewed in HA setup
 

 Key: HDFS-6127
 URL: https://issues.apache.org/jira/browse/HDFS-6127
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Haohui Mai
 Attachments: HDFS-6127.000.patch


 {{TokenAspect}} class assumes that the service name of the token is always a 
 host-ip pair. In an HA setup, however, the service name becomes the name 
 service id, which breaks the assumption. As a result, WebHDFS tokens cannot 
 be renewed in an HA setup.





[jira] [Updated] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended

2014-03-19 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6089:


Attachment: HDFS-6089.002.patch

Patch that adds an RPC timeout for the rollEditLog call. I set the default 
timeout to 20s.
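The general pattern here (bounding a blocking remote call with a timeout so the caller cannot hang on an unresponsive peer) can be sketched without any Hadoop dependencies; the names below are illustrative, not the actual patch:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: run a blocking call (like an RPC) on another thread and wait
// at most timeoutMs for the result.
class TimedCall {
  static <T> T callWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
    ExecutorService ex = Executors.newSingleThreadExecutor();
    try {
      Future<T> f = ex.submit(call);
      return f.get(timeoutMs, TimeUnit.MILLISECONDS); // throws TimeoutException
    } catch (ExecutionException e) {
      throw (Exception) e.getCause(); // unwrap the call's own failure
    } finally {
      ex.shutdownNow(); // interrupt the call if it is still running
    }
  }
}
```

With a 20s default, the failover path would give up on an unresponsive active NN instead of blocking indefinitely, e.g. {{callWithTimeout(() -> rollEditLog(), 20_000)}}.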

 Standby NN while transitioning to active throws a connection refused error 
 when the prior active NN process is suspended
 

 Key: HDFS-6089
 URL: https://issues.apache.org/jira/browse/HDFS-6089
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Jing Zhao
 Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch, 
 HDFS-6089.002.patch


 The following scenario was tested:
 * Determine Active NN and suspend the process (kill -19)
 * Wait about 60s to let the standby transition to active
 * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
 active.
 What was noticed was that sometimes the call to get the service state of nn2 
 got a socket timeout exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6120) Cleanup safe mode error messages

2014-03-20 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942052#comment-13942052
 ] 

Jing Zhao commented on HDFS-6120:
-

Thanks for the cleanup, Arpit! The patch looks good to me. A few minor comments:
# I guess here we should use {{if (needEnter())}}?
{code}
+  if (!needEnter()) {
+    reportStatus("STATE* Safe mode ON, thresholds not met.", false);
+  }
{code}
# For {{getTurnOffTip()}}, do you think we need an extra message reporting the 
value of {{reached}} and the corresponding status (safe mode off, on, or in 
extension)? Even after the NN enters the safemode extension, the safe block 
count may still fall below the threshold.
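To make the first point concrete: with {{!needEnter()}} the "thresholds not met" status would be reported exactly when the thresholds *are* met. A toy model of the intended guard (simplified invented names, not the actual SafeModeInfo code):

```java
// Toy model of the safe-mode check discussed above: report
// "thresholds not met" only while needEnter() is still true.
class SafeModeCheck {
  final long threshold;  // number of blocks required to leave safe mode
  long safeBlocks;       // blocks reported so far

  SafeModeCheck(long threshold) { this.threshold = threshold; }

  boolean needEnter() { return safeBlocks < threshold; }

  String status() {
    if (needEnter()) {   // note: needEnter(), not !needEnter()
      return "Safe mode ON, thresholds not met.";
    }
    return "Safe mode ON, thresholds met.";
  }
}
```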

 Cleanup safe mode error messages
 

 Key: HDFS-6120
 URL: https://issues.apache.org/jira/browse/HDFS-6120
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.3.0
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal
 Attachments: HDFS-6120.01.patch


 In HA mode the SBN can enter safe-mode extension and stay there even after 
 the extension period has elapsed, but continue to return the safemode message 
 stating that "The threshold has been reached and safe mode will be turned 
 off soon."



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Moved] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS

2014-03-20 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao moved HADOOP-10415 to HDFS-6131:
--

  Component/s: (was: documentation)
   documentation
Affects Version/s: (was: 2.3.0)
   2.3.0
  Key: HDFS-6131  (was: HADOOP-10415)
  Project: Hadoop HDFS  (was: Hadoop Common)

 Move HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
 

 Key: HDFS-6131
 URL: https://issues.apache.org/jira/browse/HDFS-6131
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.3.0
Reporter: Jing Zhao
Assignee: Jing Zhao

 Currently in branch-2, the documents HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should 
 move them to HDFS, just as in trunk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS

2014-03-20 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6131:


Attachment: HDFS-6131.000.patch

 Move HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
 

 Key: HDFS-6131
 URL: https://issues.apache.org/jira/browse/HDFS-6131
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.3.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-6131.000.patch


 Currently in branch-2, the documents HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should 
 move them to HDFS, just as in trunk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS

2014-03-20 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6131:


Status: Open  (was: Patch Available)

 Move HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
 

 Key: HDFS-6131
 URL: https://issues.apache.org/jira/browse/HDFS-6131
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.3.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-6131.000.patch


 Currently in branch-2, the documents HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should 
 move them to HDFS, just as in trunk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS

2014-03-20 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942128#comment-13942128
 ] 

Jing Zhao commented on HDFS-6131:
-

Uploaded the patch for branch-2.

 Move HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
 

 Key: HDFS-6131
 URL: https://issues.apache.org/jira/browse/HDFS-6131
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.3.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-6131.000.patch


 Currently in branch-2, the documents HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should 
 move them to HDFS, just as in trunk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS

2014-03-20 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6131:


Status: Patch Available  (was: Open)

 Move HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
 

 Key: HDFS-6131
 URL: https://issues.apache.org/jira/browse/HDFS-6131
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.3.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-6131.000.patch


 Currently in branch-2, the documents HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should 
 move them to HDFS, just as in trunk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS

2014-03-20 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942212#comment-13942212
 ] 

Jing Zhao commented on HDFS-6131:
-

Ahh, yes. Thanks Arpit! Let me update the patch.

 Move HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
 

 Key: HDFS-6131
 URL: https://issues.apache.org/jira/browse/HDFS-6131
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.3.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-6131.000.patch


 Currently in branch-2, the documents HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should 
 move them to HDFS, just as in trunk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS

2014-03-20 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6131:


Attachment: HDFS-6131.001.patch

 Move HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
 

 Key: HDFS-6131
 URL: https://issues.apache.org/jira/browse/HDFS-6131
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.3.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-6131.000.patch, HDFS-6131.001.patch


 Currently in branch-2, the documents HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should 
 move them to HDFS, just as in trunk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (HDFS-6131) Move HDFSHighAvailabilityWithNFS.apt.vm and HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS

2014-03-20 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao resolved HDFS-6131.
-

   Resolution: Fixed
Fix Version/s: 2.4.0
 Hadoop Flags: Reviewed

Thanks very much to Arpit for the review! I've committed this to branch-2 and 
branch-2.4.0.

 Move HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm from Yarn to HDFS
 

 Key: HDFS-6131
 URL: https://issues.apache.org/jira/browse/HDFS-6131
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.3.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Fix For: 2.4.0

 Attachments: HDFS-6131.000.patch, HDFS-6131.001.patch


 Currently in branch-2, the documents HDFSHighAvailabilityWithNFS.apt.vm and 
 HDFSHighAvailabilityWithQJM.apt.vm are still in the Yarn project. We should 
 move them to HDFS, just as in trunk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6038) Allow JournalNode to handle editlog produced by new release with future layoutversion

2014-03-20 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942461#comment-13942461
 ] 

Jing Zhao commented on HDFS-6038:
-

Thanks for the review, Nicholas! I will commit the patch shortly.

 Allow JournalNode to handle editlog produced by new release with future 
 layoutversion
 -

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: journal-node, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, 
 HDFS-6038.008.patch, editsStored


 In an HA setup, the JNs receive edit logs (as blobs) from the NN and write them 
 into edit log files. In order to write well-formed edit log files, the JNs 
 prepend a header to each edit log file. The problem is that the JN hard-codes 
 the version (i.e., {{NameNodeLayoutVersion}}) in the edit log, so it 
 generates incorrect edit logs when a newer release bumps the 
 {{NameNodeLayoutVersion}} during a rolling upgrade.
 Meanwhile, the JN currently tries to decode the in-progress editlog 
 segment in order to learn the last txid in the segment. In the rolling-upgrade 
 scenario, a JN with the old software may not be able to correctly decode 
 an editlog generated by the new software.
 This jira makes the following changes to allow the JN to handle editlogs 
 produced by software with a future layoutversion:
 1. Change the NN-to-JN startLogSegment RPC signature and let the NN specify the 
 layoutversion for the new editlog segment.
 2. Persist a length field for each editlog op to indicate the total length of 
 the op. Instead of calling EditLogFileInputStream#validateEditLog to get the 
 last txid of an in-progress editlog segment, a new method, scanEditLog, is 
 added and used by the JN; it does not decode each editlog op but uses the 
 length to jump quickly to the next op.
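The idea in point 2 above (skipping ops via a persisted length field instead of decoding them) can be sketched as follows. This is a simplified record format for illustration, not the real editlog encoding:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of length-prefixed records: each op is written as
// [int length][long txid][payload], so a scanner can find the last txid
// by jumping over payloads without decoding them.
class LengthPrefixedLog {
  static void writeOp(DataOutputStream out, long txid, byte[] payload)
      throws IOException {
    out.writeInt(8 + payload.length); // length of txid + payload
    out.writeLong(txid);
    out.write(payload);
  }

  /** Scan to the last complete op's txid without decoding payloads. */
  static long scanLastTxid(byte[] log) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(log));
    long last = -1;
    while (in.available() >= 4) {
      int len = in.readInt();
      if (in.available() < len) break; // truncated in-progress tail: stop
      last = in.readLong();
      in.skipBytes(len - 8);           // jump over the payload, no decoding
    }
    return last;
  }
}
```

This is why an old JN can tolerate ops written by newer software: it only needs the length prefix, never the op's internal format.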



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6038) Allow JournalNode to handle editlog produced by new release with future layoutversion

2014-03-20 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6038:


   Resolution: Fixed
Fix Version/s: 2.4.0
   Status: Resolved  (was: Patch Available)

I've committed the patch to trunk, branch-2, and branch-2.4. Thanks Nicholas 
and Todd for the review!

 Allow JournalNode to handle editlog produced by new release with future 
 layoutversion
 -

 Key: HDFS-6038
 URL: https://issues.apache.org/jira/browse/HDFS-6038
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: journal-node, namenode
Reporter: Haohui Mai
Assignee: Jing Zhao
 Fix For: 2.4.0

 Attachments: HDFS-6038.000.patch, HDFS-6038.001.patch, 
 HDFS-6038.002.patch, HDFS-6038.003.patch, HDFS-6038.004.patch, 
 HDFS-6038.005.patch, HDFS-6038.006.patch, HDFS-6038.007.patch, 
 HDFS-6038.008.patch, editsStored


 In an HA setup, the JNs receive edit logs (as blobs) from the NN and write them 
 into edit log files. In order to write well-formed edit log files, the JNs 
 prepend a header to each edit log file. The problem is that the JN hard-codes 
 the version (i.e., {{NameNodeLayoutVersion}}) in the edit log, so it 
 generates incorrect edit logs when a newer release bumps the 
 {{NameNodeLayoutVersion}} during a rolling upgrade.
 Meanwhile, the JN currently tries to decode the in-progress editlog 
 segment in order to learn the last txid in the segment. In the rolling-upgrade 
 scenario, a JN with the old software may not be able to correctly decode 
 an editlog generated by the new software.
 This jira makes the following changes to allow the JN to handle editlogs 
 produced by software with a future layoutversion:
 1. Change the NN-to-JN startLogSegment RPC signature and let the NN specify the 
 layoutversion for the new editlog segment.
 2. Persist a length field for each editlog op to indicate the total length of 
 the op. Instead of calling EditLogFileInputStream#validateEditLog to get the 
 last txid of an in-progress editlog segment, a new method, scanEditLog, is 
 added and used by the JN; it does not decode each editlog op but uses the 
 length to jump quickly to the next op.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

