[jira] [Updated] (HDFS-4827) Slight update to the implementation of API for handling favored nodes in DFSClient

2013-05-28 Thread Tsz Wo (Nicholas), SZE (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo (Nicholas), SZE updated HDFS-4827:
-

Hadoop Flags: Reviewed

+1 patch looks good.

 Slight update to the implementation of API for handling favored nodes in 
 DFSClient
 --

 Key: HDFS-4827
 URL: https://issues.apache.org/jira/browse/HDFS-4827
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.0.5-beta
Reporter: Devaraj Das
Assignee: Devaraj Das
 Fix For: 2.0.5-beta

 Attachments: hdfs-4827-1.txt


 Currently, the favoredNodes flavor of the DFSClient.create implementation 
 does a call to _inetSocketAddressInstance.getAddress().getHostAddress()_. This 
 wouldn't work if the inetSocketAddressInstance is unresolved (an instance 
 created via InetSocketAddress.createUnresolved()). The DFSClient API should 
 handle both cases of favored-nodes' InetSocketAddresses (resolved and 
 unresolved) passed to it.
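 As a purely illustrative sketch (not the attached patch), a client-side 
 helper along these lines would cover both cases, since getAddress() returns 
 null for an InetSocketAddress built with createUnresolved():
 {code}
 import java.net.InetSocketAddress;

 class FavoredNodeUtil {
   // Hypothetical helper: normalize a favored node to "host:port" whether or
   // not the InetSocketAddress has been resolved.
   static String toHostPort(InetSocketAddress addr) {
     String host = (addr.getAddress() != null)
         ? addr.getAddress().getHostAddress()  // resolved: use the IP address
         : addr.getHostName();                 // unresolved: use the supplied host name
     return host + ":" + addr.getPort();
   }
 }
 {code}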



[jira] [Created] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2013-05-28 Thread Jagane Sundar (JIRA)
Jagane Sundar created HDFS-4858:
---

 Summary: HDFS DataNode to NameNode RPC should timeout
 Key: HDFS-4858
 URL: https://issues.apache.org/jira/browse/HDFS-4858
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.0.4-alpha, 3.0.0, 2.0.5-beta, 2.0.4.1-alpha
 Environment: Redhat/CentOS 6.4 64 bit Linux
Reporter: Jagane Sundar
Priority: Minor
 Fix For: 3.0.0, 2.0.5-beta


The DataNode is configured with ipc.client.ping false and ipc.ping.interval 
14000. This configuration means that the IPC Client (DataNode, in this case) 
should time out in 14000 milliseconds (14 seconds) if the Standby NameNode 
does not respond to a sendHeartbeat.
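
As a sketch, here is the same configuration expressed with Hadoop's 
Configuration API (with ipc.client.ping disabled, ipc.ping.interval acts as 
the client-side RPC timeout in milliseconds):
{code}
import org.apache.hadoop.conf.Configuration;

// Settings from this report: disable the keep-alive ping so that
// ipc.ping.interval (14000 ms = 14 s) acts as the RPC call timeout.
Configuration conf = new Configuration();
conf.setBoolean("ipc.client.ping", false);
conf.setInt("ipc.ping.interval", 14000);
{code}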

What we observe is this: If the Standby NameNode happens to reboot for any 
reason, the DataNodes that are heartbeating to this Standby get stuck forever 
while trying to sendHeartbeat. See Stack trace included below. When the Standby 
NameNode comes back up, we find that the DataNode never re-registers with the 
Standby NameNode. Thereafter failover completely fails.

The desired behavior is that the DataNode's sendHeartbeat should time out in 14 
seconds, and keep retrying until the Standby NameNode comes back up. When it 
does, the DataNode should reconnect, re-register, and offer service.

Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the 
method createNamenode should use RPC.getProtocolProxy and not RPC.getProxy to 
create the DatanodeProtocolPB object.

Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:

Thread 25 (DataNode: [file:///opt/hadoop/data]  heartbeating to 
vmhost6-vm1/10.10.10.151:8020):
  State: WAITING
  Blocked count: 23843
  Waited count: 45676
  Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
  Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.ipc.Client.call(Client.java:1220)

org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)

org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)

org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)

org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)

org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)

org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)

org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
java.lang.Thread.run(Thread.java:662)

DataNode RPC to Standby NameNode never times out. 



[jira] [Commented] (HDFS-4832) Namenode doesn't change the number of missing blocks in safemode when DNs rejoin or leave

2013-05-28 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668318#comment-13668318
 ] 

Kihwal Lee commented on HDFS-4832:
--

Here are some comments:
* The condition for detecting non-initial safe mode: the HA state is already 
checked in the namenode method, so you don't have to check it again.
* isInStartupSafeMode() returns true for any automatic safe mode. E.g. if the 
resource checker puts the NN in safe mode, it will return true.
* The existing code drained scheduled work in safe mode, but the patch makes it 
immediately stop sending scheduled work to DNs. This seems like correct 
behavior for safe mode, but that work can still be sent out after leaving safe 
mode, which may not be ideal.  E.g. if the NN is suffering from flaky DNS, DNs 
will appear dead, come back, and appear dead again, generating a lot of 
invalidation and replication work. Admins may put the NN in safe mode to safely 
ride out the storm. When they do, the unnecessary work needs to stop rather 
than merely being delayed.  Please make sure unintended damage does not occur 
after leaving safe mode.


 Namenode doesn't change the number of missing blocks in safemode when DNs 
 rejoin or leave
 -

 Key: HDFS-4832
 URL: https://issues.apache.org/jira/browse/HDFS-4832
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta
Reporter: Ravi Prakash
Assignee: Ravi Prakash
Priority: Critical
 Attachments: HDFS-4832.patch, HDFS-4832.patch


 Courtesy Karri VRK Reddy!
 {quote}
 1. Namenode lost datanodes causing missing blocks
 2. Namenode was put in safe mode
 3. Datanode restarted on dead nodes 
 4. Waited for lots of time for the NN UI to reflect the recovered blocks.
 5. Forced NN out of safe mode and suddenly,  no more missing blocks anymore.
 {quote}
 I was able to replicate this on 0.23 and trunk. I set 
 dfs.namenode.heartbeat.recheck-interval to 1 and killed the DN to simulate a 
 lost datanode. The opposite case also has problems (i.e. a Datanode failing 
 while the NN is in safemode doesn't lead to a missing-blocks message).
 Without the NN updating this list of missing blocks, the grid admins will not 
 know when to take the cluster out of safemode.



[jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.

2013-05-28 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668329#comment-13668329
 ] 

Matthew Farrellee commented on HDFS-4849:
-

As a member of the community trying to create a FileSystem implementation, I 
view these proposed changes as significant deviations from the semantics that 
are being described as part of HADOOP-9371. The changes will better some use 
cases while worsening others. The ability to implement them across all 
FileSystems will also vary dramatically.

Please discuss the possibility of FileSystems optionally implementing these 
enhanced semantics as part of HADOOP-9371, and do not add them to a 2.0.X 
release.

 Idempotent create, append and delete operations.
 

 Key: HDFS-4849
 URL: https://issues.apache.org/jira/browse/HDFS-4849
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.0.4-alpha
Reporter: Konstantin Shvachko
Assignee: Konstantin Shvachko

 create, append and delete operations can be made idempotent. This will reduce 
 the chances of job or other application failures when the NN fails over.



[jira] [Updated] (HDFS-4754) Add an API in the namenode to mark a datanode as stale

2013-05-28 Thread Nicolas Liochon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Liochon updated HDFS-4754:
--

Status: Open  (was: Patch Available)

 Add an API in the namenode to mark a datanode as stale
 --

 Key: HDFS-4754
 URL: https://issues.apache.org/jira/browse/HDFS-4754
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client, namenode
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
Priority: Critical
 Attachments: 4754.v1.patch


 HDFS has had stale-datanode detection since HDFS-3703, based on a timeout 
 that defaults to 30s.
 There are two reasons to add an API to mark a node as stale even if the 
 timeout is not yet reached:
  1) ZooKeeper can detect that a client is dead at any moment. So, for HBase, 
 we sometimes start the recovery before a node is marked stale (even with 
 reasonable settings such as: stale: 20s; HBase ZK timeout: 30s).
  2) Some third parties could detect that a node is dead before the timeout, 
 hence saving us the cost of retrying. An example of such hardware is Arista, 
 presented here by [~tsuna] 
 http://tsunanet.net/~tsuna/fsf-hbase-meetup-april13.pdf, and confirmed in 
 HBASE-6290.
 As usual, even if the node is dead it can come back before the 10-minute 
 limit. So I would propose to set a time bound. The API would be
 namenode.markStale(String ipAddress, int port, long durationInMs);
 After durationInMs, the namenode would again rely only on its heartbeat to 
 decide.
 Thoughts?
 If there are no objections, and if nobody on the hdfs dev team has the time 
 to spend on it, I will give it a try for branch 2 & 3.



[jira] [Updated] (HDFS-4754) Add an API in the namenode to mark a datanode as stale

2013-05-28 Thread Nicolas Liochon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Liochon updated HDFS-4754:
--

Status: Patch Available  (was: Open)

 Add an API in the namenode to mark a datanode as stale
 --

 Key: HDFS-4754
 URL: https://issues.apache.org/jira/browse/HDFS-4754
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client, namenode
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
Priority: Critical
 Attachments: 4754.v1.patch, 4754.v2.patch


 HDFS has had stale-datanode detection since HDFS-3703, based on a timeout 
 that defaults to 30s.
 There are two reasons to add an API to mark a node as stale even if the 
 timeout is not yet reached:
  1) ZooKeeper can detect that a client is dead at any moment. So, for HBase, 
 we sometimes start the recovery before a node is marked stale (even with 
 reasonable settings such as: stale: 20s; HBase ZK timeout: 30s).
  2) Some third parties could detect that a node is dead before the timeout, 
 hence saving us the cost of retrying. An example of such hardware is Arista, 
 presented here by [~tsuna] 
 http://tsunanet.net/~tsuna/fsf-hbase-meetup-april13.pdf, and confirmed in 
 HBASE-6290.
 As usual, even if the node is dead it can come back before the 10-minute 
 limit. So I would propose to set a time bound. The API would be
 namenode.markStale(String ipAddress, int port, long durationInMs);
 After durationInMs, the namenode would again rely only on its heartbeat to 
 decide.
 Thoughts?
 If there are no objections, and if nobody on the hdfs dev team has the time 
 to spend on it, I will give it a try for branch 2 & 3.



[jira] [Updated] (HDFS-4754) Add an API in the namenode to mark a datanode as stale

2013-05-28 Thread Nicolas Liochon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Liochon updated HDFS-4754:
--

Attachment: 4754.v2.patch

 Add an API in the namenode to mark a datanode as stale
 --

 Key: HDFS-4754
 URL: https://issues.apache.org/jira/browse/HDFS-4754
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client, namenode
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
Priority: Critical
 Attachments: 4754.v1.patch, 4754.v2.patch


 HDFS has had stale-datanode detection since HDFS-3703, based on a timeout 
 that defaults to 30s.
 There are two reasons to add an API to mark a node as stale even if the 
 timeout is not yet reached:
  1) ZooKeeper can detect that a client is dead at any moment. So, for HBase, 
 we sometimes start the recovery before a node is marked stale (even with 
 reasonable settings such as: stale: 20s; HBase ZK timeout: 30s).
  2) Some third parties could detect that a node is dead before the timeout, 
 hence saving us the cost of retrying. An example of such hardware is Arista, 
 presented here by [~tsuna] 
 http://tsunanet.net/~tsuna/fsf-hbase-meetup-april13.pdf, and confirmed in 
 HBASE-6290.
 As usual, even if the node is dead it can come back before the 10-minute 
 limit. So I would propose to set a time bound. The API would be
 namenode.markStale(String ipAddress, int port, long durationInMs);
 After durationInMs, the namenode would again rely only on its heartbeat to 
 decide.
 Thoughts?
 If there are no objections, and if nobody on the hdfs dev team has the time 
 to spend on it, I will give it a try for branch 2 & 3.



[jira] [Commented] (HDFS-4754) Add an API in the namenode to mark a datanode as stale

2013-05-28 Thread Nicolas Liochon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668338#comment-13668338
 ] 

Nicolas Liochon commented on HDFS-4754:
---

v2 takes the comments above into account. If the method is called with a 
duration of zero, we use the configured stale-node duration.
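
As a hypothetical usage sketch (the method and its zero-means-default behavior 
are from this proposal, not a committed interface; the config key named below 
is the existing stale-interval setting):
{code}
// An external detector (e.g. HBase or a switch-level monitor) marks a
// datanode stale for a bounded period.
namenode.markStale("10.10.10.151", 50010, 60000L); // stale for the next 60 s
// Duration of zero: fall back to the configured
// dfs.namenode.stale.datanode.interval.
namenode.markStale("10.10.10.151", 50010, 0L);
{code}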

 Add an API in the namenode to mark a datanode as stale
 --

 Key: HDFS-4754
 URL: https://issues.apache.org/jira/browse/HDFS-4754
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client, namenode
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
Priority: Critical
 Attachments: 4754.v1.patch, 4754.v2.patch


 HDFS has had stale-datanode detection since HDFS-3703, based on a timeout 
 that defaults to 30s.
 There are two reasons to add an API to mark a node as stale even if the 
 timeout is not yet reached:
  1) ZooKeeper can detect that a client is dead at any moment. So, for HBase, 
 we sometimes start the recovery before a node is marked stale (even with 
 reasonable settings such as: stale: 20s; HBase ZK timeout: 30s).
  2) Some third parties could detect that a node is dead before the timeout, 
 hence saving us the cost of retrying. An example of such hardware is Arista, 
 presented here by [~tsuna] 
 http://tsunanet.net/~tsuna/fsf-hbase-meetup-april13.pdf, and confirmed in 
 HBASE-6290.
 As usual, even if the node is dead it can come back before the 10-minute 
 limit. So I would propose to set a time bound. The API would be
 namenode.markStale(String ipAddress, int port, long durationInMs);
 After durationInMs, the namenode would again rely only on its heartbeat to 
 decide.
 Thoughts?
 If there are no objections, and if nobody on the hdfs dev team has the time 
 to spend on it, I will give it a try for branch 2 & 3.



[jira] [Created] (HDFS-4859) Add timeout in FileJournalManager

2013-05-28 Thread Kihwal Lee (JIRA)
Kihwal Lee created HDFS-4859:


 Summary: Add timeout in FileJournalManager
 Key: HDFS-4859
 URL: https://issues.apache.org/jira/browse/HDFS-4859
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, namenode
Affects Versions: 2.0.4-alpha
Reporter: Kihwal Lee


Due to the absence of an explicit timeout in FileJournalManager, error 
conditions that incur a long delay (usually until the driver times out) can 
make the namenode unresponsive for a long time. This directly affects the NN's 
failure detection latency, which is critical in HA.




[jira] [Updated] (HDFS-4696) Branch 0.23 Patch for Block Replication Policy Implementation May Skip Higher-Priority Blocks for Lower-Priority Blocks

2013-05-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated HDFS-4696:


Target Version/s: 0.23.9  (was: 0.23.8)

 Branch 0.23 Patch for Block Replication Policy Implementation May Skip 
 Higher-Priority Blocks for Lower-Priority Blocks
 -

 Key: HDFS-4696
 URL: https://issues.apache.org/jira/browse/HDFS-4696
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.23.5
Reporter: Derek Dagit
Assignee: Derek Dagit

 This JIRA tracks the solution to HDFS-4366 for the 0.23 branch.



[jira] [Updated] (HDFS-4576) Webhdfs authentication issues

2013-05-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated HDFS-4576:


Target Version/s: 3.0.0, 2.0.5-beta, 0.23.9  (was: 3.0.0, 2.0.5-beta, 
0.23.8)

 Webhdfs authentication issues
 -

 Key: HDFS-4576
 URL: https://issues.apache.org/jira/browse/HDFS-4576
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: webhdfs
Affects Versions: 2.0.0-alpha, 3.0.0, 0.23.7
Reporter: Daryn Sharp
Assignee: Daryn Sharp

 Umbrella jira to track the webhdfs authentication issues as subtasks.



[jira] [Updated] (HDFS-4587) Webhdfs secure clients are incompatible with non-secure NN

2013-05-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated HDFS-4587:


Target Version/s: 3.0.0, 2.0.5-beta, 0.23.9  (was: 3.0.0, 2.0.5-beta, 
0.23.8)

 Webhdfs secure clients are incompatible with non-secure NN
 --

 Key: HDFS-4587
 URL: https://issues.apache.org/jira/browse/HDFS-4587
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode, webhdfs
Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha
Reporter: Daryn Sharp

 A secure webhdfs client will receive an exception from a non-secure NN.  For 
 a NN in non-secure mode, {{FSNamesystem#getDelegationToken}} will return 
 null to indicate no token is required.  Hdfs will send back the null to the 
 client, but webhdfs uses {{DelegationTokenSecretManager.createCredentials}} 
 which instead throws an exception.
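 A minimal sketch of the mismatch (names are those mentioned above; the 
 comments describe the behavior, not a committed fix):
 {code}
 // RPC path on a non-secure NN: a null token means "no token required".
 Token<DelegationTokenIdentifier> t = namesystem.getDelegationToken(renewer);
 // webhdfs path: DelegationTokenSecretManager.createCredentials(...) throws
 // for a non-secure NN instead of conveying "no token required", so a secure
 // webhdfs client sees an exception.
 {code}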



[jira] [Commented] (HDFS-4780) Use the correct relogin method for services

2013-05-28 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668381#comment-13668381
 ] 

Kihwal Lee commented on HDFS-4780:
--

+1 the patch looks good. 

 Use the correct relogin method for services
 ---

 Key: HDFS-4780
 URL: https://issues.apache.org/jira/browse/HDFS-4780
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0, 2.0.5-beta, 0.23.8
Reporter: Kihwal Lee
Assignee: Robert Parker
Priority: Minor
 Attachments: HDFS-4780-branch0.23v1.patch, HDFS-4780v1.patch


 A number of components call reloginFromKeytab() before making requests. For 
 StandbyCheckpointer and SecondaryNameNode, where this can be called 
 frequently, it generates many WARN messages like this:
 WARN security.UserGroupInformation: Not attempting to re-login since the last 
 re-login was attempted less than 600 seconds before.
 Other than these messages, it doesn't do anything wrong. But it would be nice 
 if these callers were changed to call checkTGTAndReloginFromKeytab() to avoid 
 the potentially misleading WARN messages.
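 A minimal sketch of the suggested change (illustrative only; the affected 
 callers are the components listed above):
 {code}
 import org.apache.hadoop.security.UserGroupInformation;

 // Instead of calling reloginFromKeytab() before every request, periodic
 // services would call the checking variant, which is a no-op while the
 // current TGT is still fresh and therefore avoids the WARN message.
 UserGroupInformation ugi = UserGroupInformation.getLoginUser();
 ugi.checkTGTAndReloginFromKeytab();
 {code}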



[jira] [Updated] (HDFS-4780) Use the correct relogin method for services

2013-05-28 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-4780:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Thanks for working on the patch, Rob.


 Use the correct relogin method for services
 ---

 Key: HDFS-4780
 URL: https://issues.apache.org/jira/browse/HDFS-4780
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0, 2.0.5-beta, 0.23.8
Reporter: Kihwal Lee
Assignee: Robert Parker
Priority: Minor
 Fix For: 3.0.0, 2.0.5-beta

 Attachments: HDFS-4780-branch0.23v1.patch, HDFS-4780v1.patch


 A number of components call reloginFromKeytab() before making requests. For 
 StandbyCheckpointer and SecondaryNameNode, where this can be called 
 frequently, it generates many WARN messages like this:
 WARN security.UserGroupInformation: Not attempting to re-login since the last 
 re-login was attempted less than 600 seconds before.
 Other than these messages, it doesn't do anything wrong. But it would be nice 
 if these callers were changed to call checkTGTAndReloginFromKeytab() to avoid 
 the potentially misleading WARN messages.



[jira] [Updated] (HDFS-4780) Use the correct relogin method for services

2013-05-28 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-4780:
-

Fix Version/s: 2.0.5-beta
   3.0.0
 Hadoop Flags: Reviewed

I've committed this to trunk and branch-2.

 Use the correct relogin method for services
 ---

 Key: HDFS-4780
 URL: https://issues.apache.org/jira/browse/HDFS-4780
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0, 2.0.5-beta, 0.23.8
Reporter: Kihwal Lee
Assignee: Robert Parker
Priority: Minor
 Fix For: 3.0.0, 2.0.5-beta

 Attachments: HDFS-4780-branch0.23v1.patch, HDFS-4780v1.patch


 A number of components call reloginFromKeytab() before making requests. For 
 StandbyCheckpointer and SecondaryNameNode, where this can be called 
 frequently, it generates many WARN messages like this:
 WARN security.UserGroupInformation: Not attempting to re-login since the last 
 re-login was attempted less than 600 seconds before.
 Other than these messages, it doesn't do anything wrong. But it would be nice 
 if these callers were changed to call checkTGTAndReloginFromKeytab() to avoid 
 the potentially misleading WARN messages.



[jira] [Commented] (HDFS-4780) Use the correct relogin method for services

2013-05-28 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668387#comment-13668387
 ] 

Daryn Sharp commented on HDFS-4780:
---

The change to the callers to invoke the check version seems ok.

It looks like the more appropriate UGI change may be to modify 
{{reloginFromKeytab}} to switch the order of {{hasSufficientTimeElapsed}} 
(which generates the warning) and {{getRefreshTime}} (which short-circuits the 
renew attempt).

On a related note, the two relogin methods appear equivalent aside from the 
{{hasSufficientTimeElapsed}} check.  It would seem that only 
{{checkTGTAndReloginFromKeytab}} should be checking {{getRefreshTime}}.  
{{reloginFromKeytab}} should probably unconditionally re-acquire a TGT, and 
hence not check {{hasSufficientTimeElapsed}}.

 Use the correct relogin method for services
 ---

 Key: HDFS-4780
 URL: https://issues.apache.org/jira/browse/HDFS-4780
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0, 2.0.5-beta, 0.23.8
Reporter: Kihwal Lee
Assignee: Robert Parker
Priority: Minor
 Fix For: 3.0.0, 2.0.5-beta

 Attachments: HDFS-4780-branch0.23v1.patch, HDFS-4780v1.patch


 A number of components call reloginFromKeytab() before making requests. For 
 StandbyCheckpointer and SecondaryNameNode, where this can be called 
 frequently, it generates many WARN messages like this:
 WARN security.UserGroupInformation: Not attempting to re-login since the last 
 re-login was attempted less than 600 seconds before.
 Other than these messages, it doesn't do anything wrong. But it would be nice 
 if these callers were changed to call checkTGTAndReloginFromKeytab() to avoid 
 the potentially misleading WARN messages.



[jira] [Commented] (HDFS-4780) Use the correct relogin method for services

2013-05-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668388#comment-13668388
 ] 

Hudson commented on HDFS-4780:
--

Integrated in Hadoop-trunk-Commit #3794 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3794/])
HDFS-4780. Use the correct relogin method for services. Contributed by 
Robert Parker. (Revision 1486974)

 Result = SUCCESS
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1486974
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/HftpFileSystem.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GetImageServlet.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java


 Use the correct relogin method for services
 ---

 Key: HDFS-4780
 URL: https://issues.apache.org/jira/browse/HDFS-4780
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0, 2.0.5-beta, 0.23.8
Reporter: Kihwal Lee
Assignee: Robert Parker
Priority: Minor
 Fix For: 3.0.0, 2.0.5-beta

 Attachments: HDFS-4780-branch0.23v1.patch, HDFS-4780v1.patch


 A number of components call reloginFromKeytab() before making requests. For 
 StandbyCheckpointer and SecondaryNameNode, where this can be called 
 frequently, it generates many WARN messages like this:
 WARN security.UserGroupInformation: Not attempting to re-login since the last 
 re-login was attempted less than 600 seconds before.
 Other than these messages, it doesn't do anything wrong. But it would be nice 
 if these callers were changed to call checkTGTAndReloginFromKeytab() to avoid 
 the potentially misleading WARN messages.



[jira] [Updated] (HDFS-4860) Add additional attributes to JMX beans

2013-05-28 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik updated HDFS-4860:
-

Affects Version/s: (was: 2.0.4-alpha)
   2.0.5-beta
   3.0.0
   0.20.204.1

 Add additional attributes to JMX beans
 --

 Key: HDFS-4860
 URL: https://issues.apache.org/jira/browse/HDFS-4860
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 0.20.204.1, 3.0.0, 2.0.5-beta
Reporter: Trevor Lorimer
Priority: Trivial
 Attachments: 0001-Hadoop-namenode-JMX-metrics-update.patch


 Currently the JMX bean returns much of the data contained on the HDFS Health 
 webpage (dfsHealth.html). However, several other attributes need to be added.
 I intend to add the following items to the appropriate bean (shown in 
 parentheses); a read-side sketch follows the list:
 Started time (NameNodeInfo),
 Compiled info (NameNodeInfo),
 Jvm MaxHeap, MaxNonHeap (JvmMetrics)
 Node Usage stats (i.e. Min, Median, Max, stdev) (NameNodeInfo),
 Count of decommissioned Live and Dead nodes (FSNamesystemState),
 Journal Status (NodeNameInfo)
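 For illustration, once exposed, these attributes could be read off the 
 existing NameNode MBeans in the usual way (the bean name below already 
 exists; the attribute name stands in for the proposed "started time" and is 
 hypothetical until the patch lands):
 {code}
 import java.lang.management.ManagementFactory;
 import javax.management.MBeanServer;
 import javax.management.ObjectName;

 MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
 ObjectName nnInfo = new ObjectName("Hadoop:service=NameNode,name=NameNodeInfo");
 Object startedTime = mbs.getAttribute(nnInfo, "NNStarted"); // hypothetical attribute
 {code}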



[jira] [Moved] (HDFS-4860) Add additional attributes to JMX beans

2013-05-28 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik moved HADOOP-9596 to HDFS-4860:
--

  Component/s: (was: fs)
   namenode
Affects Version/s: (was: 2.0.4-alpha)
   2.0.4-alpha
  Key: HDFS-4860  (was: HADOOP-9596)
  Project: Hadoop HDFS  (was: Hadoop Common)

 Add additional attributes to JMX beans
 --

 Key: HDFS-4860
 URL: https://issues.apache.org/jira/browse/HDFS-4860
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.0.4-alpha
Reporter: Trevor Lorimer
Priority: Trivial
 Attachments: 0001-Hadoop-namenode-JMX-metrics-update.patch


 Currently the JMX bean returns much of the data contained on the HDFS Health 
 webpage (dfsHealth.html). However, several other attributes need to be added.
 I intend to add the following items to the appropriate bean (shown in 
 parentheses):
 Started time (NameNodeInfo),
 Compiled info (NameNodeInfo),
 Jvm MaxHeap, MaxNonHeap (JvmMetrics)
 Node Usage stats (i.e. Min, Median, Max, stdev) (NameNodeInfo),
 Count of decommissioned Live and Dead nodes (FSNamesystemState),
 Journal Status (NodeNameInfo)



[jira] [Updated] (HDFS-4860) Add additional attributes to JMX beans

2013-05-28 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik updated HDFS-4860:
-

Priority: Major  (was: Trivial)

 Add additional attributes to JMX beans
 --

 Key: HDFS-4860
 URL: https://issues.apache.org/jira/browse/HDFS-4860
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 0.20.204.1, 3.0.0, 2.0.5-beta
Reporter: Trevor Lorimer
 Attachments: 0001-Hadoop-namenode-JMX-metrics-update.patch


 Currently the JMX bean returns much of the data contained on the HDFS Health 
 webpage (dfsHealth.html). However, several other attributes need to be added.
 I intend to add the following items to the appropriate bean (shown in 
 parentheses):
 Started time (NameNodeInfo),
 Compiled info (NameNodeInfo),
 Jvm MaxHeap, MaxNonHeap (JvmMetrics)
 Node Usage stats (i.e. Min, Median, Max, stdev) (NameNodeInfo),
 Count of decommissioned Live and Dead nodes (FSNamesystemState),
 Journal Status (NodeNameInfo)



[jira] [Commented] (HDFS-4860) Add additional attributes to JMX beans

2013-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668408#comment-13668408
 ] 

Hadoop QA commented on HDFS-4860:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12584618/0001-Hadoop-namenode-JMX-metrics-update.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build///console

This message is automatically generated.

 Add additional attributes to JMX beans
 --

 Key: HDFS-4860
 URL: https://issues.apache.org/jira/browse/HDFS-4860
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 0.20.204.1, 3.0.0, 2.0.5-beta
Reporter: Trevor Lorimer
 Attachments: 0001-Hadoop-namenode-JMX-metrics-update.patch


 Currently the JMX bean returns much of the data contained on the HDFS Health 
 webpage (dfsHealth.html). However, several other attributes need to be added.
 I intend to add the following items to the appropriate bean (shown in 
 parentheses):
 Started time (NameNodeInfo),
 Compiled info (NameNodeInfo),
 Jvm MaxHeap, MaxNonHeap (JvmMetrics)
 Node Usage stats (i.e. Min, Median, Max, stdev) (NameNodeInfo),
 Count of decommissioned Live and Dead nodes (FSNamesystemState),
 Journal Status (NodeNameInfo)



[jira] [Commented] (HDFS-4754) Add an API in the namenode to mark a datanode as stale

2013-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668425#comment-13668425
 ] 

Hadoop QA commented on HDFS-4754:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12585034/4754.v2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/4443//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4443//console

This message is automatically generated.

 Add an API in the namenode to mark a datanode as stale
 --

 Key: HDFS-4754
 URL: https://issues.apache.org/jira/browse/HDFS-4754
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client, namenode
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
Priority: Critical
 Attachments: 4754.v1.patch, 4754.v2.patch


 HDFS has had stale-datanode detection since HDFS-3703, based on a timeout 
 that defaults to 30s.
 There are two reasons to add an API to mark a node as stale even if the 
 timeout is not yet reached:
  1) ZooKeeper can detect that a client is dead at any moment. So, for HBase, 
 we sometimes start the recovery before a node is marked stale (even with 
 reasonable settings such as: stale: 20s; HBase ZK timeout: 30s).
  2) Some third parties could detect that a node is dead before the timeout, 
 hence saving us the cost of retrying. An example of such hardware is Arista, 
 presented here by [~tsuna] 
 http://tsunanet.net/~tsuna/fsf-hbase-meetup-april13.pdf, and confirmed in 
 HBASE-6290.
 As usual, even if the node is dead it can come back before the 10-minute 
 limit. So I would propose to set a time bound. The API would be
 namenode.markStale(String ipAddress, int port, long durationInMs);
 After durationInMs, the namenode would again rely only on its heartbeat to 
 decide.
 Thoughts?
 If there are no objections, and if nobody on the hdfs dev team has the time 
 to spend on it, I will give it a try for branch 2 & 3.



[jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager

2013-05-28 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668447#comment-13668447
 ] 

Todd Lipcon commented on HDFS-4859:
---

Hey Kihwal. Are you planning on using NFS based HA? I'd highly recommend using 
QJM instead -- it has timeout features and has been much more reliable for us 
in production clusters.

 Add timeout in FileJournalManager
 -

 Key: HDFS-4859
 URL: https://issues.apache.org/jira/browse/HDFS-4859
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, namenode
Affects Versions: 2.0.4-alpha
Reporter: Kihwal Lee

 Due to the absence of an explicit timeout in FileJournalManager, error 
 conditions that incur a long delay (usually until the driver times out) can 
 make the namenode unresponsive for a long time. This directly affects the 
 NN's failure detection latency, which is critical in HA.



[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2013-05-28 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-4858:
--

Target Version/s: 2.0.5-beta
   Fix Version/s: (was: 2.0.5-beta)
  (was: 3.0.0)

 HDFS DataNode to NameNode RPC should timeout
 

 Key: HDFS-4858
 URL: https://issues.apache.org/jira/browse/HDFS-4858
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.0.0, 2.0.5-beta, 2.0.4-alpha, 2.0.4.1-alpha
 Environment: Redhat/CentOS 6.4 64 bit Linux
Reporter: Jagane Sundar
Priority: Minor

 The DataNode is configured with ipc.client.ping false and ipc.ping.interval 
 14000. This configuration means that the IPC Client (DataNode, in this case) 
 should time out in 14000 milliseconds (14 seconds) if the Standby NameNode 
 does not respond to a sendHeartbeat.
 What we observe is this: If the Standby NameNode happens to reboot for any 
 reason, the DataNodes that are heartbeating to this Standby get stuck forever 
 while trying to sendHeartbeat. See Stack trace included below. When the 
 Standby NameNode comes back up, we find that the DataNode never re-registers 
 with the Standby NameNode. Thereafter failover completely fails.
 The desired behavior is that the DataNode's sendHeartbeat should time out in 
 14 seconds, and keep retrying until the Standby NameNode comes back up. When 
 it does, the DataNode should reconnect, re-register, and offer service.
 Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the 
 method createNamenode should use RPC.getProtocolProxy and not RPC.getProxy to 
 create the DatanodeProtocolPB object.
 Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:
 Thread 25 (DataNode: [file:///opt/hadoop/data]  heartbeating to 
 vmhost6-vm1/10.10.10.151:8020):
   State: WAITING
   Blocked count: 23843
   Waited count: 45676
   Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
   Stack:
 java.lang.Object.wait(Native Method)
 java.lang.Object.wait(Object.java:485)
 org.apache.hadoop.ipc.Client.call(Client.java:1220)
 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
 sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
 sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 java.lang.reflect.Method.invoke(Method.java:597)
 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
 sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
 
 org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
 java.lang.Thread.run(Thread.java:662)
 DataNode RPC to Standby NameNode never times out. 



[jira] [Updated] (HDFS-4849) Idempotent create, append and delete operations.

2013-05-28 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-4849:
--

Issue Type: Improvement  (was: Bug)

 Idempotent create, append and delete operations.
 

 Key: HDFS-4849
 URL: https://issues.apache.org/jira/browse/HDFS-4849
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.0.4-alpha
Reporter: Konstantin Shvachko
Assignee: Konstantin Shvachko

 create, append and delete operations can be made idempotent. This will reduce 
 the chances of job or other application failures when the NN fails over.



[jira] [Resolved] (HDFS-4184) Add the ability for Client to provide more hint information for DataNode to manage the OS buffer cache more accurate

2013-05-28 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-4184.
---

Resolution: Duplicate

 Add the ability for Client to provide more hint information for DataNode to 
 manage the OS buffer cache more accurate
 

 Key: HDFS-4184
 URL: https://issues.apache.org/jira/browse/HDFS-4184
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: binlijin

 HDFS now has the ability to use posix_fadvise and sync_data_range syscalls to 
 manage the OS buffer cache.
 {code}
 When HBase reads the hlog, we can set dfs.datanode.drop.cache.behind.reads 
 to true to drop data out of the buffer cache when performing sequential reads.
 When HBase writes the hlog, we can set dfs.datanode.drop.cache.behind.writes 
 to true to drop data out of the buffer cache after writing.
 When HBase reads hfiles during compaction, we can set 
 dfs.datanode.readahead.bytes to a non-zero value to trigger readahead for 
 sequential reads, and also set dfs.datanode.drop.cache.behind.reads to true 
 to drop data out of the buffer cache when performing sequential reads.
 and so on... 
 {code}
 Currently we can only set these features globally in the datanode; we should 
 be able to set them per session.
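 As a sketch, here are the existing datanode-global knobs referenced above 
 (the point of this issue being that they cannot yet be set per client/session):
 {code}
 import org.apache.hadoop.conf.Configuration;

 // Today these apply to every reader/writer on the datanode.
 Configuration conf = new Configuration();
 conf.setBoolean("dfs.datanode.drop.cache.behind.reads", true);  // drop cache after sequential reads
 conf.setBoolean("dfs.datanode.drop.cache.behind.writes", true); // drop cache after writes
 conf.setLong("dfs.datanode.readahead.bytes", 4L * 1024 * 1024); // readahead for sequential reads
 {code}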



[jira] [Commented] (HDFS-4847) hdfs dfs -count of a .snapshot directory fails claiming file does not exist

2013-05-28 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668488#comment-13668488
 ] 

Jing Zhao commented on HDFS-4847:
-

Yes, I will check the snapshot document to make sure we mention that .snapshot 
is not a valid directory. 
[~schu], I will mark this as invalid for now. Feel free to create a new jira if 
you think providing a more accurate error message to end users is necessary.

 hdfs dfs -count of a .snapshot directory fails claiming file does not exist
 ---

 Key: HDFS-4847
 URL: https://issues.apache.org/jira/browse/HDFS-4847
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Stephen Chu
  Labels: snapshot, snapshots

 I successfully allow snapshots for /tmp and create three snapshots. I verify 
 that the three snapshots are in /tmp/.snapshot.
 However, when I attempt _hdfs dfs -count /tmp/.snapshot_ I get a file does 
 not exist exception.
 Running -count on /tmp finds /tmp successfully.
 {code}
 schu-mbp:~ schu$ hadoop fs -ls /tmp/.snapshot
 2013-05-24 10:27:10,070 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 Found 3 items
 drwxr-xr-x   - schu supergroup  0 2013-05-24 10:26 /tmp/.snapshot/s1
 drwxr-xr-x   - schu supergroup  0 2013-05-24 10:27 /tmp/.snapshot/s2
 drwxr-xr-x   - schu supergroup  0 2013-05-24 10:27 /tmp/.snapshot/s3
 schu-mbp:~ schu$ hdfs dfs -count /tmp
 2013-05-24 10:27:20,510 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
   120  0 /tmp
 schu-mbp:~ schu$ hdfs dfs -count /tmp/.snapshot
 2013-05-24 10:27:30,397 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 count: File does not exist: /tmp/.snapshot
 schu-mbp:~ schu$ hdfs dfs -count -q /tmp/.snapshot
 2013-05-24 10:28:23,252 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 count: File does not exist: /tmp/.snapshot
 schu-mbp:~ schu$
 {code}
 In the NN logs, I see:
 {code}
 2013-05-24 10:27:30,857 INFO  [IPC Server handler 6 on 8020] 
 FSNamesystem.audit (FSNamesystem.java:logAuditEvent(6143)) - allowed=true 
ugi=schu (auth:SIMPLE)  ip=/127.0.0.1   cmd=getfileinfo src=/tmp/.snapshot 
  dst=null perm=null
 2013-05-24 10:27:30,891 ERROR [IPC Server handler 7 on 8020] 
 security.UserGroupInformation (UserGroupInformation.java:doAs(1492)) - 
 PriviledgedActionException as:schu (auth:SIMPLE) 
 cause:java.io.FileNotFoundException: File does not exist: /tmp/.snapshot
 2013-05-24 10:27:30,891 INFO  [IPC Server handler 7 on 8020] ipc.Server 
 (Server.java:run(1864)) - IPC Server handler 7 on 8020, call 
 org.apache.hadoop.hdfs.protocol.ClientProtocol.getContentSummary from 
 127.0.0.1:49738: error: java.io.FileNotFoundException: File does not exist: 
 /tmp/.snapshot
 java.io.FileNotFoundException: File does not exist: /tmp/.snapshot
   at 
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2267)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:3188)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:829)
   at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:726)
   at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48057)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1033)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1842)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1838)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1836)
 {code}
 Likewise, the _hdfs dfs -du_ command fails with the same problem. 
 Hadoop version:
 {code}
 schu-mbp:~ schu$ hadoop version
 Hadoop 

[jira] [Resolved] (HDFS-4847) hdfs dfs -count of a .snapshot directory fails claiming file does not exist

2013-05-28 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao resolved HDFS-4847.
-

Resolution: Invalid

.snapshot is not a directory, thus commands such as count and du do not 
work on paths ending with .snapshot.

 hdfs dfs -count of a .snapshot directory fails claiming file does not exist
 ---

 Key: HDFS-4847
 URL: https://issues.apache.org/jira/browse/HDFS-4847
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Stephen Chu
  Labels: snapshot, snapshots

 I successfully allow snapshots for /tmp and create three snapshots. I verify 
 that the three snapshots are in /tmp/.snapshot.
 However, when I attempt _hdfs dfs -count /tmp/.snapshot_ I get a file does 
 not exist exception.
 Running -count on /tmp finds /tmp successfully.
 {code}
 schu-mbp:~ schu$ hadoop fs -ls /tmp/.snapshot
 2013-05-24 10:27:10,070 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 Found 3 items
 drwxr-xr-x   - schu supergroup  0 2013-05-24 10:26 /tmp/.snapshot/s1
 drwxr-xr-x   - schu supergroup  0 2013-05-24 10:27 /tmp/.snapshot/s2
 drwxr-xr-x   - schu supergroup  0 2013-05-24 10:27 /tmp/.snapshot/s3
 schu-mbp:~ schu$ hdfs dfs -count /tmp
 2013-05-24 10:27:20,510 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
   120  0 /tmp
 schu-mbp:~ schu$ hdfs dfs -count /tmp/.snapshot
 2013-05-24 10:27:30,397 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 count: File does not exist: /tmp/.snapshot
 schu-mbp:~ schu$ hdfs dfs -count -q /tmp/.snapshot
 2013-05-24 10:28:23,252 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 count: File does not exist: /tmp/.snapshot
 schu-mbp:~ schu$
 {code}
 In the NN logs, I see:
 {code}
 2013-05-24 10:27:30,857 INFO  [IPC Server handler 6 on 8020] 
 FSNamesystem.audit (FSNamesystem.java:logAuditEvent(6143)) - allowed=true 
ugi=schu (auth:SIMPLE)  ip=/127.0.0.1   cmd=getfileinfo src=/tmp/.snapshot 
  dst=null perm=null
 2013-05-24 10:27:30,891 ERROR [IPC Server handler 7 on 8020] 
 security.UserGroupInformation (UserGroupInformation.java:doAs(1492)) - 
 PriviledgedActionException as:schu (auth:SIMPLE) 
 cause:java.io.FileNotFoundException: File does not exist: /tmp/.snapshot
 2013-05-24 10:27:30,891 INFO  [IPC Server handler 7 on 8020] ipc.Server 
 (Server.java:run(1864)) - IPC Server handler 7 on 8020, call 
 org.apache.hadoop.hdfs.protocol.ClientProtocol.getContentSummary from 
 127.0.0.1:49738: error: java.io.FileNotFoundException: File does not exist: 
 /tmp/.snapshot
 java.io.FileNotFoundException: File does not exist: /tmp/.snapshot
   at 
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2267)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:3188)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:829)
   at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:726)
   at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48057)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1033)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1842)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1838)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1836)
 {code}
 Likewise, the _hdfs dfs -du_ command fails with the same problem. 
 Hadoop version:
 {code}
 schu-mbp:~ schu$ hadoop version
 Hadoop 3.0.0-SNAPSHOT
 Subversion git://github.com/apache/hadoop-common.git -r 
 ccaf5ea09118eedbe17fd3f5b3f0c516221dd613
 Compiled by schu on 2013-05-24T04:45Z
 From source with checksum 

[jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.

2013-05-28 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668497#comment-13668497
 ] 

Colin Patrick McCabe commented on HDFS-4849:


[~tlipcon] wrote:

bq. Without per-file leases, a second thread trying to create a file already 
being created would end up getting back the same block ID and causing some 
havoc.  One solution to this would be to include a nonce in the create() call, 
and store that in the INodeFileUnderConstruction, so that if you retry with the 
same nonce, it would identify it correctly as a retry.

If leases are done by inode ID rather than by path, this problem goes away.

I don't think delete can be made idempotent without changing the semantics in a 
major way.  At that point, it would be impossible to do things like have 
FSShell tell you whether your rm actually deleted anything, etc.  This isn't to 
mention things like concat...
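
Purely as an illustration of the quoted nonce idea (none of these names are 
real HDFS code; the accessor is hypothetical):
{code}
// If create() carried a client-supplied nonce stored in the
// INodeFileUnderConstruction, a repeated create() with the same nonce could
// be recognized as a retry rather than a conflicting second creator.
boolean isRetryOfSameCreate(INodeFileUnderConstruction pending, long nonce) {
  return pending != null && pending.getCreateNonce() == nonce; // getCreateNonce() is hypothetical
}
{code}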

 Idempotent create, append and delete operations.
 

 Key: HDFS-4849
 URL: https://issues.apache.org/jira/browse/HDFS-4849
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.0.4-alpha
Reporter: Konstantin Shvachko
Assignee: Konstantin Shvachko

 create, append and delete operations can be made idempotent. This will reduce 
 chances for a job or other app failures when NN fails over.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4857) Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into INodeMap when loading FSImage

2013-05-28 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-4857:


Attachment: HDFS-4857.002.patch

The failed test is also seen in HDFS-4840 and should be due to HDFS-3267. 

Updated the patch to add an extra check in the new unit test.

 Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into 
 INodeMap when loading FSImage
 --

 Key: HDFS-4857
 URL: https://issues.apache.org/jira/browse/HDFS-4857
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
  Labels: snapshots
 Attachments: HDFS-4857.001.patch, HDFS-4857.002.patch


 Snapshot.Root, though it is a subclass of INodeDirectory, is only used to 
 indicate the root of a snapshot. Meanwhile, 
 AbstractINodeDiff#snapshotINode is used as a copy recording the original 
 state of an INode. Thus we should not put them into INodeMap. 
 Currently, when loading the FSImage, we do not check the type of inode and wrongly 
 put these two types of nodes into INodeMap. This may replace the nodes that 
 should stay in INodeMap.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager

2013-05-28 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668529#comment-13668529
 ] 

Colin Patrick McCabe commented on HDFS-4859:


If the disk driver hangs on a synchronous {{write(2)}} or {{read(2)}}, it 
doesn't matter what the Java software did-- the operating system thread will be 
blocked.  This is why we recommended that people soft-mount the NFS directory 
when using NFS HA.

Todd's suggestion is the best, though-- just use QJM.

 Add timeout in FileJournalManager
 -

 Key: HDFS-4859
 URL: https://issues.apache.org/jira/browse/HDFS-4859
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, namenode
Affects Versions: 2.0.4-alpha
Reporter: Kihwal Lee

 Due to the absence of an explicit timeout in FileJournalManager, error conditions 
 that incur a long delay (usually until the driver timeout) can make the namenode 
 unresponsive for a long time. This directly affects the NN's failure detection 
 latency, which is critical in HA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager

2013-05-28 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668534#comment-13668534
 ] 

Kihwal Lee commented on HDFS-4859:
--

We will certainly use one of the HA-enabled journal managers in the future, but 
many users I've talked to want NFS-based as a first step. Even if QJM is used 
for the shared edits directory, local or NFS may still be used for storing 
extra copy of edits (as non-required resource). In this case, lack of timeout 
in FJM can affect HA with manual failover. Can health checks used with ZKFC 
detect I/O hang?

 Add timeout in FileJournalManager
 -

 Key: HDFS-4859
 URL: https://issues.apache.org/jira/browse/HDFS-4859
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, namenode
Affects Versions: 2.0.4-alpha
Reporter: Kihwal Lee

 Due to the absence of an explicit timeout in FileJournalManager, error conditions 
 that incur a long delay (usually until the driver timeout) can make the namenode 
 unresponsive for a long time. This directly affects the NN's failure detection 
 latency, which is critical in HA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4850) OfflineImageViewer fails on fsimage with empty file because of NegativeArraySizeException

2013-05-28 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-4850:


Attachment: HDFS-4850.000.patch

Initial patch fixing the reported bug. In my local test 
fsimage_008 can be read with the fix.

Will add more tests for OfflineImageViewer and upload a new patch later.

 OfflineImageViewer fails on fsimage with empty file because of 
 NegativeArraySizeException
 -

 Key: HDFS-4850
 URL: https://issues.apache.org/jira/browse/HDFS-4850
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: 3.0.0
Reporter: Stephen Chu
Assignee: Jing Zhao
  Labels: newbie
 Attachments: datadirs.tar.gz, fsimage_004, 
 fsimage_008, HDFS-4850.000.patch, oiv_out_1, oiv_out_2


 I deployed hadoop-trunk HDFS and created _/user/schu/_. I then forced a 
 checkpoint, fetched the fsimage, and ran the default OfflineImageViewer 
 successfully on the fsimage.
 {code}
 schu-mbp:~ schu$ hdfs oiv -i fsimage_004 -o oiv_out_1
 schu-mbp:~ schu$ cat oiv_out_1
 drwxr-xr-x  - schu supergroup  0 2013-05-24 16:59 /
 drwxr-xr-x  - schu supergroup  0 2013-05-24 16:59 /user
 drwxr-xr-x  - schu supergroup  0 2013-05-24 16:59 /user/schu
 schu-mbp:~ schu$ 
 {code}
 I then touched an empty file _/user/schu/testFile1_
 {code}
 schu-mbp:~ schu$ hadoop fs -lsr /
 lsr: DEPRECATED: Please use 'ls -R' instead.
 drwxr-xr-x   - schu supergroup  0 2013-05-24 16:59 /user
 drwxr-xr-x   - schu supergroup  0 2013-05-24 17:00 /user/schu
 -rw-r--r--   1 schu supergroup  0 2013-05-24 17:00 
 /user/schu/testFile1
 {code}
 and forced another checkpoint, fetched the fsimage, and reran the 
 OfflineImageViewer. I encountered a NegativeArraySizeException:
 {code}
 schu-mbp:~ schu$ hdfs oiv -i fsimage_008 -o oiv_out_2
 Input ended unexpectedly.
 2013-05-24 17:01:13,622 ERROR [main] offlineImageViewer.OfflineImageViewer 
 (OfflineImageViewer.java:go(140)) - image loading failed at offset 402
 Exception in thread "main" java.lang.NegativeArraySizeException
   at org.apache.hadoop.io.Text.readString(Text.java:458)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processPermission(ImageLoaderCurrent.java:370)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processINode(ImageLoaderCurrent.java:671)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processChildren(ImageLoaderCurrent.java:557)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:464)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:470)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:470)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processLocalNameINodesWithSnapshot(ImageLoaderCurrent.java:444)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processINodes(ImageLoaderCurrent.java:398)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.loadImage(ImageLoaderCurrent.java:199)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer.go(OfflineImageViewer.java:136)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer.main(OfflineImageViewer.java:260)
 {code}
 This is reproducible. I've reproduced this scenario after formatting HDFS and 
 restarting and touching an empty file _/testFile1_.
 Attached are the data dirs, the fsimage before creating the empty file 
 (fsimage_004) and the fsimage afterwards 
 (fsimage_008) and their outputs, oiv_out_1 and oiv_out_2 
 respectively.
 The oiv_out_2 does not include the empty _/user/schu/testFile1_.
 I don't run into this problem using hadoop-2.0.4-alpha.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4850) OfflineImageViewer fails on fsimage with empty file because of NegativeArraySizeException

2013-05-28 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668517#comment-13668517
 ] 

Jing Zhao commented on HDFS-4850:
-

Thanks for the testing and the report, Stephen! The error should be caused by a bug 
in ImageLoaderCurrent#processINode(...): {code}if (numBlocks > 0){code} should 
be {code}if (numBlocks >= 0){code}. I will upload a patch soon.

Also, the OfflineImageViewer requires more unit tests to test its correctness 
with the existence of snapshots in FSImage. I will add those unit tests in the 
same patch.
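
To make the off-by-one concrete, here is a minimal sketch of the corrected guard, assuming the image encodes a directory as numBlocks == -1 and a file, even an empty one, as numBlocks >= 0. The class, method, and block layout below are simplified stand-ins, not the actual ImageLoaderCurrent code.

{code}
import java.io.DataInput;
import java.io.IOException;

class BlockSectionSketch {
  // Reads the per-inode block section of a (simplified) image record.
  void readBlocks(DataInput in) throws IOException {
    int numBlocks = in.readInt();
    if (numBlocks >= 0) {        // was effectively "numBlocks > 0", which skipped empty files
      for (int i = 0; i < numBlocks; i++) {
        long blockId  = in.readLong();
        long numBytes = in.readLong();
        long genStamp = in.readLong();
        // ... record (blockId, numBytes, genStamp) for this file
      }
    }
    // A directory (numBlocks == -1) falls through without consuming block
    // records, so the stream stays aligned for the fields that follow.
  }
}
{code}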

 OfflineImageViewer fails on fsimage with empty file because of 
 NegativeArraySizeException
 -

 Key: HDFS-4850
 URL: https://issues.apache.org/jira/browse/HDFS-4850
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: 3.0.0
Reporter: Stephen Chu
Assignee: Jing Zhao
  Labels: newbie
 Attachments: datadirs.tar.gz, fsimage_004, 
 fsimage_008, oiv_out_1, oiv_out_2


 I deployed hadoop-trunk HDFS and created _/user/schu/_. I then forced a 
 checkpoint, fetched the fsimage, and ran the default OfflineImageViewer 
 successfully on the fsimage.
 {code}
 schu-mbp:~ schu$ hdfs oiv -i fsimage_004 -o oiv_out_1
 schu-mbp:~ schu$ cat oiv_out_1
 drwxr-xr-x  - schu supergroup  0 2013-05-24 16:59 /
 drwxr-xr-x  - schu supergroup  0 2013-05-24 16:59 /user
 drwxr-xr-x  - schu supergroup  0 2013-05-24 16:59 /user/schu
 schu-mbp:~ schu$ 
 {code}
 I then touched an empty file _/user/schu/testFile1_
 {code}
 schu-mbp:~ schu$ hadoop fs -lsr /
 lsr: DEPRECATED: Please use 'ls -R' instead.
 drwxr-xr-x   - schu supergroup  0 2013-05-24 16:59 /user
 drwxr-xr-x   - schu supergroup  0 2013-05-24 17:00 /user/schu
 -rw-r--r--   1 schu supergroup  0 2013-05-24 17:00 
 /user/schu/testFile1
 {code}
 and forced another checkpoint, fetched the fsimage, and reran the 
 OfflineImageViewer. I encountered a NegativeArraySizeException:
 {code}
 schu-mbp:~ schu$ hdfs oiv -i fsimage_008 -o oiv_out_2
 Input ended unexpectedly.
 2013-05-24 17:01:13,622 ERROR [main] offlineImageViewer.OfflineImageViewer 
 (OfflineImageViewer.java:go(140)) - image loading failed at offset 402
 Exception in thread "main" java.lang.NegativeArraySizeException
   at org.apache.hadoop.io.Text.readString(Text.java:458)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processPermission(ImageLoaderCurrent.java:370)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processINode(ImageLoaderCurrent.java:671)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processChildren(ImageLoaderCurrent.java:557)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:464)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:470)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:470)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processLocalNameINodesWithSnapshot(ImageLoaderCurrent.java:444)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processINodes(ImageLoaderCurrent.java:398)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.loadImage(ImageLoaderCurrent.java:199)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer.go(OfflineImageViewer.java:136)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer.main(OfflineImageViewer.java:260)
 {code}
 This is reproducible. I've reproduced this scenario after formatting HDFS and 
 restarting and touching an empty file _/testFile1_.
 Attached are the data dirs, the fsimage before creating the empty file 
 (fsimage_004) and the fsimage afterwards 
 (fsimage_008) and their outputs, oiv_out_1 and oiv_out_2 
 respectively.
 The oiv_out_2 does not include the empty _/user/schu/testFile1_.
 I don't run into this problem using hadoop-2.0.4-alpha.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (HDFS-4850) OfflineImageViewer fails on fsimage with empty file because of NegativeArraySizeException

2013-05-28 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao reassigned HDFS-4850:
---

Assignee: Jing Zhao

 OfflineImageViewer fails on fsimage with empty file because of 
 NegativeArraySizeException
 -

 Key: HDFS-4850
 URL: https://issues.apache.org/jira/browse/HDFS-4850
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: 3.0.0
Reporter: Stephen Chu
Assignee: Jing Zhao
  Labels: newbie
 Attachments: datadirs.tar.gz, fsimage_004, 
 fsimage_008, oiv_out_1, oiv_out_2


 I deployed hadoop-trunk HDFS and created _/user/schu/_. I then forced a 
 checkpoint, fetched the fsimage, and ran the default OfflineImageViewer 
 successfully on the fsimage.
 {code}
 schu-mbp:~ schu$ hdfs oiv -i fsimage_004 -o oiv_out_1
 schu-mbp:~ schu$ cat oiv_out_1
 drwxr-xr-x  - schu supergroup  0 2013-05-24 16:59 /
 drwxr-xr-x  - schu supergroup  0 2013-05-24 16:59 /user
 drwxr-xr-x  - schu supergroup  0 2013-05-24 16:59 /user/schu
 schu-mbp:~ schu$ 
 {code}
 I then touched an empty file _/user/schu/testFile1_
 {code}
 schu-mbp:~ schu$ hadoop fs -lsr /
 lsr: DEPRECATED: Please use 'ls -R' instead.
 drwxr-xr-x   - schu supergroup  0 2013-05-24 16:59 /user
 drwxr-xr-x   - schu supergroup  0 2013-05-24 17:00 /user/schu
 -rw-r--r--   1 schu supergroup  0 2013-05-24 17:00 
 /user/schu/testFile1
 {code}
 and forced another checkpoint, fetched the fsimage, and reran the 
 OfflineImageViewer. I encountered a NegativeArraySizeException:
 {code}
 schu-mbp:~ schu$ hdfs oiv -i fsimage_008 -o oiv_out_2
 Input ended unexpectedly.
 2013-05-24 17:01:13,622 ERROR [main] offlineImageViewer.OfflineImageViewer 
 (OfflineImageViewer.java:go(140)) - image loading failed at offset 402
 Exception in thread "main" java.lang.NegativeArraySizeException
   at org.apache.hadoop.io.Text.readString(Text.java:458)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processPermission(ImageLoaderCurrent.java:370)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processINode(ImageLoaderCurrent.java:671)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processChildren(ImageLoaderCurrent.java:557)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:464)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:470)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:470)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processLocalNameINodesWithSnapshot(ImageLoaderCurrent.java:444)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processINodes(ImageLoaderCurrent.java:398)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.loadImage(ImageLoaderCurrent.java:199)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer.go(OfflineImageViewer.java:136)
   at 
 org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer.main(OfflineImageViewer.java:260)
 {code}
 This is reproducible. I've reproduced this scenario after formatting HDFS and 
 restarting and touching an empty file _/testFile1_.
 Attached are the data dirs, the fsimage before creating the empty file 
 (fsimage_004) and the fsimage afterwards 
 (fsimage_008) and their outputs, oiv_out_1 and oiv_out_2 
 respectively.
 The oiv_out_2 does not include the empty _/user/schu/testFile1_.
 I don't run into this problem using hadoop-2.0.4-alpha.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (HDFS-4855) DFSOutputStream reference should be cleared from DFSClient#filesBeingWritten if the file closure fails.

2013-05-28 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe resolved HDFS-4855.


Resolution: Duplicate

 DFSOutputStream reference should be cleared from DFSClient#filesBeingWritten 
 if the file closure fails.
 ---

 Key: HDFS-4855
 URL: https://issues.apache.org/jira/browse/HDFS-4855
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 2.0.4-alpha
Reporter: Vinay
Assignee: Vinay
 Fix For: 2.0.4.1-alpha


 If closing the file fails due to some exception, then the {{DFSOutputStream}} 
 reference should be removed from {{DFSClient#filesBeingWritten}}; keeping it 
 around is useless and consumes memory.
 If the same client is used for a long time, there is a chance of the client 
 hitting an OOM because of this.
 The fix would be simple: 
 just cover the complete {{DFSOutputStream#close()}} in a try-finally and move 
 {{dfsClient.endFileLease(src);}} to the finally block.
 Any thoughts?
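
A rough sketch of the try/finally shape being described, not the committed change: flushAndCompleteFile() is a hypothetical stand-in for the existing close logic, and dfsClient and src are the fields the stream already holds.

{code}
@Override
public synchronized void close() throws IOException {
  try {
    flushAndCompleteFile();        // may throw, e.g. if completing the file fails
  } finally {
    // Always drop the stream from DFSClient#filesBeingWritten so a failed
    // close does not leak the reference (and memory) in long-lived clients.
    dfsClient.endFileLease(src);
  }
}
{code}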

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager

2013-05-28 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668555#comment-13668555
 ] 

Colin Patrick McCabe commented on HDFS-4859:


If the NameNode hangs, ZKFC will detect it.

 Add timeout in FileJournalManager
 -

 Key: HDFS-4859
 URL: https://issues.apache.org/jira/browse/HDFS-4859
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, namenode
Affects Versions: 2.0.4-alpha
Reporter: Kihwal Lee

 Due to the absence of an explicit timeout in FileJournalManager, error conditions 
 that incur a long delay (usually until the driver timeout) can make the namenode 
 unresponsive for a long time. This directly affects the NN's failure detection 
 latency, which is critical in HA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4857) Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into INodeMap when loading FSImage

2013-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668604#comment-13668604
 ] 

Hadoop QA commented on HDFS-4857:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12585056/HDFS-4857.002.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/4445//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4445//console

This message is automatically generated.

 Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into 
 INodeMap when loading FSImage
 --

 Key: HDFS-4857
 URL: https://issues.apache.org/jira/browse/HDFS-4857
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
  Labels: snapshots
 Attachments: HDFS-4857.001.patch, HDFS-4857.002.patch


 Snapshot.Root, though it is a subclass of INodeDirectory, is only used to 
 indicate the root of a snapshot. Meanwhile, 
 AbstractINodeDiff#snapshotINode is used as a copy recording the original 
 state of an INode. Thus we should not put them into INodeMap. 
 Currently, when loading the FSImage, we do not check the type of inode and wrongly 
 put these two types of nodes into INodeMap. This may replace the nodes that 
 should stay in INodeMap.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4827) Slight update to the implementation of API for handling favored nodes in DFSClient

2013-05-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668641#comment-13668641
 ] 

Hudson commented on HDFS-4827:
--

Integrated in Hadoop-trunk-Commit #3795 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3795/])
HDFS-4827. Slight update to the implementation of API for handling favored 
nodes in DFSClient. Contributed by Devaraj Das. (Revision 1487093)

 Result = SUCCESS
ddas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1487093
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java


 Slight update to the implementation of API for handling favored nodes in 
 DFSClient
 --

 Key: HDFS-4827
 URL: https://issues.apache.org/jira/browse/HDFS-4827
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.0.5-beta
Reporter: Devaraj Das
Assignee: Devaraj Das
 Fix For: 2.0.5-beta

 Attachments: hdfs-4827-1.txt


 Currently, the favoredNodes flavor of the DFSClient.create implementation 
 does a call to _inetSocketAddressInstance.getAddress().getHostAddress()_ This 
 wouldn't work if the inetSocketAddressInstance is unresolved (instance 
 created via InetSocketAddress.createUnresolved()). The DFSClient API should 
 handle both cases of favored-nodes' InetSocketAddresses (resolved/unresolved) 
 passed to it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4827) Slight update to the implementation of API for handling favored nodes in DFSClient

2013-05-28 Thread Devaraj Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj Das updated HDFS-4827:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2.

 Slight update to the implementation of API for handling favored nodes in 
 DFSClient
 --

 Key: HDFS-4827
 URL: https://issues.apache.org/jira/browse/HDFS-4827
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.0.5-beta
Reporter: Devaraj Das
Assignee: Devaraj Das
 Fix For: 2.0.5-beta

 Attachments: hdfs-4827-1.txt


 Currently, the favoredNodes flavor of the DFSClient.create implementation 
 does a call to _inetSocketAddressInstance.getAddress().getHostAddress()_ This 
 wouldn't work if the inetSocketAddressInstance is unresolved (instance 
 created via InetSocketAddress.createUnresolved()). The DFSClient API should 
 handle both cases of favored-nodes' InetSocketAddresses (resolved/unresolved) 
 passed to it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4861) BlockPlacementPolicyDefault does not consider decommissioning nodes

2013-05-28 Thread Kihwal Lee (JIRA)
Kihwal Lee created HDFS-4861:


 Summary: BlockPlacementPolicyDefault does not consider 
decommissioning nodes
 Key: HDFS-4861
 URL: https://issues.apache.org/jira/browse/HDFS-4861
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 0.23.7, 2.0.5-beta
Reporter: Kihwal Lee


getMaxNodesPerRack() calculates the max replicas/rack like this:

{code}
int maxNodesPerRack = (totalNumOfReplicas-1)/clusterMap.getNumOfRacks()+2;
{code}

Since this does not consider the racks that are being decommissioned and the 
decommissioning state is only checked later in isGoodTarget(), certain blocks 
are not replicated even when there are many racks and nodes.
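
To see the effect numerically, here is a small worked example. It is plain arithmetic around the formula above, not the production code, and the adjustedMax variant at the end is only one possible direction, not an existing API.

{code}
// Replication 3 on a cluster reporting 10 racks:
int totalNumOfReplicas = 3;
int numOfRacks = 10;                                              // includes decommissioning racks
int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 2;  // == 2

// If 9 of the 10 racks are fully decommissioning, only one rack can actually
// accept replicas, but the cap of 2 per rack still applies, so the third
// replica can never be placed.

// One possible direction: derive the cap from racks that can still take replicas.
int usableRacks = 1;                                              // racks with a non-decommissioning node
int adjustedMax = (totalNumOfReplicas - 1) / Math.max(usableRacks, 1) + 2;  // == 4
{code}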

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4861) BlockPlacementPolicyDefault does not consider decommissioning racks

2013-05-28 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-4861:
-

Summary: BlockPlacementPolicyDefault does not consider decommissioning 
racks  (was: BlockPlacementPolicyDefault does not consider decommissioning 
nodes)

 BlockPlacementPolicyDefault does not consider decommissioning racks
 ---

 Key: HDFS-4861
 URL: https://issues.apache.org/jira/browse/HDFS-4861
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 0.23.7, 2.0.5-beta
Reporter: Kihwal Lee

 getMaxNodesPerRack() calculates the max replicas/rack like this:
 {code}
 int maxNodesPerRack = (totalNumOfReplicas-1)/clusterMap.getNumOfRacks()+2;
 {code}
 Since this does not consider the racks that are being decommissioned and the 
 decommissioning state is only checked later in isGoodTarget(), certain blocks 
 are not replicated even when there are many racks and nodes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4754) Add an API in the namenode to mark a datanode as stale

2013-05-28 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668718#comment-13668718
 ] 

Nick Dimiduk commented on HDFS-4754:


bq. There is a setting to disable it if needed, by configuring a max duration 
to zero.

Can this be documented somewhere, {{hdfs-default.xml}} for example?

 Add an API in the namenode to mark a datanode as stale
 --

 Key: HDFS-4754
 URL: https://issues.apache.org/jira/browse/HDFS-4754
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client, namenode
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
Priority: Critical
 Attachments: 4754.v1.patch, 4754.v2.patch


 There is a detection of the stale datanodes in HDFS since HDFS-3703, with a 
 timeout, defaulted to 30s.
 There are two reasons to add an API to mark a node as stale even if the 
 timeout is not yet reached:
  1) ZooKeeper can detect that a client is dead at any moment. So, for HBase, 
 we sometimes start the recovery before a node is marked stale (even with 
 reasonable settings such as: stale: 20s; HBase ZK timeout: 30s).
  2) Some third parties could detect that a node is dead before the timeout, 
 hence saving us the cost of retrying. An example of such hardware is Arista, 
 presented here by [~tsuna] 
 http://tsunanet.net/~tsuna/fsf-hbase-meetup-april13.pdf, and confirmed in 
 HBASE-6290.
 As usual, even if the node is dead it can come back before the 10-minute 
 limit. So I would propose to set a time bound. The API would be
 namenode.markStale(String ipAddress, int port, long durationInMs);
 After durationInMs, the namenode would again rely only on its heartbeat to 
 decide.
 Thoughts?
 If there are no objections, and if nobody on the hdfs dev team has the time to 
 spend on it, I will give it a try for branch 2 & 3.
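
For concreteness, a sketch of what the proposed call could look like. This is purely illustrative: no such method exists in ClientProtocol today, and the interface name below is made up; only the markStale signature comes from the description above.

{code}
import java.io.IOException;

// Hypothetical client-facing interface for the proposal above.
public interface StaleMarking {
  /**
   * Ask the namenode to treat the given datanode as stale for durationInMs,
   * after which it falls back to normal heartbeat-based detection.
   */
  void markStale(String ipAddress, int port, long durationInMs) throws IOException;
}

// Example: a ZK-based watcher that learned a node died could call
//   staleMarker.markStale("10.0.0.12", 50010, 30000L);
{code}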

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager

2013-05-28 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668772#comment-13668772
 ] 

Kihwal Lee commented on HDFS-4859:
--

bq. If the NameNode hangs, ZKFC will detect it.

I understand that ZKFC will detect the failures if NN does not respond to RPC 
calls or the internal resource check fails.  If all RPC handlers are waiting 
for a very long logSync() to finish, this may be detected as well. But if a 
couple of handlers are in trouble due to I/O hang and all others are happily 
serving reads, the error condition may not be detected in time. The situation 
will be different, of course, if the underlying journal flush can timeout.

I think adding a timeout will still be useful since users can run a combination of 
an HA-JM and FJM. Ideally, the NN should be able to detect and exclude failed 
storages with a predictable/configurable latency, regardless of the underlying 
implementation. 

 Add timeout in FileJournalManager
 -

 Key: HDFS-4859
 URL: https://issues.apache.org/jira/browse/HDFS-4859
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, namenode
Affects Versions: 2.0.4-alpha
Reporter: Kihwal Lee

 Due to the absence of an explicit timeout in FileJournalManager, error conditions 
 that incur a long delay (usually until the driver timeout) can make the namenode 
 unresponsive for a long time. This directly affects the NN's failure detection 
 latency, which is critical in HA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4862) SafeModeInfo.isManual() returns true when resources are low even if it wasn't entered into manually

2013-05-28 Thread Ravi Prakash (JIRA)
Ravi Prakash created HDFS-4862:
--

 Summary: SafeModeInfo.isManual() returns true when resources are 
low even if it wasn't entered into manually
 Key: HDFS-4862
 URL: https://issues.apache.org/jira/browse/HDFS-4862
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.0.4-alpha, 0.23.7, 3.0.0
Reporter: Ravi Prakash


HDFS-1594 changed isManual to this
{code}
private boolean isManual() {
  return extension == Integer.MAX_VALUE && !resourcesLow;
}
{code}
One immediate impact of this is that when resources are low, the NN will throw 
away all block reports from DNs. This is undesirable.
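
One way to disentangle the two conditions is to track how safe mode was entered with an explicit flag rather than inferring it from extension and resourcesLow. The following is a sketch of that general idea only, assuming fields inside a SafeModeInfo-like class; it is not the eventual fix for this issue.

{code}
private boolean enteredManually;   // set only by an explicit "dfsadmin -safemode enter"
private boolean resourcesLow;      // set by the NameNode resource checker

private boolean isManual() {
  return enteredManually;          // low resources alone no longer count as "manual"
}

private boolean isLowResourceSafeMode() {
  return resourcesLow && !enteredManually;
}
{code}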

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4863) The root directory should be added to the snapshottable directory list while loading fsimage

2013-05-28 Thread Jing Zhao (JIRA)
Jing Zhao created HDFS-4863:
---

 Summary: The root directory should be added to the snapshottable 
directory list while loading fsimage 
 Key: HDFS-4863
 URL: https://issues.apache.org/jira/browse/HDFS-4863
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Jing Zhao
Assignee: Jing Zhao


When the root directory is set as snapshottable, its snapshot quota is changed 
from 0 to a positive number. While loading fsimage we should check the root's 
snapshot quota and add it to snapshottable directory list in SnapshotManager if 
necessary.
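
A sketch of the check being described, with approximate method and class names; the actual change is what the attached patch implements.

{code}
// After the image has been loaded: if the root carries a positive snapshot
// quota, it was snapshottable when the image was saved, so register it.
void registerRootIfSnapshottable(INodeDirectorySnapshottable root,
                                 SnapshotManager snapshotManager) {
  if (root.getSnapshotQuota() > 0) {
    snapshotManager.addSnapshottable(root);
  }
}
{code}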

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4848) copyFromLocal and renaming a file to .snapshot should output that .snapshot is a reserved name

2013-05-28 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-4848:


Attachment: HDFS-4848.000.patch

The uploaded patch catches and re-throws the IllegalNameException when we find 
that ".snapshot" is used as the target name. I manually tested it in a local 
cluster and we get the expected ".snapshot is a reserved name" exception 
message with the patch.

I will add one more unit test to make sure the rename undo section still works.
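
As a rough illustration of the behaviour change (a simplified up-front check rather than the catch-and-rethrow the patch actually does; the method below is hypothetical):

{code}
// Reject ".snapshot" as a rename/copy target so the shell surfaces the
// descriptive message instead of a generic "Input/output error".
public static void verifyTargetName(String dst) {
  String lastComponent = dst.substring(dst.lastIndexOf('/') + 1);
  if (".snapshot".equals(lastComponent)) {
    throw new IllegalArgumentException(
        "\"" + lastComponent + "\" is a reserved name; it cannot be used as a rename or copy target");
  }
}
{code}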

 copyFromLocal and renaming a file to .snapshot should output that 
 .snapshot is a reserved name
 --

 Key: HDFS-4848
 URL: https://issues.apache.org/jira/browse/HDFS-4848
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Stephen Chu
Assignee: Jing Zhao
Priority: Trivial
  Labels: snapshot, snapshots
 Attachments: HDFS-4848.000.patch


 Might be an unlikely scenario, but if users copyFromLocal a file/dir from 
 local to HDFS and want the file/dir to be renamed to _.snapshot_, then 
 they'll see an "Input/output error" like the following:
 {code}
 schu-mbp:~ schu$ hdfs dfs -copyFromLocal testFile1 /tmp/.snapshot
 copyFromLocal: rename `/tmp/.snapshot._COPYING_' to `/tmp/.snapshot': 
 Input/output error
 {code}
 It'd be more clear if the error output was the usual _.snapshot is a 
 reserved name_ (this does show in the NN logs, though).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4863) The root directory should be added to the snapshottable directory list while loading fsimage

2013-05-28 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-4863:


Affects Version/s: 3.0.0

 The root directory should be added to the snapshottable directory list while 
 loading fsimage 
 -

 Key: HDFS-4863
 URL: https://issues.apache.org/jira/browse/HDFS-4863
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao

 When the root directory is set as snapshottable, its snapshot quota is 
 changed from 0 to a positive number. While loading fsimage we should check 
 the root's snapshot quota and add it to snapshottable directory list in 
 SnapshotManager if necessary.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4863) The root directory should be added to the snapshottable directory list while loading fsimage

2013-05-28 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-4863:


Component/s: snapshots

 The root directory should be added to the snapshottable directory list while 
 loading fsimage 
 -

 Key: HDFS-4863
 URL: https://issues.apache.org/jira/browse/HDFS-4863
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao

 When the root directory is set as snapshottable, its snapshot quota is 
 changed from 0 to a positive number. While loading fsimage we should check 
 the root's snapshot quota and add it to snapshottable directory list in 
 SnapshotManager if necessary.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4863) The root directory should be added to the snapshottable directory list while loading fsimage

2013-05-28 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-4863:


Labels: snapshots  (was: )

 The root directory should be added to the snapshottable directory list while 
 loading fsimage 
 -

 Key: HDFS-4863
 URL: https://issues.apache.org/jira/browse/HDFS-4863
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
  Labels: snapshots

 When the root directory is set as snapshottable, its snapshot quota is 
 changed from 0 to a positive number. While loading fsimage we should check 
 the root's snapshot quota and add it to snapshottable directory list in 
 SnapshotManager if necessary.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4863) The root directory should be added to the snapshottable directory list while loading fsimage

2013-05-28 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-4863:


Attachment: HDFS-4863.001.patch

A patch based on HDFS-4857.

 The root directory should be added to the snapshottable directory list while 
 loading fsimage 
 -

 Key: HDFS-4863
 URL: https://issues.apache.org/jira/browse/HDFS-4863
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
  Labels: snapshots
 Attachments: HDFS-4863.001.patch


 When the root directory is set as snapshottable, its snapshot quota is 
 changed from 0 to a positive number. While loading fsimage we should check 
 the root's snapshot quota and add it to snapshottable directory list in 
 SnapshotManager if necessary.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4847) hdfs dfs -count of a .snapshot directory fails claiming file does not exist

2013-05-28 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668835#comment-13668835
 ] 

Aaron T. Myers commented on HDFS-4847:
--

Hey Jing, not sure I agree with this reasoning. Why shouldn't `hadoop fs 
-count' work on a '.snapshot' pseudo-directory, just as it does on a real 
directory? I'd think it would just add up all of the files/space consumed in 
all of the snapshots under that pseudo-directory and report that back. I'd 
think that basically all read-only commands should work, much as they do in the 
special directories '.zfs' and '.snapshot' in ZFS and WAFL, respectively.

Also, note that, contrary to your last comment, `hadoop fs -du' does appear to 
currently work on a '.snapshot' directory:

{noformat}
$ hadoop fs -du .snapshot
3338  .snapshot/s20130528-165940.694
3338  .snapshot/s20130528-170045.101
3338  .snapshot/s20130528-170828.222
{noformat}

Though this output is not quite correct: only the last of these snapshots 
actually contains any files with non-zero space, yet they're all showing 
3338 bytes consumed:

{noformat}
$ hadoop fs -ls .snapshot/*
Found 2 items
drwxr-xr-x   - atm atm  0 2013-05-28 16:56 
.snapshot/s20130528-165940.694/bar
drwxr-xr-x   - atm atm  0 2013-05-28 16:56 
.snapshot/s20130528-165940.694/foo
Found 2 items
drwxr-xr-x   - atm atm  0 2013-05-28 16:56 
.snapshot/s20130528-170045.101/bar
drwxr-xr-x   - atm atm  0 2013-05-28 16:56 
.snapshot/s20130528-170045.101/foo
Found 3 items
-rw-r--r--   1 atm atm   3338 2013-05-28 17:08 
.snapshot/s20130528-170828.222/.bashrc
drwxr-xr-x   - atm atm  0 2013-05-28 16:56 
.snapshot/s20130528-170828.222/bar
drwxr-xr-x   - atm atm  0 2013-05-28 16:56 
.snapshot/s20130528-170828.222/foo
{noformat}

 hdfs dfs -count of a .snapshot directory fails claiming file does not exist
 ---

 Key: HDFS-4847
 URL: https://issues.apache.org/jira/browse/HDFS-4847
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Stephen Chu
  Labels: snapshot, snapshots

 I successfully allow snapshots for /tmp and create three snapshots. I verify 
 that the three snapshots are in /tmp/.snapshot.
 However, when I attempt _hdfs dfs -count /tmp/.snapshot_ I get a "file does 
 not exist" exception.
 Running -count on /tmp finds /tmp successfully.
 {code}
 schu-mbp:~ schu$ hadoop fs -ls /tmp/.snapshot
 2013-05-24 10:27:10,070 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 Found 3 items
 drwxr-xr-x   - schu supergroup  0 2013-05-24 10:26 /tmp/.snapshot/s1
 drwxr-xr-x   - schu supergroup  0 2013-05-24 10:27 /tmp/.snapshot/s2
 drwxr-xr-x   - schu supergroup  0 2013-05-24 10:27 /tmp/.snapshot/s3
 schu-mbp:~ schu$ hdfs dfs -count /tmp
 2013-05-24 10:27:20,510 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
   120  0 /tmp
 schu-mbp:~ schu$ hdfs dfs -count /tmp/.snapshot
 2013-05-24 10:27:30,397 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 count: File does not exist: /tmp/.snapshot
 schu-mbp:~ schu$ hdfs dfs -count -q /tmp/.snapshot
 2013-05-24 10:28:23,252 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 count: File does not exist: /tmp/.snapshot
 schu-mbp:~ schu$
 {code}
 In the NN logs, I see:
 {code}
 2013-05-24 10:27:30,857 INFO  [IPC Server handler 6 on 8020] 
 FSNamesystem.audit (FSNamesystem.java:logAuditEvent(6143)) - allowed=true 
ugi=schu (auth:SIMPLE)  ip=/127.0.0.1   cmd=getfileinfo src=/tmp/.snapshot 
  dst=nullperm=null
 2013-05-24 10:27:30,891 ERROR [IPC Server handler 7 on 8020] 
 security.UserGroupInformation (UserGroupInformation.java:doAs(1492)) - 
 PriviledgedActionException as:schu (auth:SIMPLE) 
 cause:java.io.FileNotFoundException: File does not exist: /tmp/.snapshot
 2013-05-24 10:27:30,891 INFO  [IPC Server handler 7 on 8020] ipc.Server 
 (Server.java:run(1864)) - IPC Server handler 7 on 8020, call 
 org.apache.hadoop.hdfs.protocol.ClientProtocol.getContentSummary from 
 127.0.0.1:49738: error: java.io.FileNotFoundException: File does not exist: 
 /tmp/.snapshot
 java.io.FileNotFoundException: File does not exist: /tmp/.snapshot
   at 
 

[jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager

2013-05-28 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668900#comment-13668900
 ] 

Colin Patrick McCabe commented on HDFS-4859:


The Linux kernel doesn't allow you to set a timeout on I/O operations, unless 
you use O_DIRECT and async I/O.  If operations with the local filesystem take 
longer than you would like, what are you going to have the NameNode do?  Kill 
itself?  It can't even kill itself if it is hung on a write, because the 
process will be in D state, otherwise known as uninterruptible sleep.

In this scenario, the NameNode worker thread will be blocked forever, probably 
while holding the FSImage lock.  There is nothing you can do.  You can't kill 
the thread, and even if you could, how would you get the mutex back?  There is 
nothing Java can do when the OS decides your thread cannot run.

The solution to your problem is that you can easily set a timeout on NFS 
operations by using a soft mount plus {{timeo=60}} (or whatever timeout you 
want).

 Add timeout in FileJournalManager
 -

 Key: HDFS-4859
 URL: https://issues.apache.org/jira/browse/HDFS-4859
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, namenode
Affects Versions: 2.0.4-alpha
Reporter: Kihwal Lee

 Due to the absence of an explicit timeout in FileJournalManager, error conditions 
 that incur a long delay (usually until the driver timeout) can make the namenode 
 unresponsive for a long time. This directly affects the NN's failure detection 
 latency, which is critical in HA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4832) Namenode doesn't change the number of missing blocks in safemode when DNs rejoin or leave

2013-05-28 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated HDFS-4832:
---

Attachment: HDFS-4832.patch

Thanks for your review Kihwal. I've updated the patch.
bq. isInStartupSafeMode() returns true for any auto safe mode. E.g. if the 
resource checker puts NN in safe mode, it will return true.
I have filed HDFS-4862 to fix this. The method name is unfortunately contrary 
to its behavior.
{quote}
The existing code drained scheduled work in safe mode, but the patch makes it 
immediately stop sending scheduled work to DNs. This seems correct behavior 
for safe mode, but that work can be sent out after leaving safe mode. That may 
not be ideal. E.g. if NN is suffering from a flaky DNS, DNs will appear dead, 
come back and dead again, generating a lot of invalidation and replication 
work. Admins may put NN in safe mode to safely pass the storm. When they do, 
the unnecessary work needs to stop rather than being delayed. Please make sure 
unintended damage does not occur after leaving safe mode.
{quote}
UnderReplicatedBlocks is the priority queue maintained for neededReplications, 
and it is updated when nodes join or are marked dead. However, once 
BlockManager.computeReplicationWorkForBlocks is called, the ReplicationWork is 
transferred to the DatanodeDescriptor's replicateBlocks queue, from which it 
will not be rescinded. The computeReplicationWorkForBlocks() is called every 
replicationRecheckInterval which defaults to 3 seconds. Can we please handle 
this in a separate JIRA?

 Namenode doesn't change the number of missing blocks in safemode when DNs 
 rejoin or leave
 -

 Key: HDFS-4832
 URL: https://issues.apache.org/jira/browse/HDFS-4832
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta
Reporter: Ravi Prakash
Assignee: Ravi Prakash
Priority: Critical
 Attachments: HDFS-4832.patch, HDFS-4832.patch, HDFS-4832.patch


 Courtesy Karri VRK Reddy!
 {quote}
 1. Namenode lost datanodes causing missing blocks
 2. Namenode was put in safe mode
 3. Datanode restarted on dead nodes 
 4. Waited for lots of time for the NN UI to reflect the recovered blocks.
 5. Forced NN out of safe mode and suddenly,  no more missing blocks anymore.
 {quote}
 I was able to replicate this on 0.23 and trunk. I set 
 dfs.namenode.heartbeat.recheck-interval to 1 and killed the DN to simulate 
 lost datanode. The opposite case also has problems (i.e. Datanode failing 
 when NN is in safemode, doesn't lead to a missing blocks message)
 Without the NN updating this list of missing blocks, the grid admins will not 
 know when to take the cluster out of safemode.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4832) Namenode doesn't change the number of missing blocks in safemode when DNs rejoin or leave

2013-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668973#comment-13668973
 ] 

Hadoop QA commented on HDFS-4832:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12585143/HDFS-4832.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.TestFSNamesystem

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/4446//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4446//console

This message is automatically generated.

 Namenode doesn't change the number of missing blocks in safemode when DNs 
 rejoin or leave
 -

 Key: HDFS-4832
 URL: https://issues.apache.org/jira/browse/HDFS-4832
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta
Reporter: Ravi Prakash
Assignee: Ravi Prakash
Priority: Critical
 Attachments: HDFS-4832.patch, HDFS-4832.patch, HDFS-4832.patch


 Courtesy Karri VRK Reddy!
 {quote}
 1. Namenode lost datanodes causing missing blocks
 2. Namenode was put in safe mode
 3. Datanode restarted on dead nodes 
 4. Waited for lots of time for the NN UI to reflect the recovered blocks.
 5. Forced NN out of safe mode and suddenly,  no more missing blocks anymore.
 {quote}
 I was able to replicate this on 0.23 and trunk. I set 
 dfs.namenode.heartbeat.recheck-interval to 1 and killed the DN to simulate 
 lost datanode. The opposite case also has problems (i.e. Datanode failing 
 when NN is in safemode, doesn't lead to a missing blocks message)
 Without the NN updating this list of missing blocks, the grid admins will not 
 know when to take the cluster out of safemode.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4842) Snapshot: identify the correct prior snapshot when deleting a snapshot under a renamed subtree

2013-05-28 Thread Tsz Wo (Nicholas), SZE (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo (Nicholas), SZE updated HDFS-4842:
-

Hadoop Flags: Reviewed

+1 patch looks good.

 Snapshot: identify the correct prior snapshot when deleting a snapshot under 
 a renamed subtree
 --

 Key: HDFS-4842
 URL: https://issues.apache.org/jira/browse/HDFS-4842
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-4842.000.patch, HDFS-4842.001.patch, 
 HDFS-4842.002.patch


 In our long-term running tests for snapshot we find the following bug:
 1. initially we have directories /test, /test/dir1 and /test/dir2/foo. 
 2. first take snapshot s0 and s1 on /test.
 3. modify some descendant of foo (e.g., delete foo/bar/file), to make sure 
 some changes have been recorded to the snapshot diff associated with s1.
 4. take snapshot s2 on /test/dir2
 5. move foo from dir2 to dir1, i.e., rename /test/dir2/foo to /test/dir1/foo
 6. delete snapshot s1
 After step 6, the snapshot copy of foo/bar/file should have been merged from 
 s1 to s0 (i.e., s0 should be identified as the prior snapshot of s1). 
 However, the current code failed to identify the correct prior snapshot in 
 the source tree of the rename operation and wrongly used s2 as the prior 
 snapshot.
 The bug only exists when nested snapshottable directories are enabled. To fix 
 the bug, we need to go upwards in the source tree of the rename operation 
 (i.e., dir2) to identify the correct prior snapshot in the above scenario. 
 This jira will fix the bug and add several corresponding unit tests. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.

2013-05-28 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668993#comment-13668993
 ] 

Konstantin Shvachko commented on HDFS-4849:
---

Given Matthew's comment I think I should have provided more motivation for the 
issue first. The idea to make these changes comes from the desire to have MR 
and YARN run the jobs without interruption in HA case.
Today if NameNode dies and failover to StandbyNode occurs some jobs can fail. 
This mostly depends on whether failure of NN happened during idempotent or 
non-idempotent operation. Idempotent operations, like getBlockLocations or 
addBlock, are retried and the client will eventually complete such operation 
via StandbyNode, when SBN becomes active. Non-idempotent operations like create 
and delete are not retried, they just fail. Therefore, MR job fails if it tries 
to create an output file for a reducer or delete a directory at cleanup stage 
just at the moment NN crashes. While if it could retry the create on SBN, it 
would have succeeded.
So we might need to compromise and loosen the semantics of some HDFS operations 
in order to satisfy stricter availability and scalability requirements. And we 
better do it now before APIs are frozen for branch 2.

 Idempotent create, append and delete operations.
 

 Key: HDFS-4849
 URL: https://issues.apache.org/jira/browse/HDFS-4849
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.0.4-alpha
Reporter: Konstantin Shvachko
Assignee: Konstantin Shvachko

 create, append and delete operations can be made idempotent. This will reduce 
 the chance of job or other application failures when the NN fails over.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4857) Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into INodeMap when loading FSImage

2013-05-28 Thread Tsz Wo (Nicholas), SZE (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo (Nicholas), SZE updated HDFS-4857:
-

Hadoop Flags: Reviewed

+1 patch looks good.

 Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into 
 INodeMap when loading FSImage
 --

 Key: HDFS-4857
 URL: https://issues.apache.org/jira/browse/HDFS-4857
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
  Labels: snapshots
 Attachments: HDFS-4857.001.patch, HDFS-4857.002.patch


 Snapshot.Root, though a subclass of INodeDirectory, is only used to indicate 
 the root of a snapshot. Meanwhile, AbstractINodeDiff#snapshotINode is used as 
 a copy recording the original state of an INode. Thus we should not put them 
 into the INodeMap. 
 Currently, when loading the FSImage, we do not check the type of inode and 
 wrongly put these two kinds of nodes into the INodeMap. This may replace nodes 
 that should stay in the INodeMap.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.

2013-05-28 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669000#comment-13669000
 ] 

Konstantin Shvachko commented on HDFS-4849:
---

Steve, you've got a good example. Rename is the trickiest operation in file 
systems, and concurrency logic is always an issue in distributed storage.
Suppose that in current HDFS client1 renames (moves) file A to B while B 
exists, which should replace B with the contents of A. Suppose that at the same 
time client2 deletes file B. Since there is no guarantee which operation is 
executed first, you can either end up with A renamed to B, if the delete goes 
first, or with no files at all if the rename prevails and is followed by the 
deletion of B.
This is similar to your case. When client1 retries, from its perspective the 
delete was not completed, so it deletes again. And it is no different from the 
case where client1 is slow and executes the delete after the rename, or where 
there are other clients besides 1 and 2 doing something with /path.
My point is that if you need to coordinate clients, you should do it with some 
external tool, like ZK.
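For example, the two clients above could serialize the rename and the delete 
through an external lock. A minimal sketch with Apache Curator (illustrative 
only; not something this jira proposes; exception handling elided):
{code}
// Both clients take the same ZK lock before touching B.
FileSystem fs = FileSystem.get(conf);                       // conf assumed from context
Path fileA = new Path("/path/A"), fileB = new Path("/path/B");

CuratorFramework zk = CuratorFrameworkFactory.newClient(
    "zk1:2181", new ExponentialBackoffRetry(1000, 3));
zk.start();
InterProcessMutex lock = new InterProcessMutex(zk, "/locks/fileB");
if (lock.acquire(30, TimeUnit.SECONDS)) {
  try {
    fs.rename(fileA, fileB);   // or fs.delete(fileB, false) in client2
  } finally {
    lock.release();
  }
}
{code}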

 Idempotent create, append and delete operations.
 

 Key: HDFS-4849
 URL: https://issues.apache.org/jira/browse/HDFS-4849
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.0.4-alpha
Reporter: Konstantin Shvachko
Assignee: Konstantin Shvachko

 create, append and delete operations can be made idempotent. This will reduce 
 the chance of job or other application failures when the NN fails over.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4863) The root directory should be added to the snapshottable directory list while loading fsimage

2013-05-28 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669004#comment-13669004
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-4863:
--

How about we remove the first if-statement and combine the non-root case into 
the second if-statement?  I.e.
{code}
  if (numSnapshots >= 0) {
    final INodeDirectorySnapshottable snapshottableParent
        = INodeDirectorySnapshottable.valueOf(parent, parent.getLocalName());
    // load snapshots and snapshotQuota
    SnapshotFSImageFormat.loadSnapshotList(snapshottableParent,
        numSnapshots, in, this);
    if (snapshottableParent.getSnapshotQuota() > 0) {
      this.namesystem.getSnapshotManager().addSnapshottable(
          snapshottableParent);
    }
  }
{code}


 The root directory should be added to the snapshottable directory list while 
 loading fsimage 
 -

 Key: HDFS-4863
 URL: https://issues.apache.org/jira/browse/HDFS-4863
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
  Labels: snapshots
 Attachments: HDFS-4863.001.patch


 When the root directory is set as snapshottable, its snapshot quota is 
 changed from 0 to a positive number. While loading fsimage we should check 
 the root's snapshot quota and add it to snapshottable directory list in 
 SnapshotManager if necessary.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.

2013-05-28 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669007#comment-13669007
 ] 

Konstantin Shvachko commented on HDFS-4849:
---

 A given client may have multiple threads race to create the same file, and 
 those threads would share the same client name (and hence lease).

If you go through the FileSystem API that should not happen, because 
DFSClient.clientName includes the thread name. Yes, I can write a test that 
directly makes RPC calls to the NN and fakes them with the same clientName for 
different threads. But that is not a public HDFS API, and there are many other 
ways to abuse the system. Should we care?

 the only solution I could come up with is something like NFS's duplicate 
 request cache

Great point, Todd. That would be a universal solution for deletes, renames, 
even concat. I was thinking about it too, but thought it would complicate 
things.
I see only one issue with delete that prevents it from being idempotent: the 
return value, which must be true only if the deleted object existed and was 
actually deleted. This cannot be guaranteed through retries.
The semantics of delete should be that _the object does not exist after delete 
completes_. This seems idempotent to me. The return value should be treated as 
success or failure, same as in mkdir.
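Under that definition, a retried delete could be treated on the client side 
like this (a sketch of the proposed semantics, not a patch; deleteIdempotently 
is a made-up helper):
{code}
// Sketch only: with the relaxed semantics, "already gone" counts as success.
static boolean deleteIdempotently(FileSystem fs, Path p) throws IOException {
  try {
    fs.delete(p, true);
  } catch (FileNotFoundException alreadyGone) {
    // e.g. the first attempt succeeded on the NN right before the failover
  }
  return !fs.exists(p);   // success == the object does not exist after delete completes
}
{code}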

 Idempotent create, append and delete operations.
 

 Key: HDFS-4849
 URL: https://issues.apache.org/jira/browse/HDFS-4849
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.0.4-alpha
Reporter: Konstantin Shvachko
Assignee: Konstantin Shvachko

 create, append and delete operations can be made idempotent. This will reduce 
 chances for a job or other app failures when NN fails over.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4847) hdfs dfs -count of a .snapshot directory fails claiming file does not exist

2013-05-28 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669010#comment-13669010
 ] 

Suresh Srinivas commented on HDFS-4847:
---

bq. Also, note that, contrary to your last comment, `hadoop fs -du' does appear 
to currently work on a '.snapshot' directory:

I would rather turn that off.
I think .snapshot is not a directory. Let's not try to make it behave like one 
for administrative commands and add complexity. But if anyone wants to pursue 
it, let's do it in a separate jira. As it stands currently, the functionality 
expected by this jira is not supported.

 hdfs dfs -count of a .snapshot directory fails claiming file does not exist
 ---

 Key: HDFS-4847
 URL: https://issues.apache.org/jira/browse/HDFS-4847
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Stephen Chu
  Labels: snapshot, snapshots

 I successfully allow snapshots for /tmp and create three snapshots. I verify 
 that the three snapshots are in /tmp/.snapshot.
 However, when I attempt _hdfs dfs -count /tmp/.snapshot_ I get a file does 
 not exist exception.
 Running -count on /tmp finds /tmp successfully.
 {code}
 schu-mbp:~ schu$ hadoop fs -ls /tmp/.snapshot
 2013-05-24 10:27:10,070 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 Found 3 items
 drwxr-xr-x   - schu supergroup  0 2013-05-24 10:26 /tmp/.snapshot/s1
 drwxr-xr-x   - schu supergroup  0 2013-05-24 10:27 /tmp/.snapshot/s2
 drwxr-xr-x   - schu supergroup  0 2013-05-24 10:27 /tmp/.snapshot/s3
 schu-mbp:~ schu$ hdfs dfs -count /tmp
 2013-05-24 10:27:20,510 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
   120  0 /tmp
 schu-mbp:~ schu$ hdfs dfs -count /tmp/.snapshot
 2013-05-24 10:27:30,397 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 count: File does not exist: /tmp/.snapshot
 schu-mbp:~ schu$ hdfs dfs -count -q /tmp/.snapshot
 2013-05-24 10:28:23,252 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 count: File does not exist: /tmp/.snapshot
 schu-mbp:~ schu$
 {code}
 In the NN logs, I see:
 {code}
 2013-05-24 10:27:30,857 INFO  [IPC Server handler 6 on 8020] 
 FSNamesystem.audit (FSNamesystem.java:logAuditEvent(6143)) - allowed=true 
ugi=schu (auth:SIMPLE)  ip=/127.0.0.1   cmd=getfileinfo src=/tmp/.snapshot 
  dst=null  perm=null
 2013-05-24 10:27:30,891 ERROR [IPC Server handler 7 on 8020] 
 security.UserGroupInformation (UserGroupInformation.java:doAs(1492)) - 
 PriviledgedActionException as:schu (auth:SIMPLE) 
 cause:java.io.FileNotFoundException: File does not exist: /tmp/.snapshot
 2013-05-24 10:27:30,891 INFO  [IPC Server handler 7 on 8020] ipc.Server 
 (Server.java:run(1864)) - IPC Server handler 7 on 8020, call 
 org.apache.hadoop.hdfs.protocol.ClientProtocol.getContentSummary from 
 127.0.0.1:49738: error: java.io.FileNotFoundException: File does not exist: 
 /tmp/.snapshot
 java.io.FileNotFoundException: File does not exist: /tmp/.snapshot
   at 
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2267)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:3188)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:829)
   at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:726)
   at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48057)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1033)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1842)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1838)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489)
   at 

[jira] [Commented] (HDFS-4846) Snapshot CLI commands output stacktrace for invalid arguments

2013-05-28 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669013#comment-13669013
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-4846:
--

For the admin commands, we indeed should not call unwrapRemoteException in 
DFSClient's allowSnapshot()/disallowSnapshot().
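A rough sketch of what that could look like (DFSClient internals paraphrased 
here; not the actual patch):
{code}
// Sketch only: per the comment above, do not unwrap the RemoteException in the
// admin-command paths of DFSClient.
public void allowSnapshot(String snapshotRoot) throws IOException {
  checkOpen();
  namenode.allowSnapshot(snapshotRoot);   // no try/catch + unwrapRemoteException(...) here
}
{code}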

 Snapshot CLI commands output stacktrace for invalid arguments
 -

 Key: HDFS-4846
 URL: https://issues.apache.org/jira/browse/HDFS-4846
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Stephen Chu
Assignee: Jing Zhao
Priority: Minor
  Labels: snapshot
 Attachments: HDFS-4846.001.patch, HDFS-4846.002.patch, 
 HDFS-4846.003.patch


 It'd be useful to clean up the stacktraces output by the snapshot CLI 
 commands when the commands are used incorrectly. This will make things more 
 readable for operators and hopefully prevent confusion.
 Allowing a snapshot on a directory that doesn't exist
 {code}
 schu-mbp:~ schu$ hdfs dfsadmin -allowSnapshot adfasdf
 2013-05-23 15:46:46.052 java[24580:1203] Unable to load realm info from 
 SCDynamicStore
 2013-05-23 15:46:46,066 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 allowSnapshot: Directory does not exist: /user/schu/adfasdf
   at 
 org.apache.hadoop.hdfs.server.namenode.INodeDirectory.valueOf(INodeDirectory.java:52)
   at 
 org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.setSnapshottable(SnapshotManager.java:106)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.allowSnapshot(FSNamesystem.java:5861)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.allowSnapshot(NameNodeRpcServer.java:1121)
   at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.allowSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:932)
   at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48087)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1033)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1842)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1838)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1836)
 schu-mbp:~ schu$ 
 {code}
 Disallow a snapshot on a directory that isn't snapshottable
 {code}
 schu-mbp:~ schu$ hdfs dfsadmin -disallowSnapshot /user
 2013-05-23 15:49:07.251 java[24687:1203] Unable to load realm info from 
 SCDynamicStore
 2013-05-23 15:49:07,265 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 disallowSnapshot: Directory is not a snapshottable directory: /user
   at 
 org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.valueOf(INodeDirectorySnapshottable.java:68)
   at 
 org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.resetSnapshottable(SnapshotManager.java:151)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.disallowSnapshot(FSNamesystem.java:5889)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.disallowSnapshot(NameNodeRpcServer.java:1128)
   at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.disallowSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:943)
   at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48089)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1033)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1842)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1838)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1836)
 {code}
 Snapshot diffs with non-existent snapshot paths
 

[jira] [Commented] (HDFS-4848) copyFromLocal and renaming a file to .snapshot should output that .snapshot is a reserved name

2013-05-28 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669015#comment-13669015
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-4848:
--

For /.reserved, FSDirectory.addChild(..) throws HadoopIllegalArgumentException. 
 Let's throw the same exception for .snapshot.
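A hedged sketch of such a check (placement is illustrative; it would mirror the 
existing /.reserved handling in FSDirectory.addChild(..)):
{code}
// Illustrative only: reject ".snapshot" the same way "/.reserved" is rejected.
String name = DFSUtil.bytes2String(inode.getLocalNameBytes());
if (HdfsConstants.DOT_SNAPSHOT_DIR.equals(name)) {
  throw new HadoopIllegalArgumentException(
      "\"" + HdfsConstants.DOT_SNAPSHOT_DIR + "\" is a reserved name.");
}
{code}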

 copyFromLocal and renaming a file to .snapshot should output that 
 .snapshot is a reserved name
 --

 Key: HDFS-4848
 URL: https://issues.apache.org/jira/browse/HDFS-4848
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Stephen Chu
Assignee: Jing Zhao
Priority: Trivial
  Labels: snapshot, snapshots
 Attachments: HDFS-4848.000.patch


 Might be an unlikely scenario, but if users copyFromLocal a file/dir from 
 local to HDFS and want the file/dir to be renamed to _.snapshot_, then 
 they'll see an Input/output error like the following:
 {code}
 schu-mbp:~ schu$ hdfs dfs -copyFromLocal testFile1 /tmp/.snapshot
 copyFromLocal: rename `/tmp/.snapshot._COPYING_' to `/tmp/.snapshot': 
 Input/output error
 {code}
 It'd be clearer if the error output were the usual _.snapshot is a 
 reserved name_ (this does show up in the NN logs, though).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.

2013-05-28 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669018#comment-13669018
 ] 

Suresh Srinivas commented on HDFS-4849:
---

bq. because DFSClient.clientName includes the thread name...
How do you guarantee that RPC call ID + client name is unique?

bq. The semantics of delete should be that object does not exist after delete 
completes. This seems idempotent to me.
This definition is not complete. Slightly rephrasing: a _uniquely identified_ 
object does not exist after delete completes. In this regard, any deletion 
that identifies the object by its path, which is not unique, will not work. 
Between two retries, if another client creates the path being deleted, the 
second retry could delete a file that should not be deleted. I think the 
recently introduced fileID/inodeID can make delete idempotent, in cases where 
the client knows the file ID of the file. This will not work for deletions 
based on the path alone.
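To illustrate the race with a path-identified delete (a single-threaded 
stand-in for the two clients; paths are made up):
{code}
// Why a path does not uniquely identify the object being deleted.
Path out = new Path("/user/schu/output");
fs.delete(out, true);     // client1's delete executes on the NN, but the reply is lost
fs.create(out).close();   // client2 re-creates an unrelated file at the same path
fs.delete(out, true);     // client1's retry now removes client2's file
// A delete keyed on the file's inode ID (unique and never reused) would find
// nothing to delete at the retry and could safely report success instead.
{code}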




 Idempotent create, append and delete operations.
 

 Key: HDFS-4849
 URL: https://issues.apache.org/jira/browse/HDFS-4849
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.0.4-alpha
Reporter: Konstantin Shvachko
Assignee: Konstantin Shvachko

 create, append and delete operations can be made idempotent. This will reduce 
 the chance of job or other application failures when the NN fails over.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4845) FSEditLogLoader gets NPE while accessing INodeMap in TestEditLogRace

2013-05-28 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669019#comment-13669019
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-4845:
--

 It looks like loadInodeWithLocalName is only invoked during namenode startup, 
 except for tests so a lock may not be necessary. However I added it for 
 completeness.

I think the lock should be added in the tests instead.  Otherwise, it would 
slow down startup.
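i.e., roughly this shape in the test (sketch only; fsn is the test's 
FSNamesystem, and the comment stands in for whatever loading call the test 
already makes):
{code}
// Sketch: take the namesystem lock around the load in the test itself, instead
// of inside loadINodeWithLocalName, so normal startup does not pay for it.
fsn.writeLock();
try {
  // ... the test's existing call into FSImageFormat.Loader goes here ...
} finally {
  fsn.writeUnlock();
}
{code}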

 FSEditLogLoader gets NPE while accessing INodeMap in TestEditLogRace
 

 Key: HDFS-4845
 URL: https://issues.apache.org/jira/browse/HDFS-4845
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0
Reporter: Kihwal Lee
Assignee: Arpit Agarwal
Priority: Critical
 Attachments: HDFS-4845.001.patch, HDFS-4845.002.patch


 TestEditLogRace fails occasionally because it gets NPE from manipulating 
 INodeMap while loading edits.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira