[jira] [Updated] (HDFS-4827) Slight update to the implementation of API for handling favored nodes in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HDFS-4827:
    Hadoop Flags: Reviewed

+1 patch looks good.

Slight update to the implementation of API for handling favored nodes in DFSClient

Key: HDFS-4827
URL: https://issues.apache.org/jira/browse/HDFS-4827
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.0.5-beta
Reporter: Devaraj Das
Assignee: Devaraj Das
Fix For: 2.0.5-beta
Attachments: hdfs-4827-1.txt

Currently, the favoredNodes flavor of the DFSClient.create implementation calls _inetSocketAddressInstance.getAddress().getHostAddress()_. This doesn't work if the inetSocketAddressInstance is unresolved (i.e., an instance created via InetSocketAddress.createUnresolved()), because getAddress() then returns null. The DFSClient API should handle both kinds of favored-node InetSocketAddresses (resolved and unresolved) passed to it.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
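The distinction matters because getAddress() is null for an unresolved InetSocketAddress, so the chained call above throws a NullPointerException. A JDK-only sketch of handling both cases (hostOf is a hypothetical helper, not the actual DFSClient code):

```java
import java.net.InetSocketAddress;

public class FavoredNodeAddress {
    // Hypothetical helper illustrating the fix: fall back to the host name
    // when the address is unresolved and getAddress() would return null.
    public static String hostOf(InetSocketAddress addr) {
        return addr.isUnresolved()
                ? addr.getHostName()                    // unresolved: no InetAddress available
                : addr.getAddress().getHostAddress();   // resolved: use the numeric address
    }

    public static void main(String[] args) {
        InetSocketAddress unresolved =
                InetSocketAddress.createUnresolved("dn1.example.com", 50010);
        // The original chained call would NPE here, since getAddress() is null:
        System.out.println(unresolved.getAddress() == null);  // true
        System.out.println(hostOf(unresolved));               // dn1.example.com
    }
}
```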
[jira] [Created] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout
Jagane Sundar created HDFS-4858:

Summary: HDFS DataNode to NameNode RPC should time out
Key: HDFS-4858
URL: https://issues.apache.org/jira/browse/HDFS-4858
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.0.4-alpha, 3.0.0, 2.0.5-beta, 2.0.4.1-alpha
Environment: Redhat/CentOS 6.4 64 bit Linux
Reporter: Jagane Sundar
Priority: Minor
Fix For: 3.0.0, 2.0.5-beta

The DataNode is configured with ipc.client.ping false and ipc.ping.interval 14000. This configuration means that the IPC client (the DataNode, in this case) should time out in 14000 ms if the Standby NameNode does not respond to a sendHeartbeat. What we observe instead: if the Standby NameNode happens to reboot for any reason, the DataNodes heartbeating to that Standby get stuck forever in sendHeartbeat (see the stack trace below). When the Standby NameNode comes back up, the DataNode never re-registers with it, and thereafter failover fails completely. The desired behavior is that the DataNode's sendHeartbeat should time out in 14 seconds and keep retrying until the Standby NameNode comes back up; when it does, the DataNode should reconnect, re-register, and offer service. Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the method createNamenode should use RPC.getProtocolProxy rather than RPC.getProxy to create the DatanodeProtocolPB object.
Stack trace of the thread stuck in the DataNode after the Standby NN has rebooted:

Thread 25 (DataNode: [file:///opt/hadoop/data] heartbeating to vmhost6-vm1/10.10.10.151:8020):
  State: WAITING
  Blocked count: 23843
  Waited count: 45676
  Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.ipc.Client.call(Client.java:1220)
    org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
    sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:597)
    org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
    org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
    org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
    org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
    org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
    java.lang.Thread.run(Thread.java:662)

The DataNode RPC to the Standby NameNode never times out.
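The hang in the stack trace comes from an unbounded Object.wait() inside the IPC client. A minimal JDK-only sketch (not Hadoop code; names here are illustrative) of why a bounded wait restores liveness:

```java
public class RpcWaitDemo {
    // Wait on the lock for at most timeoutMs; returns elapsed milliseconds.
    public static long boundedWait(Object call, long timeoutMs) throws InterruptedException {
        long start = System.currentTimeMillis();
        synchronized (call) {
            // With no notify() ever arriving (the NN rebooted), a plain
            // call.wait() would block this thread forever -- the hang shown
            // above. A bounded wait returns so the caller can retry against
            // the NN once it is back.
            call.wait(timeoutMs);
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the org.apache.hadoop.ipc.Client$Call object the
        // heartbeat thread waits on.
        long elapsed = boundedWait(new Object(), 200);
        System.out.println("wait returned after ~" + elapsed + " ms; caller can retry");
    }
}
```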
[jira] [Commented] (HDFS-4832) Namenode doesn't change the number of missing blocks in safemode when DNs rejoin or leave
[ https://issues.apache.org/jira/browse/HDFS-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668318#comment-13668318 ]

Kihwal Lee commented on HDFS-4832:

Here are some comments:
* The condition for detecting non-initial safe mode: the HA state is already checked in the namenode method, so you don't have to check it again.
* isInStartupSafeMode() returns true for any automatic safe mode. E.g. if the resource checker puts the NN in safe mode, it will return true.
* The existing code drained scheduled work while in safe mode, but the patch makes the NN immediately stop sending scheduled work to DNs. This seems like correct behavior for safe mode, but that work can still be sent out after leaving safe mode, which may not be ideal. E.g. if the NN is suffering from flaky DNS, DNs will appear dead, come back, and appear dead again, generating a lot of invalidation and replication work. Admins may put the NN in safe mode to ride out the storm; when they do, the unnecessary work needs to stop rather than merely being delayed. Please make sure unintended damage does not occur after leaving safe mode.

Namenode doesn't change the number of missing blocks in safemode when DNs rejoin or leave

Key: HDFS-4832
URL: https://issues.apache.org/jira/browse/HDFS-4832
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta
Reporter: Ravi Prakash
Assignee: Ravi Prakash
Priority: Critical
Attachments: HDFS-4832.patch, HDFS-4832.patch

Courtesy Karri VRK Reddy!
{quote}
1. Namenode lost datanodes, causing missing blocks
2. Namenode was put in safe mode
3. Datanode restarted on dead nodes
4. Waited a long time for the NN UI to reflect the recovered blocks
5. Forced NN out of safe mode and suddenly, no more missing blocks
{quote}
I was able to replicate this on 0.23 and trunk. I set dfs.namenode.heartbeat.recheck-interval to 1 and killed the DN to simulate a lost datanode. The opposite case also has problems (i.e. a Datanode failing while the NN is in safemode doesn't lead to a missing-blocks message). Without the NN updating this list of missing blocks, the grid admins will not know when to take the cluster out of safemode.
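The drain-vs-drop distinction in Kihwal's last comment can be sketched with a toy scheduler (hypothetical names, not the NN's actual data structures): pausing dispatch merely delays the queued work until safe mode ends, while an admin-initiated safe mode may want to discard it.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class SafeModeWorkQueue {
    private final Queue<String> pending = new ArrayDeque<>();
    private boolean inSafeMode = false;

    public void schedule(String work) { pending.add(work); }

    // Admin-initiated safe mode can drop the queued invalidation/replication
    // work outright, so it is stopped rather than merely delayed until the
    // NN leaves safe mode.
    public void enterSafeMode(boolean dropPendingWork) {
        inSafeMode = true;
        if (dropPendingWork) pending.clear();
    }

    public void leaveSafeMode() { inSafeMode = false; }

    // Returns the next work item for a DN, or null while in safe mode.
    public String dispatch() { return inSafeMode ? null : pending.poll(); }

    public static void main(String[] args) {
        SafeModeWorkQueue q = new SafeModeWorkQueue();
        q.schedule("replicate blk_1");
        q.enterSafeMode(true);              // storm: drop, don't delay
        q.leaveSafeMode();
        System.out.println(q.dispatch());   // null: the stale work is gone
    }
}
```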
[jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.
[ https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668329#comment-13668329 ]

Matthew Farrellee commented on HDFS-4849:

As a member of the community trying to create a FileSystem implementation, I view these proposed changes as significant deviations from the semantics being described as part of HADOOP-9371. The changes will improve some use cases while worsening others, and the ability to implement them across all FileSystems will also vary dramatically. Please discuss the possibility of FileSystems optionally implementing these enhanced semantics as part of HADOOP-9371, and do not add them to a 2.0.X release.

Idempotent create, append and delete operations

Key: HDFS-4849
URL: https://issues.apache.org/jira/browse/HDFS-4849
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.0.4-alpha
Reporter: Konstantin Shvachko
Assignee: Konstantin Shvachko

create, append and delete operations can be made idempotent. This will reduce the chance of job or other application failures when the NN fails over.
[jira] [Updated] (HDFS-4754) Add an API in the namenode to mark a datanode as stale
[ https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Liochon updated HDFS-4754:
    Status: Open (was: Patch Available)

Add an API in the namenode to mark a datanode as stale

Key: HDFS-4754
URL: https://issues.apache.org/jira/browse/HDFS-4754
Project: Hadoop HDFS
Issue Type: Improvement
Components: hdfs-client, namenode
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
Priority: Critical
Attachments: 4754.v1.patch

HDFS has detected stale datanodes since HDFS-3703, using a timeout that defaults to 30s. There are two reasons to add an API to mark a node as stale even if the timeout has not yet been reached:
1) ZooKeeper can detect that a client is dead at any moment. So, for HBase, we sometimes start the recovery before a node is marked stale (even with reasonable settings such as stale: 20s; HBase ZK timeout: 30s).
2) Some third parties could detect that a node is dead before the timeout, saving us the cost of retrying. An example of such hardware is Arista, presented here by [~tsuna] http://tsunanet.net/~tsuna/fsf-hbase-meetup-april13.pdf, and confirmed in HBASE-6290.
As usual, even if the node is dead it can come back before the 10 minute limit, so I would propose to set a time bound. The API would be namenode.markStale(String ipAddress, int port, long durationInMs); after durationInMs, the namenode would again rely only on its heartbeat to decide. Thoughts? If there are no objections, and if nobody on the hdfs dev team has the time to spend on it, I will give it a try for branch 2 & 3.
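A minimal sketch of how the proposed markStale(ip, port, durationInMs) time bound could be tracked, assuming a simple in-memory expiry map (hypothetical class and field names, not the actual namenode code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StaleNodeTracker {
    // node key ("ip:port") -> wall-clock time (ms) until which it is forced stale
    private final Map<String, Long> forcedStaleUntil = new ConcurrentHashMap<>();

    // Sketch of the proposed namenode.markStale(ipAddress, port, durationInMs).
    public void markStale(String ipAddress, int port, long durationInMs) {
        forcedStaleUntil.put(ipAddress + ":" + port,
                System.currentTimeMillis() + durationInMs);
    }

    // After the deadline passes, fall back to the heartbeat-based decision.
    public boolean isStale(String ipAddress, int port, boolean staleByHeartbeat) {
        Long until = forcedStaleUntil.get(ipAddress + ":" + port);
        if (until != null && System.currentTimeMillis() < until) {
            return true;             // still within the forced-stale window
        }
        return staleByHeartbeat;     // time bound expired: heartbeat decides
    }

    public static void main(String[] args) {
        StaleNodeTracker t = new StaleNodeTracker();
        t.markStale("10.0.0.1", 50010, 60_000);
        System.out.println(t.isStale("10.0.0.1", 50010, false));  // true within the window
    }
}
```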
[jira] [Updated] (HDFS-4754) Add an API in the namenode to mark a datanode as stale
[ https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Liochon updated HDFS-4754:
    Status: Patch Available (was: Open)

Add an API in the namenode to mark a datanode as stale
Key: HDFS-4754
Attachments: 4754.v1.patch, 4754.v2.patch
[jira] [Updated] (HDFS-4754) Add an API in the namenode to mark a datanode as stale
[ https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Liochon updated HDFS-4754:
    Attachment: 4754.v2.patch

Add an API in the namenode to mark a datanode as stale
Key: HDFS-4754
Attachments: 4754.v1.patch, 4754.v2.patch
[jira] [Commented] (HDFS-4754) Add an API in the namenode to mark a datanode as stale
[ https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668338#comment-13668338 ]

Nicolas Liochon commented on HDFS-4754:

v2 takes the comments above into account. If the method is called with a duration of zero, we use the configured stale-node duration.

Add an API in the namenode to mark a datanode as stale
Key: HDFS-4754
Attachments: 4754.v1.patch, 4754.v2.patch
[jira] [Created] (HDFS-4859) Add timeout in FileJournalManager
Kihwal Lee created HDFS-4859:

Summary: Add timeout in FileJournalManager
Key: HDFS-4859
URL: https://issues.apache.org/jira/browse/HDFS-4859
Project: Hadoop HDFS
Issue Type: Bug
Components: ha, namenode
Affects Versions: 2.0.4-alpha
Reporter: Kihwal Lee

Due to the absence of an explicit timeout in FileJournalManager, error conditions that incur a long delay (usually until the driver times out) can make the namenode unresponsive for a long time. This directly affects the NN's failure-detection latency, which is critical in HA.
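One common way to bound such an operation, sketched under the assumption that the journal write can run on a helper thread (names are illustrative, not FileJournalManager's API):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedJournalWrite {
    // Bound a potentially hanging disk write with a timeout so the caller
    // (here, standing in for the NN) stays responsive.
    public static void writeWithTimeout(Runnable write, long timeoutMs)
            throws TimeoutException {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try {
            ex.submit(write).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            ex.shutdownNow();   // interrupt the hung write, if any
        }
    }

    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }

    public static void main(String[] args) throws Exception {
        writeWithTimeout(() -> System.out.println("fast edit-log write completed"), 1000);
        try {
            writeWithTimeout(() -> sleepQuietly(5_000), 100);   // simulates a hung disk
        } catch (TimeoutException e) {
            System.out.println("write timed out; NN can react instead of hanging");
        }
    }
}
```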
[jira] [Updated] (HDFS-4696) Branch 0.23 Patch for Block Replication Policy Implementation May Skip Higher-Priority Blocks for Lower-Priority Blocks
[ https://issues.apache.org/jira/browse/HDFS-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated HDFS-4696:
    Target Version/s: 0.23.9 (was: 0.23.8)

Branch 0.23 Patch for Block Replication Policy Implementation May Skip Higher-Priority Blocks for Lower-Priority Blocks

Key: HDFS-4696
URL: https://issues.apache.org/jira/browse/HDFS-4696
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 0.23.5
Reporter: Derek Dagit
Assignee: Derek Dagit

This JIRA tracks the solution to HDFS-4366 for the 0.23 branch.
[jira] [Updated] (HDFS-4576) Webhdfs authentication issues
[ https://issues.apache.org/jira/browse/HDFS-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated HDFS-4576:
    Target Version/s: 3.0.0, 2.0.5-beta, 0.23.9 (was: 3.0.0, 2.0.5-beta, 0.23.8)

Webhdfs authentication issues

Key: HDFS-4576
URL: https://issues.apache.org/jira/browse/HDFS-4576
Project: Hadoop HDFS
Issue Type: Bug
Components: webhdfs
Affects Versions: 2.0.0-alpha, 3.0.0, 0.23.7
Reporter: Daryn Sharp
Assignee: Daryn Sharp

Umbrella jira to track the webhdfs authentication issues as subtasks.
[jira] [Updated] (HDFS-4587) Webhdfs secure clients are incompatible with non-secure NN
[ https://issues.apache.org/jira/browse/HDFS-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated HDFS-4587:
    Target Version/s: 3.0.0, 2.0.5-beta, 0.23.9 (was: 3.0.0, 2.0.5-beta, 0.23.8)

Webhdfs secure clients are incompatible with non-secure NN

Key: HDFS-4587
URL: https://issues.apache.org/jira/browse/HDFS-4587
Project: Hadoop HDFS
Issue Type: Sub-task
Components: namenode, webhdfs
Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha
Reporter: Daryn Sharp

A secure webhdfs client will receive an exception from a non-secure NN. For a NN in non-secure mode, {{FSNamesystem#getDelegationToken}} returns null to indicate that no token is required. Hdfs will send the null back to the client, but webhdfs uses {{DelegationTokenSecretManager.createCredentials}}, which instead throws an exception.
[jira] [Commented] (HDFS-4780) Use the correct relogin method for services
[ https://issues.apache.org/jira/browse/HDFS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668381#comment-13668381 ]

Kihwal Lee commented on HDFS-4780:

+1 the patch looks good.

Use the correct relogin method for services

Key: HDFS-4780
URL: https://issues.apache.org/jira/browse/HDFS-4780
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 3.0.0, 2.0.5-beta, 0.23.8
Reporter: Kihwal Lee
Assignee: Robert Parker
Priority: Minor
Attachments: HDFS-4780-branch0.23v1.patch, HDFS-4780v1.patch

A number of components call reloginFromKeytab() before making requests. For StandbyCheckpointer and SecondaryNameNode, where this can be called frequently, it generates many WARN messages like this:
WARN security.UserGroupInformation: Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
Other than these messages, it doesn't do anything wrong, but it would be nice to change the callers to checkTGTAndReloginFromKeytab() to avoid the potentially misleading WARN messages.
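A toy model of the 600-second gate that produces this WARN (hypothetical names; the real logic lives in Hadoop's UserGroupInformation): an unconditional reloginFromKeytab()-style caller hits the gate every time, so frequent callers log the warning repeatedly, while a checkTGTAndReloginFromKeytab()-style caller first tests whether the ticket actually needs refreshing and usually never reaches it.

```java
public class ReloginGate {
    static final long MIN_MS_BEFORE_RELOGIN = 600_000;  // the 600s in the WARN above
    private long lastAttemptMs = Long.MIN_VALUE / 2;    // "never attempted"

    // Returns true if a re-login is attempted; false if skipped with a WARN.
    public boolean tryRelogin(long nowMs) {
        if (nowMs - lastAttemptMs < MIN_MS_BEFORE_RELOGIN) {
            System.out.println("WARN: Not attempting to re-login since the last "
                    + "re-login was attempted less than 600 seconds before.");
            return false;
        }
        lastAttemptMs = nowMs;
        return true;   // a real implementation would re-acquire the TGT here
    }

    public static void main(String[] args) {
        ReloginGate gate = new ReloginGate();
        gate.tryRelogin(0);          // first call proceeds quietly
        gate.tryRelogin(100_000);    // 100s later: prints the WARN and skips
    }
}
```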
[jira] [Updated] (HDFS-4780) Use the correct relogin method for services
[ https://issues.apache.org/jira/browse/HDFS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-4780:
    Resolution: Fixed
    Status: Resolved (was: Patch Available)

Thanks for working on the patch, Rob.

Use the correct relogin method for services
Key: HDFS-4780
Fix For: 3.0.0, 2.0.5-beta
[jira] [Updated] (HDFS-4780) Use the correct relogin method for services
[ https://issues.apache.org/jira/browse/HDFS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-4780:
    Fix Version/s: 2.0.5-beta, 3.0.0
    Hadoop Flags: Reviewed

I've committed this to trunk and branch-2.

Use the correct relogin method for services
Key: HDFS-4780
Fix For: 3.0.0, 2.0.5-beta
[jira] [Commented] (HDFS-4780) Use the correct relogin method for services
[ https://issues.apache.org/jira/browse/HDFS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668387#comment-13668387 ]

Daryn Sharp commented on HDFS-4780:

The change to the callers to invoke the check version seems ok. It looks like the more appropriate UGI change may be to modify {{reloginFromKeytab}} to switch the order of {{hasSufficientTimeElapsed}} (which generates the warning) and {{getRefreshTime}} (which short-circuits the renew attempt). On a related note, the two relogin methods appear equivalent aside from the {{hasSufficientTimeElapsed}} check. It would seem that only {{checkTGTAndReloginFromKeytab}} should be checking {{getRefreshTime}}; {{reloginFromKeytab}} should probably be unconditionally re-acquiring a TGT, hence not checking {{hasSufficientTimeElapsed}}.

Use the correct relogin method for services
Key: HDFS-4780
Fix For: 3.0.0, 2.0.5-beta
[jira] [Commented] (HDFS-4780) Use the correct relogin method for services
[ https://issues.apache.org/jira/browse/HDFS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668388#comment-13668388 ]

Hudson commented on HDFS-4780:

Integrated in Hadoop-trunk-Commit #3794 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3794/])
HDFS-4780. Use the correct relogin method for services. Contributed by Robert Parker. (Revision 1486974)
Result = SUCCESS
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1486974
Files:
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/HftpFileSystem.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GetImageServlet.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java

Use the correct relogin method for services
Key: HDFS-4780
Fix For: 3.0.0, 2.0.5-beta
[jira] [Updated] (HDFS-4860) Add additional attributes to JMX beans
[ https://issues.apache.org/jira/browse/HDFS-4860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Boudnik updated HDFS-4860:
    Affects Version/s: (was: 2.0.4-alpha) 2.0.5-beta, 3.0.0, 0.20.204.1

Add additional attributes to JMX beans

Key: HDFS-4860
URL: https://issues.apache.org/jira/browse/HDFS-4860
Project: Hadoop HDFS
Issue Type: Improvement
Components: namenode
Affects Versions: 0.20.204.1, 3.0.0, 2.0.5-beta
Reporter: Trevor Lorimer
Priority: Trivial
Attachments: 0001-Hadoop-namenode-JMX-metrics-update.patch

Currently the JMX bean returns much of the data contained on the HDFS health webpage (dfsHealth.html), but several other attributes need to be added. I intend to add the following items to the appropriate bean (named in parentheses): started time (NameNodeInfo), compiled info (NameNodeInfo), JVM MaxHeap and MaxNonHeap (JvmMetrics), node usage stats, i.e. min, median, max, stdev (NameNodeInfo), count of decommissioned live and dead nodes (FSNamesystemState), and journal status (NameNodeInfo).
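A sketch of how such attributes surface through JMX, using a hypothetical ExampleNodeInfoMXBean rather than the NameNode's real beans: each getter on the MXBean interface becomes a readable attribute that monitoring clients fetch by name.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class NodeInfoJmxDemo {
    // Hypothetical MXBean sketching how attributes such as started time and
    // compiled info could be exposed; the real attributes would go on the
    // NameNode's existing NameNodeInfo / FSNamesystemState beans.
    public interface ExampleNodeInfoMXBean {
        long getStartedTimeMs();
        String getCompiledInfo();
    }

    public static class ExampleNodeInfo implements ExampleNodeInfoMXBean {
        private final long started = System.currentTimeMillis();
        @Override public long getStartedTimeMs() { return started; }
        @Override public String getCompiledInfo() { return "example-build, 2013"; }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("Example:name=NodeInfo");
        server.registerMBean(new ExampleNodeInfo(), name);
        // A monitoring client reads the new attribute by name:
        System.out.println(server.getAttribute(name, "CompiledInfo"));
    }
}
```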
[jira] [Moved] (HDFS-4860) Add additional attributes to JMX beans
[ https://issues.apache.org/jira/browse/HDFS-4860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Boudnik moved HADOOP-9596 to HDFS-4860:
    Component/s: (was: fs) namenode
    Affects Version/s: (was: 2.0.4-alpha) 2.0.4-alpha
    Key: HDFS-4860 (was: HADOOP-9596)
    Project: Hadoop HDFS (was: Hadoop Common)

Add additional attributes to JMX beans
Key: HDFS-4860
Attachments: 0001-Hadoop-namenode-JMX-metrics-update.patch
[jira] [Updated] (HDFS-4860) Add additional attributes to JMX beans
[ https://issues.apache.org/jira/browse/HDFS-4860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Boudnik updated HDFS-4860:
    Priority: Major (was: Trivial)

Add additional attributes to JMX beans
Key: HDFS-4860
Attachments: 0001-Hadoop-namenode-JMX-metrics-update.patch
[jira] [Commented] (HDFS-4860) Add additional attributes to JMX beans
[ https://issues.apache.org/jira/browse/HDFS-4860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668408#comment-13668408 ] Hadoop QA commented on HDFS-4860: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12584618/0001-Hadoop-namenode-JMX-metrics-update.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build///console This message is automatically generated. Add additional attributes to JMX beans -- Key: HDFS-4860 URL: https://issues.apache.org/jira/browse/HDFS-4860 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 0.20.204.1, 3.0.0, 2.0.5-beta Reporter: Trevor Lorimer Attachments: 0001-Hadoop-namenode-JMX-metrics-update.patch Currently the JMX bean returns much of the data contained on the HDFS Health webpage (dfsHealth.html). However there are several other attributes that are required to be added. I intend to add the following items to the appropriate bean in parenthesis : Started time (NameNodeInfo), Compiled info (NameNodeInfo), Jvm MaxHeap, MaxNonHeap (JvmMetrics) Node Usage stats (i.e. Min, Median, Max, stdev) (NameNodeInfo), Count of decommissioned Live and Dead nodes (FSNamesystemState), Journal Status (NodeNameInfo) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
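The kind of change proposed for HDFS-4860 above, exposing additional read-only attributes through a JMX bean, can be sketched with the platform MBeanServer. The interface and attribute names below are illustrative stand-ins, not the actual NameNodeInfo/FSNamesystemState/JvmMetrics attributes from the patch:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxAttributeSketch {
    // Hypothetical MBean interface illustrating the kind of read-only
    // attributes the issue proposes (names are illustrative only).
    public interface NodeUsageMBean {
        long getStartedTimeMs();
        int getDecommissionedLiveNodes();
        int getDecommissionedDeadNodes();
    }

    // Standard MBean convention: implementation NodeUsage pairs with
    // interface NodeUsageMBean.
    public static class NodeUsage implements NodeUsageMBean {
        public long getStartedTimeMs() { return 1234L; }
        public int getDecommissionedLiveNodes() { return 2; }
        public int getDecommissionedDeadNodes() { return 1; }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("Example:name=NodeUsage");
        server.registerMBean(new NodeUsage(), name);
        // A JMX client (or a web UI like dfsHealth.html) can now read
        // the attribute by its derived name:
        Object v = server.getAttribute(name, "DecommissionedLiveNodes");
        System.out.println(v); // prints 2
    }
}
```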
[jira] [Commented] (HDFS-4754) Add an API in the namenode to mark a datanode as stale
[ https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668425#comment-13668425 ] Hadoop QA commented on HDFS-4754: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12585034/4754.v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4443//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4443//console This message is automatically generated. Add an API in the namenode to mark a datanode as stale -- Key: HDFS-4754 URL: https://issues.apache.org/jira/browse/HDFS-4754 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-client, namenode Reporter: Nicolas Liochon Assignee: Nicolas Liochon Priority: Critical Attachments: 4754.v1.patch, 4754.v2.patch There is a detection of the stale datanodes in HDFS since HDFS-3703, with a timeout, defaulted to 30s. 
There are two reasons to add an API to mark a node as stale even if the timeout is not yet reached: 1) ZooKeeper can detect that a client is dead at any moment. So, for HBase, we sometimes start the recovery before a node is marked stale (even with reasonable settings such as: stale: 20s; HBase ZK timeout: 30s). 2) Some third parties could detect that a node is dead before the timeout, hence saving us the cost of retrying. An example of such hardware is Arista, presented here by [~tsuna] http://tsunanet.net/~tsuna/fsf-hbase-meetup-april13.pdf, and confirmed in HBASE-6290. As usual, even if the node is dead it can come back before the 10-minute limit. So I would propose to set a time bound. The API would be namenode.markStale(String ipAddress, int port, long durationInMs); After durationInMs, the namenode would again rely only on its heartbeat to decide. Thoughts? If there are no objections, and if nobody in the hdfs dev team has the time to spend on it, I will give it a try for branches 2 and 3. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
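The proposed markStale API with a time bound could be backed by bookkeeping along these lines. Class and field names are hypothetical; this is a sketch of the time-bounded semantics described above, not NameNode code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of time-bounded manual staleness, as proposed:
// namenode.markStale(ipAddress, port, durationInMs).
public class StaleRegistry {
    private final Map<String, Long> staleUntil = new ConcurrentHashMap<>();

    public void markStale(String ipAddress, int port, long durationInMs) {
        staleUntil.put(ipAddress + ":" + port,
                System.currentTimeMillis() + durationInMs);
    }

    // After durationInMs elapses this returns false, so the NameNode
    // falls back to its normal heartbeat-based staleness detection.
    public boolean isManuallyStale(String ipAddress, int port) {
        Long until = staleUntil.get(ipAddress + ":" + port);
        return until != null && System.currentTimeMillis() < until;
    }
}
```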
[jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager
[ https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668447#comment-13668447 ] Todd Lipcon commented on HDFS-4859: --- Hey Kihwal. Are you planning on using NFS based HA? I'd highly recommend using QJM instead -- it has timeout features and has been much more reliable for us in production clusters. Add timeout in FileJournalManager - Key: HDFS-4859 URL: https://issues.apache.org/jira/browse/HDFS-4859 Project: Hadoop HDFS Issue Type: Bug Components: ha, namenode Affects Versions: 2.0.4-alpha Reporter: Kihwal Lee Due to absence of explicit timeout in FileJournalManager, error conditions that incur long delay (usually until driver timeout) can make namenode unresponsive for long time. This directly affects NN's failure detection latency, which is critical in HA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout
[ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-4858: -- Target Version/s: 2.0.5-beta Fix Version/s: (was: 2.0.5-beta) (was: 3.0.0) HDFS DataNode to NameNode RPC should timeout Key: HDFS-4858 URL: https://issues.apache.org/jira/browse/HDFS-4858 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 3.0.0, 2.0.5-beta, 2.0.4-alpha, 2.0.4.1-alpha Environment: Redhat/CentOS 6.4 64 bit Linux Reporter: Jagane Sundar Priority: Minor The DataNode is configured with ipc.client.ping false and ipc.ping.interval 14000. This configuration means that the IPC Client (the DataNode, in this case) should time out in 14 seconds (ipc.ping.interval is in milliseconds) if the Standby NameNode does not respond to a sendHeartbeat. What we observe is this: If the Standby NameNode happens to reboot for any reason, the DataNodes that are heartbeating to this Standby get stuck forever while trying to sendHeartbeat. See the stack trace included below. When the Standby NameNode comes back up, we find that the DataNode never re-registers with the Standby NameNode. Thereafter failover completely fails. The desired behavior is that the DataNode's sendHeartbeat should time out in 14 seconds, and keep retrying till the Standby NameNode comes back up. When it does, the DataNode should reconnect, re-register, and offer service. Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the method createNamenode should use RPC.getProtocolProxy and not RPC.getProxy to create the DatanodeProtocolPB object. 
Stack trace of thread stuck in the DataNode after the Standby NN has rebooted: Thread 25 (DataNode: [file:///opt/hadoop/data] heartbeating to vmhost6-vm1/10.10.10.151:8020): State: WAITING Blocked count: 23843 Waited count: 45676 Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5 Stack: java.lang.Object.wait(Native Method) java.lang.Object.wait(Object.java:485) org.apache.hadoop.ipc.Client.call(Client.java:1220) org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202) sun.proxy.$Proxy10.sendHeartbeat(Unknown Source) sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) java.lang.reflect.Method.invoke(Method.java:597) org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164) org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83) sun.proxy.$Proxy10.sendHeartbeat(Unknown Source) org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167) org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445) org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525) org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676) java.lang.Thread.run(Thread.java:662) DataNode RPC to Standby NameNode never times out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
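The stack trace above shows the IPC client parked in an unbounded Object.wait(). The desired behavior, waiting at most the configured interval and then letting the heartbeat loop retry, corresponds to a bounded wait. The class below is a simplified stand-in for illustration, not Hadoop IPC code:

```java
// Contrast between the observed behavior (Client.call parks in
// Object.wait() with no bound) and the desired one (a bounded wait
// that lets the caller give up and retry).
public class BoundedCallSketch {
    private final Object lock = new Object();
    private volatile boolean done = false;

    // Desired: wait at most timeoutMs for a response, then return false
    // so the heartbeat loop can retry against the NameNode.
    public boolean awaitResponse(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (lock) {
            while (!done) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false; // timed out: caller can retry
                }
                lock.wait(remaining); // bounded, unlike the hung wait()
            }
        }
        return true;
    }

    // Called when the RPC response arrives.
    public void complete() {
        synchronized (lock) {
            done = true;
            lock.notifyAll();
        }
    }
}
```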
[jira] [Updated] (HDFS-4849) Idempotent create, append and delete operations.
[ https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-4849: -- Issue Type: Improvement (was: Bug) Idempotent create, append and delete operations. Key: HDFS-4849 URL: https://issues.apache.org/jira/browse/HDFS-4849 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.4-alpha Reporter: Konstantin Shvachko Assignee: Konstantin Shvachko create, append and delete operations can be made idempotent. This will reduce chances for a job or other app failures when NN fails over. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (HDFS-4184) Add the ability for Client to provide more hint information for DataNode to manage the OS buffer cache more accurate
[ https://issues.apache.org/jira/browse/HDFS-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon resolved HDFS-4184. --- Resolution: Duplicate Add the ability for Client to provide more hint information for DataNode to manage the OS buffer cache more accurate Key: HDFS-4184 URL: https://issues.apache.org/jira/browse/HDFS-4184 Project: Hadoop HDFS Issue Type: New Feature Reporter: binlijin HDFS now has the ability to use posix_fadvise and sync_data_range syscalls to manage the OS buffer cache. {code} When hbase read hlog the data we can set dfs.datanode.drop.cache.behind.reads to true to drop data out of the buffer cache when performing sequential reads. When hbase write hlog we can set dfs.datanode.drop.cache.behind.writes to true to drop data out of the buffer cache after writing When hbase read hfile during compaction we can set dfs.datanode.readahead.bytes to a non-zero value to trigger readahead for sequential reads, and also set dfs.datanode.drop.cache.behind.reads to true to drop data out of the buffer cache when performing sequential reads. and so on... {code} Current we can only set these feature global in datanode,we should set these feature per session. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4847) hdfs dfs -count of a .snapshot directory fails claiming file does not exist
[ https://issues.apache.org/jira/browse/HDFS-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668488#comment-13668488 ] Jing Zhao commented on HDFS-4847: - Yes, I will check the snapshot document to make sure we mention .snapshot is not a valid directory. [~schu], I will mark this as invalid first. Feel free to create a new jira if you think to provide more accurate error msg to end users is necessary. hdfs dfs -count of a .snapshot directory fails claiming file does not exist --- Key: HDFS-4847 URL: https://issues.apache.org/jira/browse/HDFS-4847 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0 Reporter: Stephen Chu Labels: snapshot, snapshots I successfully allow snapshots for /tmp and create three snapshots. I verify that the three snapshots are in /tmp/.snapshot. However, when I attempt _hdfs dfs -count /tmp/.snapshot_ I get a file does not exist exception. Running -count on /tmp finds /tmp successfully. {code} schu-mbp:~ schu$ hadoop fs -ls /tmp/.snapshot 2013-05-24 10:27:10,070 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Found 3 items drwxr-xr-x - schu supergroup 0 2013-05-24 10:26 /tmp/.snapshot/s1 drwxr-xr-x - schu supergroup 0 2013-05-24 10:27 /tmp/.snapshot/s2 drwxr-xr-x - schu supergroup 0 2013-05-24 10:27 /tmp/.snapshot/s3 schu-mbp:~ schu$ hdfs dfs -count /tmp 2013-05-24 10:27:20,510 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 120 0 /tmp schu-mbp:~ schu$ hdfs dfs -count /tmp/.snapshot 2013-05-24 10:27:30,397 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable count: File does not exist: /tmp/.snapshot schu-mbp:~ schu$ hdfs dfs -count -q /tmp/.snapshot 2013-05-24 10:28:23,252 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable count: File does not exist: /tmp/.snapshot schu-mbp:~ schu$ {code} In the NN logs, I see: {code} 2013-05-24 10:27:30,857 INFO [IPC Server handler 6 on 8020] FSNamesystem.audit (FSNamesystem.java:logAuditEvent(6143)) - allowed=true ugi=schu (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/tmp/.snapshot dst=nullperm=null 2013-05-24 10:27:30,891 ERROR [IPC Server handler 7 on 8020] security.UserGroupInformation (UserGroupInformation.java:doAs(1492)) - PriviledgedActionException as:schu (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /tmp/.snapshot 2013-05-24 10:27:30,891 INFO [IPC Server handler 7 on 8020] ipc.Server (Server.java:run(1864)) - IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getContentSummary from 127.0.0.1:49738: error: java.io.FileNotFoundException: File does not exist: /tmp/.snapshot java.io.FileNotFoundException: File does not exist: /tmp/.snapshot at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2267) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:3188) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:829) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:726) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48057) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) at 
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1033) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1842) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1838) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1836) {code} Likewise, the _hdfs dfs du_ command fails with the same problem. Hadoop version: {code} schu-mbp:~ schu$ hadoop version Hadoop
[jira] [Resolved] (HDFS-4847) hdfs dfs -count of a .snapshot directory fails claiming file does not exist
[ https://issues.apache.org/jira/browse/HDFS-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao resolved HDFS-4847. - Resolution: Invalid .snapshot is not a directory thus commands such as count and du do not work on path ending with .snapshot. hdfs dfs -count of a .snapshot directory fails claiming file does not exist --- Key: HDFS-4847 URL: https://issues.apache.org/jira/browse/HDFS-4847 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0 Reporter: Stephen Chu Labels: snapshot, snapshots I successfully allow snapshots for /tmp and create three snapshots. I verify that the three snapshots are in /tmp/.snapshot. However, when I attempt _hdfs dfs -count /tmp/.snapshot_ I get a file does not exist exception. Running -count on /tmp finds /tmp successfully. {code} schu-mbp:~ schu$ hadoop fs -ls /tmp/.snapshot 2013-05-24 10:27:10,070 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Found 3 items drwxr-xr-x - schu supergroup 0 2013-05-24 10:26 /tmp/.snapshot/s1 drwxr-xr-x - schu supergroup 0 2013-05-24 10:27 /tmp/.snapshot/s2 drwxr-xr-x - schu supergroup 0 2013-05-24 10:27 /tmp/.snapshot/s3 schu-mbp:~ schu$ hdfs dfs -count /tmp 2013-05-24 10:27:20,510 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 120 0 /tmp schu-mbp:~ schu$ hdfs dfs -count /tmp/.snapshot 2013-05-24 10:27:30,397 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable count: File does not exist: /tmp/.snapshot schu-mbp:~ schu$ hdfs dfs -count -q /tmp/.snapshot 2013-05-24 10:28:23,252 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable count: File does not exist: /tmp/.snapshot schu-mbp:~ schu$ {code} In the NN logs, I see: {code} 2013-05-24 10:27:30,857 INFO [IPC Server handler 6 on 8020] FSNamesystem.audit (FSNamesystem.java:logAuditEvent(6143)) - allowed=true ugi=schu (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/tmp/.snapshot dst=nullperm=null 2013-05-24 10:27:30,891 ERROR [IPC Server handler 7 on 8020] security.UserGroupInformation (UserGroupInformation.java:doAs(1492)) - PriviledgedActionException as:schu (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /tmp/.snapshot 2013-05-24 10:27:30,891 INFO [IPC Server handler 7 on 8020] ipc.Server (Server.java:run(1864)) - IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getContentSummary from 127.0.0.1:49738: error: java.io.FileNotFoundException: File does not exist: /tmp/.snapshot java.io.FileNotFoundException: File does not exist: /tmp/.snapshot at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2267) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:3188) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:829) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:726) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48057) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) at 
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1033) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1842) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1838) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1836) {code} Likewise, the _hdfs dfs du_ command fails with the same problem. Hadoop version: {code} schu-mbp:~ schu$ hadoop version Hadoop 3.0.0-SNAPSHOT Subversion git://github.com/apache/hadoop-common.git -r ccaf5ea09118eedbe17fd3f5b3f0c516221dd613 Compiled by schu on 2013-05-24T04:45Z From source with checksum
[jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.
[ https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668497#comment-13668497 ] Colin Patrick McCabe commented on HDFS-4849: [~tlipcon] wrote: bq. Without per-file leases, a second thread trying to create a file already being created would end up getting back the same block ID and causing some havoc. One solution to this would be to include a nonce in the create() call, and store that in the INodeFileUnderConstruction, so that if you retry with the same nonce, it would identify it correctly as a retry. If leases are done by inode ID rather than by path, this problem goes away. I don't think delete can be made idempotent without changing the semantics in a major way. At that point, it would be impossible to do things like have FSShell tell you whether your rm actually deleted anything, etc. This isn't to mention things like concat... Idempotent create, append and delete operations. Key: HDFS-4849 URL: https://issues.apache.org/jira/browse/HDFS-4849 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.4-alpha Reporter: Konstantin Shvachko Assignee: Konstantin Shvachko create, append and delete operations can be made idempotent. This will reduce chances for a job or other app failures when NN fails over. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
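The nonce idea from the comment above can be sketched as follows: a retried create() carrying the same nonce is recognized as a retry and gets the same block ID back, while a competing create() with a different nonce is rejected. Names and structure are illustrative, not actual NameNode code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of storing a client-supplied nonce in the under-construction
// record so that an RPC retry of create() is idempotent.
public class CreateWithNonce {
    static class UnderConstruction {
        final long blockId;
        final long nonce;
        UnderConstruction(long blockId, long nonce) {
            this.blockId = blockId;
            this.nonce = nonce;
        }
    }

    private final Map<String, UnderConstruction> open = new HashMap<>();
    private long nextBlockId = 1;

    public synchronized long create(String path, long nonce) {
        UnderConstruction uc = open.get(path);
        if (uc == null) {
            uc = new UnderConstruction(nextBlockId++, nonce);
            open.put(path, uc);
            return uc.blockId;
        }
        if (uc.nonce == nonce) {
            return uc.blockId; // same nonce: a retry, return the same block
        }
        // different nonce: a second client, not a retry
        throw new IllegalStateException("already being created: " + path);
    }
}
```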
[jira] [Updated] (HDFS-4857) Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into INodeMap when loading FSImage
[ https://issues.apache.org/jira/browse/HDFS-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-4857: Attachment: HDFS-4857.002.patch The failed test is also seen in HDFS-4840 and should be due to HDFS-3267. Update the patch to add extra check in the new unit test. Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into INodeMap when loading FSImage -- Key: HDFS-4857 URL: https://issues.apache.org/jira/browse/HDFS-4857 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Labels: snapshots Attachments: HDFS-4857.001.patch, HDFS-4857.002.patch Snapshot.Root, though is a subclass of INodeDirectory, is only used to indicate the root of a snapshot. In the meanwhile, AbstractINodeDiff#snapshotINode is used as copies recording the original state of an INode. Thus we should not put them into INodeMap. Currently when loading FSImage we did not check the type of inode and wrongly put these two types of nodes into INodeMap. This may replace the nodes that should stay in INodeMap. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager
[ https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668529#comment-13668529 ] Colin Patrick McCabe commented on HDFS-4859: If the disk driver hangs on a synchronous {{write(2)}} or {{read(2)}}, it doesn't matter what the Java software did-- the operating system thread will be blocked. This is why we recommended that people soft-mount the NFS directory when using NFS HA. Todd's suggestion is the best, though-- just use QJM. Add timeout in FileJournalManager - Key: HDFS-4859 URL: https://issues.apache.org/jira/browse/HDFS-4859 Project: Hadoop HDFS Issue Type: Bug Components: ha, namenode Affects Versions: 2.0.4-alpha Reporter: Kihwal Lee Due to absence of explicit timeout in FileJournalManager, error conditions that incur long delay (usually until driver timeout) can make namenode unresponsive for long time. This directly affects NN's failure detection latency, which is critical in HA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager
[ https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668534#comment-13668534 ] Kihwal Lee commented on HDFS-4859: -- We will certainly use one of the HA-enabled journal managers in the future, but many users I've talked to want NFS-based as a first step. Even if QJM is used for the shared edits directory, local or NFS may still be used for storing extra copy of edits (as non-required resource). In this case, lack of timeout in FJM can affect HA with manual failover. Can health checks used with ZKFC detect I/O hang? Add timeout in FileJournalManager - Key: HDFS-4859 URL: https://issues.apache.org/jira/browse/HDFS-4859 Project: Hadoop HDFS Issue Type: Bug Components: ha, namenode Affects Versions: 2.0.4-alpha Reporter: Kihwal Lee Due to absence of explicit timeout in FileJournalManager, error conditions that incur long delay (usually until driver timeout) can make namenode unresponsive for long time. This directly affects NN's failure detection latency, which is critical in HA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
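One common way to bound an otherwise unbounded blocking write from Java, relevant to the FileJournalManager timeout discussion above, is to run the I/O on a worker thread and bound the wait. As noted in the thread, a kernel thread stuck in write(2) stays stuck, but the journal caller is at least no longer blocked. This is an illustrative sketch, not FJM code:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Run a potentially hanging write on a worker thread and bound the wait.
public class TimedWriteSketch {
    private static final ExecutorService ioPool =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "journal-io");
                t.setDaemon(true); // do not keep the JVM alive on hang
                return t;
            });

    public static void writeWithTimeout(Runnable write, long timeoutMs)
            throws TimeoutException, InterruptedException, ExecutionException {
        Future<?> f = ioPool.submit(write);
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort: interrupts the worker, but a thread blocked in
            // an uninterruptible syscall remains blocked at the OS level.
            f.cancel(true);
            throw e;
        }
    }
}
```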
[jira] [Updated] (HDFS-4850) OfflineImageViewer fails on fsimage with empty file because of NegativeArraySizeException
[ https://issues.apache.org/jira/browse/HDFS-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-4850: Attachment: HDFS-4850.000.patch Initial patch fixing the reported bug. In my local test fsimage_008 can be read with the fix. Will add more tests for OfflineImageViewer and upload a new patch later. OfflineImageViewer fails on fsimage with empty file because of NegativeArraySizeException - Key: HDFS-4850 URL: https://issues.apache.org/jira/browse/HDFS-4850 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 3.0.0 Reporter: Stephen Chu Assignee: Jing Zhao Labels: newbie Attachments: datadirs.tar.gz, fsimage_004, fsimage_008, HDFS-4850.000.patch, oiv_out_1, oiv_out_2 I deployed hadoop-trunk HDFS and created _/user/schu/_. I then forced a checkpoint, fetched the fsimage, and ran the default OfflineImageViewer successfully on the fsimage. {code} schu-mbp:~ schu$ hdfs oiv -i fsimage_004 -o oiv_out_1 schu-mbp:~ schu$ cat oiv_out_1 drwxr-xr-x - schu supergroup 0 2013-05-24 16:59 / drwxr-xr-x - schu supergroup 0 2013-05-24 16:59 /user drwxr-xr-x - schu supergroup 0 2013-05-24 16:59 /user/schu schu-mbp:~ schu$ {code} I then touched an empty file _/user/schu/testFile1_ {code} schu-mbp:~ schu$ hadoop fs -lsr / lsr: DEPRECATED: Please use 'ls -R' instead. drwxr-xr-x - schu supergroup 0 2013-05-24 16:59 /user drwxr-xr-x - schu supergroup 0 2013-05-24 17:00 /user/schu -rw-r--r-- 1 schu supergroup 0 2013-05-24 17:00 /user/schu/testFile1 {code} and forced another checkpoint, fetched the fsimage, and reran the OfflineImageViewer. I encountered a NegativeArraySizeException: {code} schu-mbp:~ schu$ hdfs oiv -i fsimage_008 -o oiv_out_2 Input ended unexpectedly. 
2013-05-24 17:01:13,622 ERROR [main] offlineImageViewer.OfflineImageViewer (OfflineImageViewer.java:go(140)) - image loading failed at offset 402 Exception in thread "main" java.lang.NegativeArraySizeException at org.apache.hadoop.io.Text.readString(Text.java:458) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processPermission(ImageLoaderCurrent.java:370) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processINode(ImageLoaderCurrent.java:671) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processChildren(ImageLoaderCurrent.java:557) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:464) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:470) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:470) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processLocalNameINodesWithSnapshot(ImageLoaderCurrent.java:444) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processINodes(ImageLoaderCurrent.java:398) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.loadImage(ImageLoaderCurrent.java:199) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer.go(OfflineImageViewer.java:136) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer.main(OfflineImageViewer.java:260) {code} This is reproducible. I've reproduced this scenario after formatting HDFS and restarting and touching an empty file _/testFile1_. Attached are the data dirs, the fsimage before creating the empty file (fsimage_004) and the fsimage afterwards (fsimage_008) and their outputs, oiv_out_1 and oiv_out_2 respectively. The oiv_out_2 does not include the empty _/user/schu/testFile1_. I don't run into this problem using hadoop-2.0.4-alpha. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4850) OfflineImageViewer fails on fsimage with empty file because of NegativeArraySizeException
[ https://issues.apache.org/jira/browse/HDFS-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668517#comment-13668517 ] Jing Zhao commented on HDFS-4850: - Thanks for the testing and report, Stephen! The error should be caused by a bug in ImageLoaderCurrent#processINode(...): {code}if (numBlocks > 0){code} should be {code}if (numBlocks >= 0){code}. I will upload a patch soon. Also, the OfflineImageViewer requires more unit tests to test its correctness with the existence of snapshots in FSImage. I will add those unit tests in the same patch. OfflineImageViewer fails on fsimage with empty file because of NegativeArraySizeException - Key: HDFS-4850 URL: https://issues.apache.org/jira/browse/HDFS-4850 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 3.0.0 Reporter: Stephen Chu Assignee: Jing Zhao Labels: newbie Attachments: datadirs.tar.gz, fsimage_004, fsimage_008, oiv_out_1, oiv_out_2 I deployed hadoop-trunk HDFS and created _/user/schu/_. I then forced a checkpoint, fetched the fsimage, and ran the default OfflineImageViewer successfully on the fsimage. {code} schu-mbp:~ schu$ hdfs oiv -i fsimage_004 -o oiv_out_1 schu-mbp:~ schu$ cat oiv_out_1 drwxr-xr-x - schu supergroup 0 2013-05-24 16:59 / drwxr-xr-x - schu supergroup 0 2013-05-24 16:59 /user drwxr-xr-x - schu supergroup 0 2013-05-24 16:59 /user/schu schu-mbp:~ schu$ {code} I then touched an empty file _/user/schu/testFile1_ {code} schu-mbp:~ schu$ hadoop fs -lsr / lsr: DEPRECATED: Please use 'ls -R' instead. drwxr-xr-x - schu supergroup 0 2013-05-24 16:59 /user drwxr-xr-x - schu supergroup 0 2013-05-24 17:00 /user/schu -rw-r--r-- 1 schu supergroup 0 2013-05-24 17:00 /user/schu/testFile1 {code} and forced another checkpoint, fetched the fsimage, and reran the OfflineImageViewer. I encountered a NegativeArraySizeException: {code} schu-mbp:~ schu$ hdfs oiv -i fsimage_008 -o oiv_out_2 Input ended unexpectedly. 
2013-05-24 17:01:13,622 ERROR [main] offlineImageViewer.OfflineImageViewer (OfflineImageViewer.java:go(140)) - image loading failed at offset 402 Exception in thread "main" java.lang.NegativeArraySizeException at org.apache.hadoop.io.Text.readString(Text.java:458) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processPermission(ImageLoaderCurrent.java:370) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processINode(ImageLoaderCurrent.java:671) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processChildren(ImageLoaderCurrent.java:557) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:464) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:470) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:470) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processLocalNameINodesWithSnapshot(ImageLoaderCurrent.java:444) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processINodes(ImageLoaderCurrent.java:398) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.loadImage(ImageLoaderCurrent.java:199) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer.go(OfflineImageViewer.java:136) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer.main(OfflineImageViewer.java:260) {code} This is reproducible. I've reproduced this scenario after formatting HDFS, restarting, and touching an empty file _/testFile1_. Attached are the data dirs, the fsimage before creating the empty file (fsimage_004), the fsimage afterwards (fsimage_008), and their outputs, oiv_out_1 and oiv_out_2 respectively. The oiv_out_2 does not include the empty _/user/schu/testFile1_. I don't run into this problem using hadoop-2.0.4-alpha. 
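The off-by-one comparison described in the comment above can be shown with a small sketch. This is a hypothetical simplification: the real logic lives in ImageLoaderCurrent#processINode, and the -1 directory marker below is an assumption about the image layout, not quoted from the source.

```java
// Hypothetical simplification of the numBlocks check in
// ImageLoaderCurrent#processINode. An empty file is stored with
// numBlocks == 0, so the block section must still be handled for it;
// the buggy "numBlocks > 0" skipped empty files and left the stream
// mis-positioned, producing the NegativeArraySizeException later.
public class NumBlocksCheck {
    // Returns whether the per-file block section should be processed.
    static boolean shouldProcessBlocks(int numBlocks) {
        // Fixed comparison: include the zero-block (empty file) case.
        // (Assumption for this sketch: a negative count marks a non-file entry.)
        return numBlocks >= 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldProcessBlocks(0));  // empty file
        System.out.println(shouldProcessBlocks(-1)); // non-file entry
    }
}
```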
[jira] [Assigned] (HDFS-4850) OfflineImageViewer fails on fsimage with empty file because of NegativeArraySizeException
[ https://issues.apache.org/jira/browse/HDFS-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao reassigned HDFS-4850: --- Assignee: Jing Zhao OfflineImageViewer fails on fsimage with empty file because of NegativeArraySizeException - Key: HDFS-4850 URL: https://issues.apache.org/jira/browse/HDFS-4850 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 3.0.0 Reporter: Stephen Chu Assignee: Jing Zhao Labels: newbie Attachments: datadirs.tar.gz, fsimage_004, fsimage_008, oiv_out_1, oiv_out_2 I deployed hadoop-trunk HDFS and created _/user/schu/_. I then forced a checkpoint, fetched the fsimage, and ran the default OfflineImageViewer successfully on the fsimage. {code} schu-mbp:~ schu$ hdfs oiv -i fsimage_004 -o oiv_out_1 schu-mbp:~ schu$ cat oiv_out_1 drwxr-xr-x - schu supergroup 0 2013-05-24 16:59 / drwxr-xr-x - schu supergroup 0 2013-05-24 16:59 /user drwxr-xr-x - schu supergroup 0 2013-05-24 16:59 /user/schu schu-mbp:~ schu$ {code} I then touched an empty file _/user/schu/testFile1_ {code} schu-mbp:~ schu$ hadoop fs -lsr / lsr: DEPRECATED: Please use 'ls -R' instead. drwxr-xr-x - schu supergroup 0 2013-05-24 16:59 /user drwxr-xr-x - schu supergroup 0 2013-05-24 17:00 /user/schu -rw-r--r-- 1 schu supergroup 0 2013-05-24 17:00 /user/schu/testFile1 {code} and forced another checkpoint, fetched the fsimage, and reran the OfflineImageViewer. I encountered a NegativeArraySizeException: {code} schu-mbp:~ schu$ hdfs oiv -i fsimage_008 -o oiv_out_2 Input ended unexpectedly. 
2013-05-24 17:01:13,622 ERROR [main] offlineImageViewer.OfflineImageViewer (OfflineImageViewer.java:go(140)) - image loading failed at offset 402 Exception in thread "main" java.lang.NegativeArraySizeException at org.apache.hadoop.io.Text.readString(Text.java:458) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processPermission(ImageLoaderCurrent.java:370) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processINode(ImageLoaderCurrent.java:671) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processChildren(ImageLoaderCurrent.java:557) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:464) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:470) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processDirectoryWithSnapshot(ImageLoaderCurrent.java:470) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processLocalNameINodesWithSnapshot(ImageLoaderCurrent.java:444) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.processINodes(ImageLoaderCurrent.java:398) at org.apache.hadoop.hdfs.tools.offlineImageViewer.ImageLoaderCurrent.loadImage(ImageLoaderCurrent.java:199) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer.go(OfflineImageViewer.java:136) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer.main(OfflineImageViewer.java:260) {code} This is reproducible. I've reproduced this scenario after formatting HDFS, restarting, and touching an empty file _/testFile1_. Attached are the data dirs, the fsimage before creating the empty file (fsimage_004), the fsimage afterwards (fsimage_008), and their outputs, oiv_out_1 and oiv_out_2 respectively. The oiv_out_2 does not include the empty _/user/schu/testFile1_. I don't run into this problem using hadoop-2.0.4-alpha. 
[jira] [Resolved] (HDFS-4855) DFSOutputStream reference should be cleared from DFSClient#filesBeingWritten if the file closure fails.
[ https://issues.apache.org/jira/browse/HDFS-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe resolved HDFS-4855. Resolution: Duplicate DFSOutputStream reference should be cleared from DFSClient#filesBeingWritten if the file closure fails. --- Key: HDFS-4855 URL: https://issues.apache.org/jira/browse/HDFS-4855 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.0.4-alpha Reporter: Vinay Assignee: Vinay Fix For: 2.0.4.1-alpha If the file closure fails due to some exception, then the {{DFSOutputStream}} reference should be removed from {{DFSClient#filesBeingWritten}}; it is useless to keep and consumes memory. If the same client is used for a long time, there is a chance of the client hitting an OOM because of this. The fix would be simple: wrap the whole of {{DFSOutputStream#close()}} in a try-finally and move {{dfsClient.endFileLease(src);}} into the finally block. Any thoughts..? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
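The try-finally shape proposed in HDFS-4855 can be sketched as follows. This is illustrative only: the helper interface and method bodies are invented here; only the close()/endFileLease() structure comes from the issue text.

```java
// Hedged sketch of the proposed fix: ensure the lease/reference cleanup in
// DFSClient runs even when completing the file fails. Names other than
// close() and endFileLease(src) are hypothetical.
public class SketchOutputStream {
    // Hypothetical stand-in for the DFSClient bookkeeping.
    public interface Cleanup { void endFileLease(String src); }

    private final Cleanup client;
    private final String src;

    public SketchOutputStream(Cleanup client, String src) {
        this.client = client;
        this.src = src;
    }

    public void close() throws java.io.IOException {
        try {
            flushAndComplete();  // may throw, e.g. on a pipeline error
        } finally {
            // Moved into finally so filesBeingWritten cannot leak the
            // stream reference when close() fails part-way through.
            client.endFileLease(src);
        }
    }

    // Simulates the failing part of close() for illustration.
    void flushAndComplete() throws java.io.IOException {
        throw new java.io.IOException("simulated close failure");
    }
}
```

Even though flushAndComplete() throws, the finally block guarantees endFileLease() runs, so the stream reference is always released.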
[jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager
[ https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668555#comment-13668555 ] Colin Patrick McCabe commented on HDFS-4859: If the NameNode hangs, ZKFC will detect it. Add timeout in FileJournalManager - Key: HDFS-4859 URL: https://issues.apache.org/jira/browse/HDFS-4859 Project: Hadoop HDFS Issue Type: Bug Components: ha, namenode Affects Versions: 2.0.4-alpha Reporter: Kihwal Lee Due to the absence of an explicit timeout in FileJournalManager, error conditions that incur a long delay (usually until the driver timeout) can make the namenode unresponsive for a long time. This directly affects the NN's failure detection latency, which is critical in HA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4857) Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into INodeMap when loading FSImage
[ https://issues.apache.org/jira/browse/HDFS-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668604#comment-13668604 ] Hadoop QA commented on HDFS-4857: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12585056/HDFS-4857.002.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4445//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4445//console This message is automatically generated. Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into INodeMap when loading FSImage -- Key: HDFS-4857 URL: https://issues.apache.org/jira/browse/HDFS-4857 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Labels: snapshots Attachments: HDFS-4857.001.patch, HDFS-4857.002.patch Snapshot.Root, though is a subclass of INodeDirectory, is only used to indicate the root of a snapshot. 
Meanwhile, AbstractINodeDiff#snapshotINode is used as a copy recording the original state of an INode. Thus we should not put either of them into the INodeMap. Currently, when loading the FSImage, we do not check the inode type and wrongly put these two kinds of nodes into the INodeMap. This may replace nodes that should stay in the INodeMap. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4827) Slight update to the implementation of API for handling favored nodes in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668641#comment-13668641 ] Hudson commented on HDFS-4827: -- Integrated in Hadoop-trunk-Commit #3795 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3795/]) HDFS-4827. Slight update to the implementation of API for handling favored nodes in DFSClient. Contributed by Devaraj Das. (Revision 1487093) Result = SUCCESS ddas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1487093 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java Slight update to the implementation of API for handling favored nodes in DFSClient -- Key: HDFS-4827 URL: https://issues.apache.org/jira/browse/HDFS-4827 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.0.5-beta Reporter: Devaraj Das Assignee: Devaraj Das Fix For: 2.0.5-beta Attachments: hdfs-4827-1.txt Currently, the favoredNodes flavor of the DFSClient.create implementation does a call to _inetSocketAddressInstance.getAddress().getHostAddress()_ This wouldn't work if the inetSocketAddressInstance is unresolved (instance created via InetSocketAddress.createUnresolved()). The DFSClient API should handle both cases of favored-nodes' InetSocketAddresses (resolved/unresolved) passed to it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-4827) Slight update to the implementation of API for handling favored nodes in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj Das updated HDFS-4827: -- Resolution: Fixed Status: Resolved (was: Patch Available) Committed to trunk branch-2. Slight update to the implementation of API for handling favored nodes in DFSClient -- Key: HDFS-4827 URL: https://issues.apache.org/jira/browse/HDFS-4827 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.0.5-beta Reporter: Devaraj Das Assignee: Devaraj Das Fix For: 2.0.5-beta Attachments: hdfs-4827-1.txt Currently, the favoredNodes flavor of the DFSClient.create implementation does a call to _inetSocketAddressInstance.getAddress().getHostAddress()_ This wouldn't work if the inetSocketAddressInstance is unresolved (instance created via InetSocketAddress.createUnresolved()). The DFSClient API should handle both cases of favored-nodes' InetSocketAddresses (resolved/unresolved) passed to it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
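A minimal sketch of handling both resolved and unresolved favored-node addresses, as the issue asks for. The helper class and method names are hypothetical; the underlying behavior is standard java.net.InetSocketAddress semantics, where getAddress() returns null for an instance created via createUnresolved().

```java
import java.net.InetSocketAddress;

// Hypothetical helper illustrating the resolved/unresolved distinction
// described in HDFS-4827. Not the actual DFSClient code.
public class FavoredNodeAddr {
    static String hostAndPort(InetSocketAddress addr) {
        // For an unresolved address getAddress() is null, so calling
        // getAddress().getHostAddress() would throw NPE. Fall back to
        // getHostString(), which never triggers a reverse DNS lookup.
        String host = (addr.getAddress() != null)
                ? addr.getAddress().getHostAddress()
                : addr.getHostString();
        return host + ":" + addr.getPort();
    }
}
```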
[jira] [Created] (HDFS-4861) BlockPlacementPolicyDefault does not consider decommissioning nodes
Kihwal Lee created HDFS-4861: Summary: BlockPlacementPolicyDefault does not consider decommissioning nodes Key: HDFS-4861 URL: https://issues.apache.org/jira/browse/HDFS-4861 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 0.23.7, 2.0.5-beta Reporter: Kihwal Lee getMaxNodesPerRack() calculates the max replicas/rack like this: {code} int maxNodesPerRack = (totalNumOfReplicas-1)/clusterMap.getNumOfRacks()+2; {code} Since this does not consider the racks that are being decommissioned and the decommissioning state is only checked later in isGoodTarget(), certain blocks are not replicated even when there are many racks and nodes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-4861) BlockPlacementPolicyDefault does not consider decommissioning racks
[ https://issues.apache.org/jira/browse/HDFS-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kihwal Lee updated HDFS-4861: - Summary: BlockPlacementPolicyDefault does not consider decommissioning racks (was: BlockPlacementPolicyDefault does not consider decommissioning nodes) BlockPlacementPolicyDefault does not consider decommissioning racks --- Key: HDFS-4861 URL: https://issues.apache.org/jira/browse/HDFS-4861 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 0.23.7, 2.0.5-beta Reporter: Kihwal Lee getMaxNodesPerRack() calculates the max replicas/rack like this: {code} int maxNodesPerRack = (totalNumOfReplicas-1)/clusterMap.getNumOfRacks()+2; {code} Since this does not consider the racks that are being decommissioned and the decommissioning state is only checked later in isGoodTarget(), certain blocks are not replicated even when there are many racks and nodes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
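The quoted getMaxNodesPerRack() arithmetic can be checked directly. The sketch below reproduces only the formula; the surrounding placement logic is omitted. With 3 replicas and 10 racks the cap is 2 per rack, but because the rack count includes decommissioning racks, a cluster where only 1 rack remains usable still gets the 2-per-rack cap instead of the larger cap a usable-rack count would give.

```java
// Reproduces only the formula quoted from BlockPlacementPolicyDefault;
// everything else about placement is omitted.
public class MaxNodesPerRack {
    static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
        return (totalNumOfReplicas - 1) / numOfRacks + 2;
    }

    public static void main(String[] args) {
        // 3 replicas across 10 racks: cap = (3-1)/10 + 2 = 2 per rack.
        System.out.println(maxNodesPerRack(3, 10));
        // If only 1 of those racks is actually usable (the rest are
        // decommissioning), a usable-rack-aware cap would be
        // (3-1)/1 + 2 = 4, letting all 3 replicas land on that rack.
        System.out.println(maxNodesPerRack(3, 1));
    }
}
```

With the cap stuck at 2 and one usable rack, the third replica can never be placed, which matches the under-replication the issue describes.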
[jira] [Commented] (HDFS-4754) Add an API in the namenode to mark a datanode as stale
[ https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668718#comment-13668718 ] Nick Dimiduk commented on HDFS-4754: bq. There is a setting to disable it if needed, by configuring a max duration to zero. Can this be documented somewhere, {{hdfs-default.xml}} for example? Add an API in the namenode to mark a datanode as stale -- Key: HDFS-4754 URL: https://issues.apache.org/jira/browse/HDFS-4754 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-client, namenode Reporter: Nicolas Liochon Assignee: Nicolas Liochon Priority: Critical Attachments: 4754.v1.patch, 4754.v2.patch There is a detection of the stale datanodes in HDFS since HDFS-3703, with a timeout, defaulted to 30s. There are two reasons to add an API to mark a node as stale even if the timeout is not yet reached: 1) ZooKeeper can detect that a client is dead at any moment. So, for HBase, we sometimes start the recovery before a node is marked stale (even with reasonable settings such as: stale: 20s; HBase ZK timeout: 30s). 2) Some third parties could detect that a node is dead before the timeout, hence saving us the cost of retrying. An example of such hardware is Arista, presented here by [~tsuna] http://tsunanet.net/~tsuna/fsf-hbase-meetup-april13.pdf, and confirmed in HBASE-6290. As usual, even if the node is dead it can come back before the 10-minute limit. So I would propose to set a time bound. The API would be namenode.markStale(String ipAddress, int port, long durationInMs); After durationInMs, the namenode would again rely only on its heartbeat to decide. Thoughts? If there are no objections, and if nobody in the hdfs dev team has the time to spend some time on it, I will give it a try for branch 2 & 3. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
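The proposed markStale API could be backed by a simple time-bounded map. This is a hypothetical sketch: the class name and bookkeeping are invented here, and only the markStale(ipAddress, port, durationInMs) signature comes from the proposal text.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical backing for the proposed time-bounded markStale API.
public class StaleMarker {
    // key "ip:port" -> wall-clock time (ms) until which the node is forced stale
    private final Map<String, Long> forcedStaleUntil = new ConcurrentHashMap<>();

    public void markStale(String ipAddress, int port, long durationInMs) {
        forcedStaleUntil.put(ipAddress + ":" + port,
                System.currentTimeMillis() + durationInMs);
    }

    // After the bound expires, staleness would fall back to the normal
    // heartbeat-based tracking, as the proposal describes.
    public boolean isForcedStale(String ipAddress, int port, long nowMs) {
        Long until = forcedStaleUntil.get(ipAddress + ":" + port);
        return until != null && nowMs < until;
    }
}
```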
[jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager
[ https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668772#comment-13668772 ] Kihwal Lee commented on HDFS-4859: -- bq. If the NameNode hangs, ZKFC will detect it. I understand that ZKFC will detect the failures if NN does not respond to RPC calls or the internal resource check fails. If all RPC handlers are waiting for a very long logSync() to finish, this may be detected as well. But if a couple of handlers are in trouble due to an I/O hang and all others are happily serving reads, the error condition may not be detected in time. The situation will be different, of course, if the underlying journal flush can timeout. I think adding a timeout will still be useful since users can run a combination of an HA-JM and FJM. Ideally, NN should be able to detect and exclude failed storages with a predictable/configurable latency, regardless of the underlying implementation. Add timeout in FileJournalManager - Key: HDFS-4859 URL: https://issues.apache.org/jira/browse/HDFS-4859 Project: Hadoop HDFS Issue Type: Bug Components: ha, namenode Affects Versions: 2.0.4-alpha Reporter: Kihwal Lee Due to the absence of an explicit timeout in FileJournalManager, error conditions that incur a long delay (usually until the driver timeout) can make the namenode unresponsive for a long time. This directly affects the NN's failure detection latency, which is critical in HA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-4862) SafeModeInfo.isManual() returns true when resources are low even if it wasn't entered into manually
Ravi Prakash created HDFS-4862: -- Summary: SafeModeInfo.isManual() returns true when resources are low even if it wasn't entered into manually Key: HDFS-4862 URL: https://issues.apache.org/jira/browse/HDFS-4862 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.0.4-alpha, 0.23.7, 3.0.0 Reporter: Ravi Prakash HDFS-1594 changed isManual to this {code} private boolean isManual() { return extension == Integer.MAX_VALUE && !resourcesLow; } {code} One immediate impact of this is that when resources are low, the NN will throw away all block reports from DNs. This is undesirable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
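One way to avoid inferring "manual" from the extension value is to record explicitly how safe mode was entered. This is a sketch with invented names (the real fields live in FSNamesystem's SafeModeInfo); only the quoted inference expression comes from the issue text.

```java
// Sketch contrasting the quoted inference with explicit tracking of how
// safe mode was entered. Field and class names are illustrative.
public class SafeModeInfoSketch {
    static final int MANUAL_EXTENSION = Integer.MAX_VALUE;

    final int extension;
    final boolean resourcesLow;
    final boolean enteredManually; // explicit flag instead of inference

    SafeModeInfoSketch(int extension, boolean resourcesLow, boolean enteredManually) {
        this.extension = extension;
        this.resourcesLow = resourcesLow;
        this.enteredManually = enteredManually;
    }

    // The inference quoted in the report: conflates "manual" with a
    // particular combination of extension and resource state.
    boolean isManualInferred() {
        return extension == MANUAL_EXTENSION && !resourcesLow;
    }

    // Explicit tracking: low-resource (automatic) safe mode is never
    // misclassified, regardless of the extension value.
    boolean isManual() {
        return enteredManually;
    }
}
```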
[jira] [Created] (HDFS-4863) The root directory should be added to the snapshottable directory list while loading fsimage
Jing Zhao created HDFS-4863: --- Summary: The root directory should be added to the snapshottable directory list while loading fsimage Key: HDFS-4863 URL: https://issues.apache.org/jira/browse/HDFS-4863 Project: Hadoop HDFS Issue Type: Bug Reporter: Jing Zhao Assignee: Jing Zhao When the root directory is set as snapshottable, its snapshot quota is changed from 0 to a positive number. While loading fsimage we should check the root's snapshot quota and add it to snapshottable directory list in SnapshotManager if necessary. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-4848) copyFromLocal and renaming a file to .snapshot should output that .snapshot is a reserved name
[ https://issues.apache.org/jira/browse/HDFS-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-4848: Attachment: HDFS-4848.000.patch The uploaded patch catches and re-throws the IllegalNameException when we find that .snapshot is used as the target name. I manually tested it on a local cluster, and with the patch we get the expected ".snapshot is a reserved name" exception message. I will add one more unit test to make sure the rename undo section still works. copyFromLocal and renaming a file to .snapshot should output that .snapshot is a reserved name -- Key: HDFS-4848 URL: https://issues.apache.org/jira/browse/HDFS-4848 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.0.0 Reporter: Stephen Chu Assignee: Jing Zhao Priority: Trivial Labels: snapshot, snapshots Attachments: HDFS-4848.000.patch Might be an unlikely scenario, but if users copyFromLocal a file/dir from local to HDFS and want the file/dir to be renamed to _.snapshot_, then they'll see an Input/output error like the following: {code} schu-mbp:~ schu$ hdfs dfs -copyFromLocal testFile1 /tmp/.snapshot copyFromLocal: rename `/tmp/.snapshot._COPYING_' to `/tmp/.snapshot': Input/output error {code} It'd be clearer if the error output was the usual _.snapshot is a reserved name_ (this does show in the NN logs, though). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-4863) The root directory should be added to the snapshottable directory list while loading fsimage
[ https://issues.apache.org/jira/browse/HDFS-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-4863: Affects Version/s: 3.0.0 The root directory should be added to the snapshottable directory list while loading fsimage - Key: HDFS-4863 URL: https://issues.apache.org/jira/browse/HDFS-4863 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao When the root directory is set as snapshottable, its snapshot quota is changed from 0 to a positive number. While loading fsimage we should check the root's snapshot quota and add it to snapshottable directory list in SnapshotManager if necessary. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-4863) The root directory should be added to the snapshottable directory list while loading fsimage
[ https://issues.apache.org/jira/browse/HDFS-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-4863: Component/s: snapshots The root directory should be added to the snapshottable directory list while loading fsimage - Key: HDFS-4863 URL: https://issues.apache.org/jira/browse/HDFS-4863 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao When the root directory is set as snapshottable, its snapshot quota is changed from 0 to a positive number. While loading fsimage we should check the root's snapshot quota and add it to snapshottable directory list in SnapshotManager if necessary. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-4863) The root directory should be added to the snapshottable directory list while loading fsimage
[ https://issues.apache.org/jira/browse/HDFS-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-4863: Labels: snapshots (was: ) The root directory should be added to the snapshottable directory list while loading fsimage - Key: HDFS-4863 URL: https://issues.apache.org/jira/browse/HDFS-4863 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Labels: snapshots When the root directory is set as snapshottable, its snapshot quota is changed from 0 to a positive number. While loading fsimage we should check the root's snapshot quota and add it to snapshottable directory list in SnapshotManager if necessary. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-4863) The root directory should be added to the snapshottable directory list while loading fsimage
[ https://issues.apache.org/jira/browse/HDFS-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-4863: Attachment: HDFS-4863.001.patch A patch based on HDFS-4857. The root directory should be added to the snapshottable directory list while loading fsimage - Key: HDFS-4863 URL: https://issues.apache.org/jira/browse/HDFS-4863 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Labels: snapshots Attachments: HDFS-4863.001.patch When the root directory is set as snapshottable, its snapshot quota is changed from 0 to a positive number. While loading fsimage we should check the root's snapshot quota and add it to snapshottable directory list in SnapshotManager if necessary. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4847) hdfs dfs -count of a .snapshot directory fails claiming file does not exist
[ https://issues.apache.org/jira/browse/HDFS-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668835#comment-13668835 ] Aaron T. Myers commented on HDFS-4847: -- Hey Jing, not sure I agree with this reasoning. Why shouldn't `hadoop fs -count' work on a '.snapshot' pseudo-directory, just as it does on a real directory? I'd think it would just add up all of the files/space consumed in all of the snapshots under that pseudo-directory and report that back. I'd think that basically all read-only commands should work, much as they do in the special directories '.zfs' and '.snapshot' in ZFS and WAFL, respectively. Also, note that, contrary to your last comment, `hadoop fs -du' does appear to currently work on a '.snapshot' directory: {noformat} $ hadoop fs -du .snapshot 3338 .snapshot/s20130528-165940.694 3338 .snapshot/s20130528-170045.101 3338 .snapshot/s20130528-170828.222 {noformat} Though this output is not quite correct, since in these snapshots only the last one actually contains any files which have non-zero space, but they're all showing 3338 bytes consumed: {noformat} $ hadoop fs -ls .snapshot/* Found 2 items drwxr-xr-x - atm atm 0 2013-05-28 16:56 .snapshot/s20130528-165940.694/bar drwxr-xr-x - atm atm 0 2013-05-28 16:56 .snapshot/s20130528-165940.694/foo Found 2 items drwxr-xr-x - atm atm 0 2013-05-28 16:56 .snapshot/s20130528-170045.101/bar drwxr-xr-x - atm atm 0 2013-05-28 16:56 .snapshot/s20130528-170045.101/foo Found 3 items -rw-r--r-- 1 atm atm 3338 2013-05-28 17:08 .snapshot/s20130528-170828.222/.bashrc drwxr-xr-x - atm atm 0 2013-05-28 16:56 .snapshot/s20130528-170828.222/bar drwxr-xr-x - atm atm 0 2013-05-28 16:56 .snapshot/s20130528-170828.222/foo {noformat} hdfs dfs -count of a .snapshot directory fails claiming file does not exist --- Key: HDFS-4847 URL: https://issues.apache.org/jira/browse/HDFS-4847 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0 Reporter: Stephen Chu 
Labels: snapshot, snapshots I successfully allow snapshots for /tmp and create three snapshots. I verify that the three snapshots are in /tmp/.snapshot. However, when I attempt _hdfs dfs -count /tmp/.snapshot_ I get a file does not exist exception. Running -count on /tmp finds /tmp successfully. {code} schu-mbp:~ schu$ hadoop fs -ls /tmp/.snapshot 2013-05-24 10:27:10,070 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Found 3 items drwxr-xr-x - schu supergroup 0 2013-05-24 10:26 /tmp/.snapshot/s1 drwxr-xr-x - schu supergroup 0 2013-05-24 10:27 /tmp/.snapshot/s2 drwxr-xr-x - schu supergroup 0 2013-05-24 10:27 /tmp/.snapshot/s3 schu-mbp:~ schu$ hdfs dfs -count /tmp 2013-05-24 10:27:20,510 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 120 0 /tmp schu-mbp:~ schu$ hdfs dfs -count /tmp/.snapshot 2013-05-24 10:27:30,397 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable count: File does not exist: /tmp/.snapshot schu-mbp:~ schu$ hdfs dfs -count -q /tmp/.snapshot 2013-05-24 10:28:23,252 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable count: File does not exist: /tmp/.snapshot schu-mbp:~ schu$ {code} In the NN logs, I see: {code} 2013-05-24 10:27:30,857 INFO [IPC Server handler 6 on 8020] FSNamesystem.audit (FSNamesystem.java:logAuditEvent(6143)) - allowed=true ugi=schu (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/tmp/.snapshot dst=nullperm=null 2013-05-24 10:27:30,891 ERROR [IPC Server handler 7 on 8020] security.UserGroupInformation (UserGroupInformation.java:doAs(1492)) - PriviledgedActionException as:schu (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /tmp/.snapshot 2013-05-24 10:27:30,891 INFO [IPC Server handler 7 on 8020] ipc.Server (Server.java:run(1864)) - IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getContentSummary from 127.0.0.1:49738: error: java.io.FileNotFoundException: File does not exist: /tmp/.snapshot java.io.FileNotFoundException: File does not exist: /tmp/.snapshot at
[jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager
[ https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668900#comment-13668900 ] Colin Patrick McCabe commented on HDFS-4859: The Linux kernel doesn't allow you to set a timeout on I/O operations, unless you use O_DIRECT and async I/O. If operations with the local filesystem take longer than you would like, what are you going to have the NameNode do? Kill itself? It can't even kill itself if it is hung on a write, because the process will be in D state, otherwise known as uninterruptible sleep. In this scenario, the NameNode worker thread will be blocked forever, probably while holding the FSImage lock. There is nothing you can do. You can't kill the thread, and even if you could, how would you get the mutex back? There is nothing Java can do when the OS decides your thread cannot run. As for your problem: you can easily set a timeout on NFS operations by using a soft mount plus {{timeo=60}} (or whatever timeout you want). Add timeout in FileJournalManager - Key: HDFS-4859 URL: https://issues.apache.org/jira/browse/HDFS-4859 Project: Hadoop HDFS Issue Type: Bug Components: ha, namenode Affects Versions: 2.0.4-alpha Reporter: Kihwal Lee Due to the absence of an explicit timeout in FileJournalManager, error conditions that incur a long delay (usually until driver timeout) can make the namenode unresponsive for a long time. This directly affects NN's failure detection latency, which is critical in HA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-4832) Namenode doesn't change the number of missing blocks in safemode when DNs rejoin or leave
[ https://issues.apache.org/jira/browse/HDFS-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated HDFS-4832: --- Attachment: HDFS-4832.patch Thanks for your review, Kihwal. I've updated the patch. bq. isInStartupSafeMode() returns true for any auto safe mode. E.g. if the resource checker puts NN in safe mode, it will return true. I have filed HDFS-4862 to fix this. The method name is unfortunately contrary to its behavior. {quote} The existing code drained scheduled work in safe mode, but the patch makes it immediately stop sending scheduled work to DNs. This seems correct behavior for safe mode, but that work can be sent out after leaving safe mode. That may not be ideal. E.g. if NN is suffering from a flaky DNS, DNs will appear dead, come back, and go dead again, generating a lot of invalidation and replication work. Admins may put NN in safe mode to safely pass the storm. When they do, the unnecessary work needs to stop rather than being delayed. Please make sure unintended damage does not occur after leaving safe mode. {quote} UnderReplicatedBlocks is the priority queue maintained for neededReplications, and it is updated when nodes join or are marked dead. However, once BlockManager.computeReplicationWorkForBlocks is called, the ReplicationWork is transferred to the DatanodeDescriptor's replicateBlocks queue, from which it will not be rescinded. computeReplicationWorkForBlocks() is called every replicationRecheckInterval, which defaults to 3 seconds. Can we please handle this in a separate JIRA? Namenode doesn't change the number of missing blocks in safemode when DNs rejoin or leave - Key: HDFS-4832 URL: https://issues.apache.org/jira/browse/HDFS-4832 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta Reporter: Ravi Prakash Assignee: Ravi Prakash Priority: Critical Attachments: HDFS-4832.patch, HDFS-4832.patch, HDFS-4832.patch Courtesy Karri VRK Reddy! {quote} 1. 
Namenode lost datanodes causing missing blocks 2. Namenode was put in safe mode 3. Datanode restarted on dead nodes 4. Waited for lots of time for the NN UI to reflect the recovered blocks. 5. Forced NN out of safe mode and suddenly, no more missing blocks anymore. {quote} I was able to replicate this on 0.23 and trunk. I set dfs.namenode.heartbeat.recheck-interval to 1 and killed the DN to simulate lost datanode. The opposite case also has problems (i.e. Datanode failing when NN is in safemode, doesn't lead to a missing blocks message) Without the NN updating this list of missing blocks, the grid admins will not know when to take the cluster out of safemode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
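Kihwal's concern above, that replication work already handed to a datanode's per-node queue cannot be rescinded, can be sketched with a toy two-stage queue. This is plain Java with hypothetical names, an illustration of the queueing behavior only, not Hadoop code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class TwoStageQueue {
    // Shared pending queue, standing in for neededReplications.
    static Deque<String> needed = new ArrayDeque<>();
    // Per-datanode queue, standing in for DatanodeDescriptor's replicateBlocks.
    static Deque<String> perNode = new ArrayDeque<>();

    // Periodic pass (cf. replicationRecheckInterval, default 3 seconds) that
    // transfers work from the shared queue to a datanode's queue.
    static void computeWork() {
        String w = needed.poll();
        if (w != null) perNode.add(w);
    }

    // Rescinding only succeeds while the item is still in the shared queue.
    static boolean rescind(String w) {
        return needed.remove(w);
    }

    public static void main(String[] args) {
        needed.add("replicate:blk_1");
        computeWork(); // already handed to the datanode's queue
        System.out.println("rescinded: " + rescind("replicate:blk_1"));
        System.out.println("still queued for send: " + perNode.contains("replicate:blk_1"));
    }
}
```

Once `computeWork()` has run, the item is beyond the reach of `rescind()`, which mirrors why work generated before safe mode can still go out after leaving it.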
[jira] [Commented] (HDFS-4832) Namenode doesn't change the number of missing blocks in safemode when DNs rejoin or leave
[ https://issues.apache.org/jira/browse/HDFS-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668973#comment-13668973 ] Hadoop QA commented on HDFS-4832: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12585143/HDFS-4832.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.TestFSNamesystem {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4446//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4446//console This message is automatically generated. Namenode doesn't change the number of missing blocks in safemode when DNs rejoin or leave - Key: HDFS-4832 URL: https://issues.apache.org/jira/browse/HDFS-4832 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta Reporter: Ravi Prakash Assignee: Ravi Prakash Priority: Critical Attachments: HDFS-4832.patch, HDFS-4832.patch, HDFS-4832.patch Courtesy Karri VRK Reddy! {quote} 1. Namenode lost datanodes causing missing blocks 2. Namenode was put in safe mode 3. 
Datanode restarted on dead nodes 4. Waited for lots of time for the NN UI to reflect the recovered blocks. 5. Forced NN out of safe mode and suddenly, no more missing blocks anymore. {quote} I was able to replicate this on 0.23 and trunk. I set dfs.namenode.heartbeat.recheck-interval to 1 and killed the DN to simulate lost datanode. The opposite case also has problems (i.e. Datanode failing when NN is in safemode, doesn't lead to a missing blocks message) Without the NN updating this list of missing blocks, the grid admins will not know when to take the cluster out of safemode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-4842) Snapshot: identify the correct prior snapshot when deleting a snapshot under a renamed subtree
[ https://issues.apache.org/jira/browse/HDFS-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo (Nicholas), SZE updated HDFS-4842: - Hadoop Flags: Reviewed +1 patch looks good. Snapshot: identify the correct prior snapshot when deleting a snapshot under a renamed subtree -- Key: HDFS-4842 URL: https://issues.apache.org/jira/browse/HDFS-4842 Project: Hadoop HDFS Issue Type: Sub-task Components: snapshots Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Attachments: HDFS-4842.000.patch, HDFS-4842.001.patch, HDFS-4842.002.patch In our long-term running tests for snapshot we find the following bug: 1. initially we have directories /test, /test/dir1 and /test/dir2/foo. 2. first take snapshot s0 and s1 on /test. 3. modify some descendant of foo (e.g., delete foo/bar/file), to make sure some changes have been recorded to the snapshot diff associated with s1. 4. take snapshot s2 on /test/dir2 5. move foo from dir2 to dir1, i.e., rename /test/dir2/foo to /test/dir1/foo 6. delete snapshot s1 After step 6, the snapshot copy of foo/bar/file should have been merged from s1 to s0 (i.e., s0 should be identified as the prior snapshot of s1). However, the current code failed to identify the correct prior snapshot in the source tree of the rename operation and wrongly used s2 as the prior snapshot. The bug only exists when nested snapshottable directories are enabled. To fix the bug, we need to go upwards in the source tree of the rename operation (i.e., dir2) to identify the correct prior snapshot in the above scenario. This jira will fix the bug and add several corresponding unit tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.
[ https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668993#comment-13668993 ] Konstantin Shvachko commented on HDFS-4849: --- Given Matthew's comment I think I should have provided more motivation for the issue first. The idea to make these changes comes from the desire to have MR and YARN run jobs without interruption in the HA case. Today, if the NameNode dies and failover to the StandbyNode occurs, some jobs can fail. This mostly depends on whether the NN failure happened during an idempotent or a non-idempotent operation. Idempotent operations, like getBlockLocations or addBlock, are retried and the client will eventually complete such an operation via the StandbyNode, when the SBN becomes active. Non-idempotent operations like create and delete are not retried; they just fail. Therefore, an MR job fails if it tries to create an output file for a reducer or delete a directory at the cleanup stage just at the moment the NN crashes. Whereas if it could retry the create on the SBN, it would have succeeded. So we might need to compromise and loosen the semantics of some HDFS operations in order to satisfy stricter availability and scalability requirements. And we had better do it now before APIs are frozen for branch 2. Idempotent create, append and delete operations. Key: HDFS-4849 URL: https://issues.apache.org/jira/browse/HDFS-4849 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.4-alpha Reporter: Konstantin Shvachko Assignee: Konstantin Shvachko create, append and delete operations can be made idempotent. This will reduce the chances of job or other app failures when the NN fails over. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
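The failure mode Konstantin describes, a retried non-idempotent create failing after failover even though the first attempt was applied, can be shown with a toy in-memory namespace. This is a hypothetical sketch, not Hadoop code; the names and exception type are invented for illustration:

```java
import java.util.HashSet;
import java.util.Set;

public class RetryDemo {
    static Set<String> namespace = new HashSet<>();

    // First attempt: the op is applied server-side but the ack is lost
    // (e.g. the NN fails over before replying).
    static void createButLoseResponse(String path) {
        namespace.add(path);
    }

    // Plain, non-idempotent create: fails if the path already exists.
    static void create(String path) {
        if (!namespace.add(path)) {
            throw new IllegalStateException("already exists: " + path);
        }
    }

    public static void main(String[] args) {
        createButLoseResponse("/out/part-0000");
        boolean retryFailed = false;
        try {
            create("/out/part-0000"); // client retry against the new active NN
        } catch (IllegalStateException e) {
            retryFailed = true;       // the job fails even though the file exists
        }
        System.out.println("retry failed: " + retryFailed);
    }
}
```

The retry fails precisely because the first attempt succeeded, which is why a naive retry policy is safe only for idempotent operations.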
[jira] [Updated] (HDFS-4857) Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into INodeMap when loading FSImage
[ https://issues.apache.org/jira/browse/HDFS-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo (Nicholas), SZE updated HDFS-4857: - Hadoop Flags: Reviewed +1 patch looks good. Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into INodeMap when loading FSImage -- Key: HDFS-4857 URL: https://issues.apache.org/jira/browse/HDFS-4857 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Labels: snapshots Attachments: HDFS-4857.001.patch, HDFS-4857.002.patch Snapshot.Root, though a subclass of INodeDirectory, is only used to indicate the root of a snapshot. Meanwhile, AbstractINodeDiff#snapshotINode is used as a copy recording the original state of an INode. Thus we should not put them into INodeMap. Currently, when loading the FSImage, we do not check the type of inode and wrongly put these two types of nodes into INodeMap. This may replace the nodes that should stay in INodeMap. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
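The hazard described in HDFS-4857, a snapshot copy sharing an id with a live inode and clobbering it in the id-to-inode map, can be sketched with a toy registration guard. This is a hypothetical illustration of the fix direction, not the actual FSImage loader code:

```java
import java.util.HashMap;
import java.util.Map;

public class INodeMapDemo {
    static class Inode {
        final long id;
        final boolean snapshotCopy; // stands in for Snapshot.Root / snapshotINode
        Inode(long id, boolean snapshotCopy) {
            this.id = id;
            this.snapshotCopy = snapshotCopy;
        }
    }

    // Register an inode in the id -> inode map, skipping snapshot copies
    // so they cannot replace the live inode with the same id.
    static void register(Map<Long, Inode> inodeMap, Inode n) {
        if (n.snapshotCopy) {
            return;
        }
        inodeMap.put(n.id, n);
    }

    public static void main(String[] args) {
        Map<Long, Inode> map = new HashMap<>();
        Inode live = new Inode(42, false);
        Inode copy = new Inode(42, true); // snapshot copy shares the id
        register(map, live);
        register(map, copy);              // must not clobber the live inode
        System.out.println("live inode kept: " + (map.get(42L) == live));
    }
}
```

Without the type check in `register`, the second call would silently replace the live inode, which is the bug the patch guards against.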
[jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.
[ https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669000#comment-13669000 ] Konstantin Shvachko commented on HDFS-4849: --- Steve, you got a good example. Rename is the most tricky operation in file systems. And concurrency logic is always an issue in distributed storage. Suppose in current HDFS client1 renames (moves) file A to B when B exists, which should replace B with contents of A. Suppose then that at the same time client2 deletes file B. Since there is no guarantee which operation is executed first you can either end up with A renamed to B, if the delete goes first, or with no files if the rename prevails followed by the deletion of B. This is similar to your case. When client1 retries from its perspective delete was not completed, so it deletes again. And it is not different from the case when client1 is slow and executes delete after rename. Or if there are other clients besides 1 and 2 doing something with /path. My point is that if you need to coordinate clients you should do it with some external tools, like ZK. Idempotent create, append and delete operations. Key: HDFS-4849 URL: https://issues.apache.org/jira/browse/HDFS-4849 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.4-alpha Reporter: Konstantin Shvachko Assignee: Konstantin Shvachko create, append and delete operations can be made idempotent. This will reduce chances for a job or other app failures when NN fails over. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
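The rename/delete race Konstantin walks through above can be reproduced with a toy namespace: depending on which operation the server executes first, the final state is either "B holds A's contents" or "no files". A hypothetical sketch, not HDFS code:

```java
import java.util.HashMap;
import java.util.Map;

public class RaceDemo {
    // Simulate the two possible serializations of client1's rename(A, B)
    // racing client2's delete(B); neither client controls the order.
    static Map<String, String> run(boolean deleteFirst) {
        Map<String, String> ns = new HashMap<>();
        ns.put("/A", "dataA");
        ns.put("/B", "dataB");
        if (deleteFirst) {
            ns.remove("/B");               // client2's delete goes first
            ns.put("/B", ns.remove("/A")); // rename A -> B
        } else {
            ns.put("/B", ns.remove("/A")); // rename overwrites B first
            ns.remove("/B");               // then client2's delete removes it
        }
        return ns;
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // delete first: B ends up with A's data
        System.out.println(run(false)); // rename first: nothing is left
    }
}
```

Both outcomes are legal under the current semantics, which is why coordination between clients has to come from an external tool such as ZooKeeper.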
[jira] [Commented] (HDFS-4863) The root directory should be added to the snapshottable directory list while loading fsimage
[ https://issues.apache.org/jira/browse/HDFS-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669004#comment-13669004 ] Tsz Wo (Nicholas), SZE commented on HDFS-4863: -- How about we remove the first if-statement and combine the non-root case into the second if-statement? I.e.
{code}
if (numSnapshots >= 0) {
  final INodeDirectorySnapshottable snapshottableParent
      = INodeDirectorySnapshottable.valueOf(parent, parent.getLocalName());
  // load snapshots and snapshotQuota
  SnapshotFSImageFormat.loadSnapshotList(snapshottableParent, numSnapshots, in, this);
  if (snapshottableParent.getSnapshotQuota() > 0) {
    this.namesystem.getSnapshotManager().addSnapshottable(snapshottableParent);
  }
}
{code}
The root directory should be added to the snapshottable directory list while loading fsimage - Key: HDFS-4863 URL: https://issues.apache.org/jira/browse/HDFS-4863 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0 Reporter: Jing Zhao Assignee: Jing Zhao Labels: snapshots Attachments: HDFS-4863.001.patch When the root directory is set as snapshottable, its snapshot quota is changed from 0 to a positive number. While loading the fsimage we should check the root's snapshot quota and add it to the snapshottable directory list in SnapshotManager if necessary. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.
[ https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669007#comment-13669007 ] Konstantin Shvachko commented on HDFS-4849: --- A given client may have multiple threads race to create the same file, and those threads would share the same client name (and hence lease). If you go through the FileSystem API that should not happen, because DFSClient.clientName includes the thread name. Yes, I can write a test which directly makes RPC calls to NN and fakes them with the same clientName for different threads. But these are not public HDFS APIs, and there are many other ways to abuse the system. Should we care? the only solution I could come up with is something like NFS's duplicate request cache Great point Todd. This would have been a universal solution for deletes, renames, even concat. Was thinking about it too and thought this would complicate things. I see only one issue with delete that prevents it from being idempotent - it's the return value, which must be true only if the deleted object existed and was actually deleted. This cannot be guaranteed through retries. The semantics of delete should be that _the object does not exist after delete completes_. This seems idempotent to me. The return value should be treated as success or failure. Same as in mkdir. Idempotent create, append and delete operations. Key: HDFS-4849 URL: https://issues.apache.org/jira/browse/HDFS-4849 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.4-alpha Reporter: Konstantin Shvachko Assignee: Konstantin Shvachko create, append and delete operations can be made idempotent. This will reduce chances for a job or other app failures when NN fails over. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
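The return-value problem Konstantin identifies, that delete can only promise "the path is absent afterwards" under retries, not "this call is the one that removed it", can be shown with a toy delete. A hypothetical sketch, not HDFS code:

```java
import java.util.HashSet;
import java.util.Set;

public class DeleteSemantics {
    static Set<String> ns = new HashSet<>();

    // Returns true only if the path existed and was removed by this call.
    static boolean delete(String path) {
        return ns.remove(path);
    }

    public static void main(String[] args) {
        ns.add("/tmp/out");
        boolean first = delete("/tmp/out"); // applied, but suppose the ack is lost
        boolean retry = delete("/tmp/out"); // client retries after failover
        System.out.println(first + " " + retry);
        // The weaker guarantee still holds across both calls:
        System.out.println("absent: " + !ns.contains("/tmp/out"));
    }
}
```

The first call returns true and the retry returns false, so the strong "existed and was deleted by me" contract is unenforceable under retries, while "absent after the call completes" survives them.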
[jira] [Commented] (HDFS-4847) hdfs dfs -count of a .snapshot directory fails claiming file does not exist
[ https://issues.apache.org/jira/browse/HDFS-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669010#comment-13669010 ] Suresh Srinivas commented on HDFS-4847: --- bq. Also, note that, contrary to your last comment, `hadoop fs -du' does appear to currently work on a '.snapshot' directory: I would rather turn it off. I think .snapshot is not a directory. Let's not try to make it behave like one for administrative commands and add complexity. But if anyone wants to pursue it, let's do it in a separate jira. As it stands currently, the functionality expected by this jira is not supported. hdfs dfs -count of a .snapshot directory fails claiming file does not exist --- Key: HDFS-4847 URL: https://issues.apache.org/jira/browse/HDFS-4847 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0 Reporter: Stephen Chu Labels: snapshot, snapshots I successfully allow snapshots for /tmp and create three snapshots. I verify that the three snapshots are in /tmp/.snapshot. However, when I attempt _hdfs dfs -count /tmp/.snapshot_ I get a file does not exist exception. Running -count on /tmp finds /tmp successfully. {code} schu-mbp:~ schu$ hadoop fs -ls /tmp/.snapshot 2013-05-24 10:27:10,070 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Found 3 items drwxr-xr-x - schu supergroup 0 2013-05-24 10:26 /tmp/.snapshot/s1 drwxr-xr-x - schu supergroup 0 2013-05-24 10:27 /tmp/.snapshot/s2 drwxr-xr-x - schu supergroup 0 2013-05-24 10:27 /tmp/.snapshot/s3 schu-mbp:~ schu$ hdfs dfs -count /tmp 2013-05-24 10:27:20,510 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 120 0 /tmp schu-mbp:~ schu$ hdfs dfs -count /tmp/.snapshot 2013-05-24 10:27:30,397 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable count: File does not exist: /tmp/.snapshot schu-mbp:~ schu$ hdfs dfs -count -q /tmp/.snapshot 2013-05-24 10:28:23,252 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable count: File does not exist: /tmp/.snapshot schu-mbp:~ schu$ {code} In the NN logs, I see: {code} 2013-05-24 10:27:30,857 INFO [IPC Server handler 6 on 8020] FSNamesystem.audit (FSNamesystem.java:logAuditEvent(6143)) - allowed=true ugi=schu (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/tmp/.snapshot dst=null perm=null 2013-05-24 10:27:30,891 ERROR [IPC Server handler 7 on 8020] security.UserGroupInformation (UserGroupInformation.java:doAs(1492)) - PriviledgedActionException as:schu (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /tmp/.snapshot 2013-05-24 10:27:30,891 INFO [IPC Server handler 7 on 8020] ipc.Server (Server.java:run(1864)) - IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getContentSummary from 127.0.0.1:49738: error: java.io.FileNotFoundException: File does not exist: /tmp/.snapshot java.io.FileNotFoundException: File does not exist: /tmp/.snapshot at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2267) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:3188) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:829) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:726) at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48057) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1033) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1842) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1838) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489) at
[jira] [Commented] (HDFS-4846) Snapshot CLI commands output stacktrace for invalid arguments
[ https://issues.apache.org/jira/browse/HDFS-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669013#comment-13669013 ] Tsz Wo (Nicholas), SZE commented on HDFS-4846: -- For the admin commands, we indeed should not unwrapRemoteException in DFSClient allowSnapshot()/disallowSnapshot(). Snapshot CLI commands output stacktrace for invalid arguments - Key: HDFS-4846 URL: https://issues.apache.org/jira/browse/HDFS-4846 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0 Reporter: Stephen Chu Assignee: Jing Zhao Priority: Minor Labels: snapshot Attachments: HDFS-4846.001.patch, HDFS-4846.002.patch, HDFS-4846.003.patch It'd be useful to clean up the stacktraces output by the snapshot CLI commands when the commands are used incorrectly. This will make things more readable for operators and hopefully prevent confusion. Allowing a snapshot on a directory that doesn't exist {code} schu-mbp:~ schu$ hdfs dfsadmin -allowSnapshot adfasdf 2013-05-23 15:46:46.052 java[24580:1203] Unable to load realm info from SCDynamicStore 2013-05-23 15:46:46,066 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable allowSnapshot: Directory does not exist: /user/schu/adfasdf at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.valueOf(INodeDirectory.java:52) at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.setSnapshottable(SnapshotManager.java:106) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.allowSnapshot(FSNamesystem.java:5861) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.allowSnapshot(NameNodeRpcServer.java:1121) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.allowSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:932) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48087) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1033) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1842) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1838) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1836) schu-mbp:~ schu$ {code} Disallow a snapshot on a directory that isn't snapshottable {code} schu-mbp:~ schu$ hdfs dfsadmin -disallowSnapshot /user 2013-05-23 15:49:07.251 java[24687:1203] Unable to load realm info from SCDynamicStore 2013-05-23 15:49:07,265 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable disallowSnapshot: Directory is not a snapshottable directory: /user at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.valueOf(INodeDirectorySnapshottable.java:68) at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.resetSnapshottable(SnapshotManager.java:151) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.disallowSnapshot(FSNamesystem.java:5889) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.disallowSnapshot(NameNodeRpcServer.java:1128) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.disallowSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:943) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48089) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1033) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1842) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1838) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1836) {code} Snapshot diffs with non-existent snapshot paths
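The cleanup HDFS-4846 asks for amounts to surfacing a one-line error to the operator instead of a full server-side stack trace. A hypothetical toy sketch of that printing style; the class name, message, and exception type here are invented for illustration and are not the actual DFSClient fix:

```java
public class CliErrorDemo {
    // Stand-in for a server-side failure surfaced to the shell.
    static void failingCommand() {
        throw new IllegalArgumentException(
            "Directory is not a snapshottable directory: /user");
    }

    public static void main(String[] args) {
        try {
            failingCommand();
        } catch (IllegalArgumentException e) {
            // Print a short, operator-friendly line instead of calling
            // e.printStackTrace(); the full trace stays in the server logs.
            System.out.println("disallowSnapshot: " + e.getMessage());
        }
    }
}
```

The operator sees only `disallowSnapshot: Directory is not a snapshottable directory: /user`, matching the readable output the issue description asks for.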
[jira] [Commented] (HDFS-4848) copyFromLocal and renaming a file to .snapshot should output that .snapshot is a reserved name
[ https://issues.apache.org/jira/browse/HDFS-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669015#comment-13669015 ] Tsz Wo (Nicholas), SZE commented on HDFS-4848: -- For /.reserved, FSDirectory.addChild(..) throws HadoopIllegalArgumentException. Let's throw the same exception for .snapshot. copyFromLocal and renaming a file to .snapshot should output that .snapshot is a reserved name -- Key: HDFS-4848 URL: https://issues.apache.org/jira/browse/HDFS-4848 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.0.0 Reporter: Stephen Chu Assignee: Jing Zhao Priority: Trivial Labels: snapshot, snapshots Attachments: HDFS-4848.000.patch Might be an unlikely scenario, but if users copyFromLocal a file/dir from local to HDFS and want the file/dir to be renamed to _.snapshot_, then they'll see an Input/output error like the following: {code} schu-mbp:~ schu$ hdfs dfs -copyFromLocal testFile1 /tmp/.snapshot copyFromLocal: rename `/tmp/.snapshot._COPYING_' to `/tmp/.snapshot': Input/output error {code} It'd be clearer if the error output was the usual _.snapshot is a reserved name_ (this does show in the NN logs, though). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
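The check Nicholas describes, rejecting reserved component names with an exception carrying a clear message, can be sketched as follows. This is a hypothetical toy; the real fix uses HadoopIllegalArgumentException in FSDirectory.addChild(..), while this standalone version uses plain IllegalArgumentException:

```java
import java.util.Arrays;
import java.util.List;

public class ReservedNames {
    static final List<String> RESERVED = Arrays.asList(".snapshot", ".reserved");

    // Reject reserved path components up front with a readable message,
    // rather than letting the failure surface as a generic I/O error.
    static void checkComponent(String name) {
        if (RESERVED.contains(name)) {
            throw new IllegalArgumentException(
                "\"" + name + "\" is a reserved name; it cannot be created or renamed to");
        }
    }

    public static void main(String[] args) {
        try {
            checkComponent(".snapshot");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the check in the path that handles both create and rename targets, `copyFromLocal` to `/tmp/.snapshot` would fail with the reserved-name message instead of `Input/output error`.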
[jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.
[ https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669018#comment-13669018 ] Suresh Srinivas commented on HDFS-4849: --- bq. because DFSClient.clientName includes the thread name... How do you guarantee RPC call ID + client name is unique? bq. The semantics of delete should be that object does not exist after delete completes. This seems idempotent to me. This definition is not complete. Slightly rephrasing: _a uniquely identified object does not exist after delete completes_. In this regard, any deletion that identifies the object using a path, which is not unique, will not work. Between two retries, if another client creates the path being deleted, the second retry could delete a file that should not be deleted. I think the fileID/inodeID recently introduced can make delete idempotent, in cases where the client knows the file ID of the file. This will not work for deletions based on path alone. Idempotent create, append and delete operations. Key: HDFS-4849 URL: https://issues.apache.org/jira/browse/HDFS-4849 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.4-alpha Reporter: Konstantin Shvachko Assignee: Konstantin Shvachko create, append and delete operations can be made idempotent. This will reduce chances for a job or other app failures when NN fails over. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
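Suresh's distinction, that a retried delete-by-path can remove a different file recreated at the same path while a delete keyed on a unique inode id stays safe, can be modeled with a toy path-to-id map. A hypothetical sketch, not HDFS code:

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentDelete {
    static long nextId = 1;
    static Map<String, Long> pathToId = new HashMap<>();

    static long create(String path) {
        long id = nextId++;
        pathToId.put(path, id);
        return id;
    }

    // Removes whatever currently sits at the path.
    static boolean deleteByPath(String path) {
        return pathToId.remove(path) != null;
    }

    // Deletes only if the path still resolves to the inode we meant.
    static boolean deleteById(String path, long expectedId) {
        Long id = pathToId.get(path);
        if (id == null || id != expectedId) return false;
        pathToId.remove(path);
        return true;
    }

    public static void main(String[] args) {
        long oldId = create("/data/f");
        deleteByPath("/data/f"); // first attempt applied; suppose the ack was lost
        create("/data/f");       // another client recreates the same path
        // An id-based retry notices the inode changed and refuses:
        System.out.println("id retry deleted: " + deleteById("/data/f", oldId));
        // A path-based retry silently deletes the other client's new file:
        System.out.println("path retry deleted: " + deleteByPath("/data/f"));
    }
}
```

The id-based variant is safely retryable because the unique id pins down exactly which object the client meant, which is why the inode id could make delete idempotent where the path alone cannot.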
[jira] [Commented] (HDFS-4845) FSEditLogLoader gets NPE while accessing INodeMap in TestEditLogRace
[ https://issues.apache.org/jira/browse/HDFS-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669019#comment-13669019 ] Tsz Wo (Nicholas), SZE commented on HDFS-4845: -- It looks like loadInodeWithLocalName is only invoked during namenode startup, except for tests, so a lock may not be necessary. However I added it for completeness. I think we should add the lock in the tests instead. Otherwise, it would slow down startup. FSEditLogLoader gets NPE while accessing INodeMap in TestEditLogRace Key: HDFS-4845 URL: https://issues.apache.org/jira/browse/HDFS-4845 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.0.0 Reporter: Kihwal Lee Assignee: Arpit Agarwal Priority: Critical Attachments: HDFS-4845.001.patch, HDFS-4845.002.patch TestEditLogRace fails occasionally because it gets NPE from manipulating INodeMap while loading edits. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira