[jira] [Assigned] (HDFS-3265) PowerPc Build error.
[ https://issues.apache.org/jira/browse/HDFS-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-3265:

    Assignee: Kumar Ravi

PowerPc Build error.

Key: HDFS-3265
URL: https://issues.apache.org/jira/browse/HDFS-3265
Project: Hadoop HDFS
Issue Type: Bug
Components: build
Affects Versions: 1.0.2, 1.0.3, 2.0.0
Environment: Linux RHEL 6.1 PowerPC + IBM JVM 6.0 SR10
Reporter: Kumar Ravi
Assignee: Kumar Ravi
Labels: patch
Attachments: HADOOP-8271.patch, HADOOP-8271.patch
Original Estimate: 168h
Remaining Estimate: 168h

When attempting to build branch-1, the following error is seen and ant exits.

[exec] configure: error: Unsupported CPU architecture powerpc64

The following command was used to build hadoop-common:

ant -Dlibhdfs=true -Dcompile.native=true -Dfusedfs=true -Dcompile.c++=true -Dforrest.home=$FORREST_HOME compile-core-native compile-c++ compile-c++-examples task-controller tar record-parser compile-hdfs-classes package -Djava5.home=/opt/ibm/ibm-java2-ppc64-50/

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HDFS-3243) TestParallelRead timing out on jenkins
[ https://issues.apache.org/jira/browse/HDFS-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-3243:

    Assignee: Henry Robinson

TestParallelRead timing out on jenkins

Key: HDFS-3243
URL: https://issues.apache.org/jira/browse/HDFS-3243
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client, test
Reporter: Todd Lipcon
Assignee: Henry Robinson

Trunk builds have been failing recently due to a TestParallelRead timeout. It doesn't show up in the Jenkins failure list because surefire handles timeouts really poorly.
[jira] [Assigned] (HDFS-3159) Document NN auto-failover setup and configuration
[ https://issues.apache.org/jira/browse/HDFS-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-3159:

    Assignee: Todd Lipcon

Document NN auto-failover setup and configuration

Key: HDFS-3159
URL: https://issues.apache.org/jira/browse/HDFS-3159
Project: Hadoop HDFS
Issue Type: Sub-task
Components: auto-failover, documentation, ha
Affects Versions: Auto failover (HDFS-3042)
Reporter: Todd Lipcon
Assignee: Todd Lipcon

We should document how to configure, set up, and monitor an automatic failover setup. This will require adding the new configs to the *-default.xml and adding prose to the apt docs as well.
[jira] [Assigned] (HDFS-3207) Mechanism for HA failover to ignore fencing errors
[ https://issues.apache.org/jira/browse/HDFS-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-3207:

    Assignee: Todd Lipcon

Mechanism for HA failover to ignore fencing errors

Key: HDFS-3207
URL: https://issues.apache.org/jira/browse/HDFS-3207
Project: Hadoop HDFS
Issue Type: New Feature
Components: ha
Affects Versions: HA branch (HDFS-1623)
Reporter: Philip Zeyliger
Assignee: Todd Lipcon

If an administrator wants to ignore fencing (perhaps because he knows for sure that the other namenode has been taken out of commission, but the fencing is ssh-based), there should be a flag to the failover command to allow this.
[jira] [Assigned] (HDFS-3102) Add CLI tool to initialize the shared-edits dir
[ https://issues.apache.org/jira/browse/HDFS-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-3102:

    Assignee: Aaron T. Myers

Add CLI tool to initialize the shared-edits dir

Key: HDFS-3102
URL: https://issues.apache.org/jira/browse/HDFS-3102
Project: Hadoop HDFS
Issue Type: Improvement
Components: ha, name-node
Affects Versions: 0.24.0, 0.23.3
Reporter: Todd Lipcon
Assignee: Aaron T. Myers

Currently, in order to make a non-HA NN HA, you need to initialize the shared edits dir. This can be done manually by copying directories around with cp. It would be preferable to add a namenode -initializeSharedEdits command to achieve the same effect.
[jira] [Assigned] (HDFS-3072) haadmin should have configurable timeouts for failover commands
[ https://issues.apache.org/jira/browse/HDFS-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-3072:

    Assignee: Todd Lipcon

haadmin should have configurable timeouts for failover commands

Key: HDFS-3072
URL: https://issues.apache.org/jira/browse/HDFS-3072
Project: Hadoop HDFS
Issue Type: Improvement
Components: ha
Affects Versions: 0.24.0
Reporter: Philip Zeyliger
Assignee: Todd Lipcon

The HAAdmin failover command should time out reasonably aggressively and go on to the fencing strategies if it's dealing with a mostly dead active namenode. Currently it uses what's probably the default, which is to say no timeout whatsoever.

{code}
  /**
   * Return a proxy to the specified target service.
   */
  protected HAServiceProtocol getProtocol(String serviceId)
      throws IOException {
    String serviceAddr = getServiceAddr(serviceId);
    InetSocketAddress addr = NetUtils.createSocketAddr(serviceAddr);
    return (HAServiceProtocol)RPC.getProxy(
        HAServiceProtocol.class, HAServiceProtocol.versionID,
        addr, getConf());
  }
{code}
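For illustration only, the behaviour requested in HDFS-3072 (give up on a hung call after a bounded wait, then move on to fencing) could be sketched in plain Java as below. This is not the Hadoop code and not necessarily how the issue was fixed; the class name BoundedCall and the method completesWithin are hypothetical, and the sketch bounds the wait on the caller side rather than configuring an RPC timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: bounds how long a failover-related call may block. */
class BoundedCall {

    /**
     * Runs the call on a daemon worker thread and waits at most timeoutMs.
     * Returns true if the call finished in time, false if it timed out or the
     * waiting thread was interrupted. A caller could treat false as the signal
     * to proceed to fencing.
     */
    static boolean completesWithin(Callable<?> call, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-call");
            t.setDaemon(true); // a hung call must not keep the JVM alive
            return t;
        });
        try {
            Future<?> future = executor.submit(call);
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        } catch (ExecutionException e) {
            // The call itself failed; surface the underlying cause.
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdownNow(); // interrupts the worker if it is still running
        }
    }
}
```

A caller would wrap the potentially hanging RPC in a Callable and fall through to the fencing methods when completesWithin returns false. The timeout value itself would presumably come from configuration, as the issue title suggests.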
[jira] [Assigned] (HDFS-3084) FenceMethod.tryFence() and ShellCommandFencer should pass namenodeId as well as host:port
[ https://issues.apache.org/jira/browse/HDFS-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-3084:

    Assignee: Todd Lipcon

FenceMethod.tryFence() and ShellCommandFencer should pass namenodeId as well as host:port

Key: HDFS-3084
URL: https://issues.apache.org/jira/browse/HDFS-3084
Project: Hadoop HDFS
Issue Type: Improvement
Components: ha
Affects Versions: 0.24.0, 0.23.3
Reporter: Philip Zeyliger
Assignee: Todd Lipcon

The FenceMethod interface passes along the host:port of the NN that needs to be fenced. That's great for the common case. However, it's likely necessary to have extra configuration parameters for fencing, and these are typically keyed off the nameserviceId.namenodeId (if for nothing else, for consistency with all the other parameters that are keyed off of nameserviceId.namenodeId). Obviously this can be backed out from the host:port, but it's inconvenient, and requires iterating through all the configs.

The shell interface exhibits the same issue: host:port is great for most fencers, but if you need extra configs (like the host:port of the power supply unit), those are harder to pipe through without the namenodeId.
[jira] [Assigned] (HDFS-3083) HA+security: failed to run a mapred job from yarn after a manual failover
[ https://issues.apache.org/jira/browse/HDFS-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-3083:

    Assignee: Aaron T. Myers

HA+security: failed to run a mapred job from yarn after a manual failover

Key: HDFS-3083
URL: https://issues.apache.org/jira/browse/HDFS-3083
Project: Hadoop HDFS
Issue Type: Bug
Components: ha, security
Affects Versions: 0.24.0, 0.23.3
Reporter: Mingjie Lai
Assignee: Aaron T. Myers
Priority: Critical
Fix For: 0.24.0, 0.23.3

Steps to reproduce:
- turn on HA and security
- run a mapred job, and wait for it to finish
- fail over to another namenode
- run the mapred job again; it fails.

Checking the job delegation token shows that it still indicates the original active namenode, which causes the NM to fail to obtain a DT for the new active NN. (?)

{code}
$ hdfs dfs -cat hdfs://ns1:8020/tmp/hadoop-yarn/staging/yarn/.staging/job_1331619043691_0001/appTokens
HDTS ha-hdfs:ns1@(yarn/nn1.hadoop.local@HADOOP.LOCALDOMAINyarn [binary data] HDFS_DELEGATION_TOKEN ha-hdfs:ns
{code}

Exceptions:

{code}
12/03/13 06:19:44 INFO mapred.ResourceMgrDelegate: Submitted application application_1331619043691_0002 to ResourceManager at nn1.hadoop.local/10.177.23.38:7090
12/03/13 06:19:45 INFO mapreduce.Job: The url to track the job: http://nn1.hadoop.local:7050/proxy/application_1331619043691_0002/
12/03/13 06:19:45 INFO mapreduce.Job: Running job: job_1331619043691_0002
12/03/13 06:19:47 INFO mapreduce.Job: Job job_1331619043691_0002 running in uber mode : false
12/03/13 06:19:47 INFO mapreduce.Job: map 0% reduce 0%
12/03/13 06:19:47 INFO mapreduce.Job: Job job_1331619043691_0002 failed with state FAILED due to: Application application_1331619043691_0002 failed 1 times due to AM Container for appattempt_1331619043691_0002_01 exited with exitCode: -1000 due to: RemoteTrace:
org.apache.hadoop.security.token.SecretManager$InvalidToken: token (HDFS_DELEGATION_TOKEN token 40 for yarn) can't be found in cache
  at org.apache.hadoop.ipc.Client.call(Client.java:1159)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:188)
  at $Proxy28.getFileInfo(Unknown Source)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:622)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
  at $Proxy29.getFileInfo(Unknown Source)
  at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1260)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:718)
  at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:88)
  at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
  at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
  at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
  at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
  at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
  at LocalTrace:
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: token (HDFS_DELEGATION_TOKEN token 40 for yarn) can't be found in cache
  at
[jira] [Assigned] (HDFS-3081) SshFenceByTcpPort uses netcat incorrectly
[ https://issues.apache.org/jira/browse/HDFS-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-3081:

    Assignee: Todd Lipcon

SshFenceByTcpPort uses netcat incorrectly

Key: HDFS-3081
URL: https://issues.apache.org/jira/browse/HDFS-3081
Project: Hadoop HDFS
Issue Type: Bug
Components: ha
Affects Versions: 0.24.0
Reporter: Philip Zeyliger
Assignee: Todd Lipcon

SshFenceByTcpPort currently assumes that the NN is listening on localhost. Typical setups have the namenode listening just on the hostname of the namenode, which would lead nc -z to not catch it.

Here's an example in which the NN is running, listening on 8020, but doesn't respond to localhost 8020.

{noformat}
[root@xxx ~]# lsof -P -p 5286 | grep -i listen
java 5286 root 110u IPv4 1772357 TCP xxx:8020 (LISTEN)
java 5286 root 121u IPv4 1772397 TCP xxx:50070 (LISTEN)
[root@xxx ~]# nc -z localhost 8020
[root@xxx ~]# nc -z xxx 8020
Connection to xxx 8020 port [tcp/intu-ec-svcdisc] succeeded!
{noformat}

Here's the likely offending code:

{code}
LOG.info(
    "Indeterminate response from trying to kill service. " +
    "Verifying whether it is running using nc...");
rc = execCommand(session, "nc -z localhost 8020");
{code}

Naively, we could run netcat against the correct hostname (since the NN ought to be listening on the hostname it's configured as), or just use fuser. Fuser catches ports independently of what IPs they're bound to:

{noformat}
[root@xxx ~]# fuser 1234/tcp
1234/tcp: 6766 6768
[root@xxx ~]# jobs
[1]- Running nc -l localhost 1234
[2]+ Running nc -l rhel56-18.ent.cloudera.com 1234
[root@xxx ~]# sudo lsof -P | grep -i LISTEN | grep -i 1234
nc 6766 root 3u IPv4 2563626 TCP localhost:1234 (LISTEN)
nc 6768 root 3u IPv4 2563671 TCP xxx:1234 (LISTEN)
{noformat}
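As a rough sketch of the first option mentioned in HDFS-3081 (probing the address the NN is configured to listen on instead of a hard-coded localhost 8020), the probe command could be built from the fencing target. The class PortCheckCommand and its method are hypothetical names used only for illustration; this is not the SshFenceByTcpPort code.

```java
/** Hypothetical helper: builds the liveness probe for a fencing target. */
class PortCheckCommand {

    /**
     * Returns an "nc -z" probe aimed at the given host and port, so the check
     * hits the interface the service is bound to rather than localhost.
     */
    static String forTarget(String host, int port) {
        if (host == null || host.isEmpty()) {
            throw new IllegalArgumentException("host must be non-empty");
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        return "nc -z " + host + " " + port;
    }
}
```

With the example from the report, forTarget("xxx", 8020) yields the command that succeeded there, whereas the localhost variant reported nothing. The fuser alternative avoids the question of bind addresses altogether.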
[jira] [Assigned] (HDFS-3062) Fail to submit mapred job on a secured-HA-HDFS: logic URI cannot be picked up by job submission.
[ https://issues.apache.org/jira/browse/HDFS-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-3062:

    Assignee: Mingjie Lai (was: Todd Lipcon)

Fail to submit mapred job on a secured-HA-HDFS: logic URI cannot be picked up by job submission.

Key: HDFS-3062
URL: https://issues.apache.org/jira/browse/HDFS-3062
Project: Hadoop HDFS
Issue Type: Bug
Components: ha, security
Affects Versions: 0.24.0
Reporter: Mingjie Lai
Assignee: Mingjie Lai
Priority: Critical
Fix For: 0.24.0
Attachments: HDFS-3062-trunk.patch

When testing the combination of NN HA + security + yarn, I found that the mapred job submission cannot pick up the logical URI of a nameservice.

I have the logical URI configured in core-site.xml:

{code}
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://ns1</value>
</property>
{code}

The HDFS client can work with the HA deployment/configs:

{code}
[root@nn1 hadoop]# hdfs dfs -ls /
Found 6 items
drwxr-xr-x - hbase hadoop 0 2012-03-07 20:42 /hbase
drwxrwxrwx - yarn hadoop 0 2012-03-07 20:42 /logs
drwxr-xr-x - mapred hadoop 0 2012-03-07 20:42 /mapred
drwxr-xr-x - mapred hadoop 0 2012-03-07 20:42 /mr-history
drwxrwxrwt - hdfs hadoop 0 2012-03-07 21:57 /tmp
drwxr-xr-x - hdfs hadoop 0 2012-03-07 20:42 /user
{code}

but cannot submit a mapred job with security turned on:

{code}
[root@nn1 hadoop]# /usr/lib/hadoop/bin/yarn --config ./conf jar share/hadoop/mapreduce/hadoop-mapreduce-examples-0.24.0-SNAPSHOT.jar randomwriter out
Running 0 maps.
Job started: Wed Mar 07 23:28:23 UTC 2012
java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1
  at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:431)
  at org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:312)
  at org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:217)
  at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:119)
  at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
  at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
  at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
  at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:411)
  at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:326)
  at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1221)
  at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1218)
{code}
[jira] [Assigned] (HDFS-3071) haadmin failover command does not provide enough detail for when target NN is not ready to be active
[ https://issues.apache.org/jira/browse/HDFS-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-3071:

    Assignee: Todd Lipcon

haadmin failover command does not provide enough detail for when target NN is not ready to be active

Key: HDFS-3071
URL: https://issues.apache.org/jira/browse/HDFS-3071
Project: Hadoop HDFS
Issue Type: Improvement
Components: ha
Affects Versions: 0.24.0
Reporter: Philip Zeyliger
Assignee: Todd Lipcon

When running the failover command, you can get an error message like the following:

{quote}
$ hdfs --config $(pwd) haadmin -failover namenode2 namenode1
Failover failed: xxx.yyy/1.2.3.4:8020 is not ready to become active
{quote}

Unfortunately, the error message doesn't describe why that node isn't ready to be active. In my case, the target namenode's logs don't indicate anything either. It turned out that the issue was "Safe mode is ON. Resources are low on NN. Safe mode must be turned off manually.", but ideally the user would be told that at the time of the failover.
[jira] [Assigned] (HDFS-2185) HA: ZK-based FailoverController
[ https://issues.apache.org/jira/browse/HDFS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-2185:

    Assignee: Todd Lipcon (was: Bikas Saha)

HA: ZK-based FailoverController

Key: HDFS-2185
URL: https://issues.apache.org/jira/browse/HDFS-2185
Project: Hadoop HDFS
Issue Type: Sub-task
Components: ha
Affects Versions: HA branch (HDFS-1623)
Reporter: Eli Collins
Assignee: Todd Lipcon

This jira is for a ZK-based FailoverController daemon. The FailoverController is a separate daemon from the NN that does the following:
* Initiates leader election (via ZK) when necessary
* Performs health monitoring (aka failure detection)
* Performs fail-over (standby to active and active to standby transitions)
* Heartbeats to ensure liveness

It should have the same/similar interface as the Linux HA RM to aid pluggability.
[jira] [Assigned] (HDFS-3038) Add FSEditLog.metrics to findbugs exclude list
[ https://issues.apache.org/jira/browse/HDFS-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-3038:

    Assignee: Todd Lipcon

Add FSEditLog.metrics to findbugs exclude list

Key: HDFS-3038
URL: https://issues.apache.org/jira/browse/HDFS-3038
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 0.24.0, 0.23.3
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Trivial
Attachments: hdfs-3038.txt

This is to fix a findbugs error on trunk. The field is only reset by tests, so there is no need to worry about its synchronization.
[jira] [Assigned] (HDFS-2731) HA: Autopopulate standby name dirs if they're empty
[ https://issues.apache.org/jira/browse/HDFS-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-2731:

    Assignee: Todd Lipcon (was: Eli Collins)

HA: Autopopulate standby name dirs if they're empty

Key: HDFS-2731
URL: https://issues.apache.org/jira/browse/HDFS-2731
Project: Hadoop HDFS
Issue Type: Sub-task
Components: ha
Affects Versions: HA branch (HDFS-1623)
Reporter: Eli Collins
Assignee: Todd Lipcon

To set up an SBN we currently format the primary, then manually copy the name dirs to the SBN. The SBN should do this automatically. Specifically, on NN startup, if HA with a shared edits dir is configured and populated, and the SBN has empty name dirs, it should download the image and log from the primary (as an optimization it could copy the logs from the shared dir). If the other NN is still in standby, then it should fail to start as it does currently.
[jira] [Assigned] (HDFS-2948) HA: NN throws NPE during shutdown if it fails to startup
[ https://issues.apache.org/jira/browse/HDFS-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-2948:

    Assignee: Todd Lipcon

HA: NN throws NPE during shutdown if it fails to startup

Key: HDFS-2948
URL: https://issues.apache.org/jira/browse/HDFS-2948
Project: Hadoop HDFS
Issue Type: Sub-task
Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Minor
Attachments: hdfs-2948.txt

Last night's nightly build had a bunch of NPEs thrown in NameNode.stop. Not sure which patch introduced the issue, but the problem is that NameNode.stop() is called if an exception is thrown during startup. If the exception is thrown before the namesystem is created, then NameNode.namesystem is null, and {{namesystem.stop}} throws NPE.
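The failure mode described in HDFS-2948 can be reproduced in miniature. The sketch below is a simplified stand-in, not the NameNode class; GuardedService, Component, and stopCalls are hypothetical names. It only shows the general shape of a stop() that tolerates a field which startup never assigned.

```java
/** Simplified stand-in for a service whose startup can fail part-way. */
class GuardedService {

    /** Minimal component interface standing in for the namesystem. */
    interface Component {
        void stop();
    }

    private Component namesystem; // stays null if startup fails early
    int stopCalls;                // counts how often the component was stopped

    /** Simulates startup; with failEarly the field is never assigned. */
    void start(boolean failEarly) {
        if (failEarly) {
            throw new IllegalStateException("startup failed before namesystem was created");
        }
        namesystem = () -> { stopCalls++; };
    }

    /** Safe to call even if start() threw before creating the component. */
    void stop() {
        if (namesystem != null) {
            namesystem.stop();
        }
    }
}
```

Without the null check, calling stop() after a failed start(true) would throw a NullPointerException and mask the original startup error, which matches the symptom in the report.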
[jira] [Assigned] (HDFS-2920) HA: fix remaining TODO items
[ https://issues.apache.org/jira/browse/HDFS-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-2920:

    Assignee: Todd Lipcon

HA: fix remaining TODO items

Key: HDFS-2920
URL: https://issues.apache.org/jira/browse/HDFS-2920
Project: Hadoop HDFS
Issue Type: Sub-task
Components: ha
Reporter: Eli Collins
Assignee: Todd Lipcon

There are a number of TODO(HA) and TODO:HA comments we need to fix or remove.
[jira] [Assigned] (HDFS-2878) TestBlockRecovery does not compile
[ https://issues.apache.org/jira/browse/HDFS-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-2878:

    Assignee: Todd Lipcon

TestBlockRecovery does not compile

Key: HDFS-2878
URL: https://issues.apache.org/jira/browse/HDFS-2878
Project: Hadoop HDFS
Issue Type: Bug
Components: test
Affects Versions: 0.23.1
Reporter: Eli Collins
Assignee: Todd Lipcon
Priority: Blocker
Attachments: hdfs-2878.txt

Looks like HDFS-2563 introduced a compilation error in TestBlockRecovery. We didn't catch this because of HDFS-2876.
[jira] [Assigned] (HDFS-2935) Shared edits dir property should be suffixed with nameservice and namenodeID
[ https://issues.apache.org/jira/browse/HDFS-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-2935:

    Assignee: Todd Lipcon

Shared edits dir property should be suffixed with nameservice and namenodeID

Key: HDFS-2935
URL: https://issues.apache.org/jira/browse/HDFS-2935
Project: Hadoop HDFS
Issue Type: Sub-task
Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Vinithra Varadharajan
Assignee: Todd Lipcon

Similar to the NameNode's name dirs, we should also be able to specify the shared edits dir as dfs.namenode.shared.edits.dir.nameserviceId.nnId.
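To make the suffixing idea in HDFS-2935 concrete, here is a minimal sketch of a lookup that prefers the most specific key. It uses a plain Map rather than Hadoop's Configuration, and the fallback order (key.nsId.nnId, then key.nsId, then the bare key) is an assumption made for this example, not something stated in the issue. SuffixedConfig and lookup are hypothetical names.

```java
import java.util.Map;

/** Hypothetical helper: resolves a property by nameservice/namenode suffix. */
class SuffixedConfig {

    static final String SHARED_EDITS_KEY = "dfs.namenode.shared.edits.dir";

    /**
     * Tries key.nsId.nnId first, then key.nsId, then the bare key.
     * Returns null if none of the three is set.
     * (The fallback order is an assumption of this sketch.)
     */
    static String lookup(Map<String, String> conf, String key, String nsId, String nnId) {
        String[] candidates = {
            key + "." + nsId + "." + nnId,
            key + "." + nsId,
            key
        };
        for (String candidate : candidates) {
            String value = conf.get(candidate);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

With this shape, a cluster-wide default and a per-namenode override can live in the same configuration file, which is the convenience the name dirs already offer according to the issue.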
[jira] [Assigned] (HDFS-2924) Standby checkpointing fails to authenticate in secure cluster
[ https://issues.apache.org/jira/browse/HDFS-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-2924:

    Assignee: Todd Lipcon

Standby checkpointing fails to authenticate in secure cluster

Key: HDFS-2924
URL: https://issues.apache.org/jira/browse/HDFS-2924
Project: Hadoop HDFS
Issue Type: Sub-task
Components: ha, name-node, security
Affects Versions: HA branch (HDFS-1623)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical

When running HA on a secure cluster, the SBN checkpointing process doesn't seem to pick up the keytab-based credentials for its RPC connection to the active. I think we're just missing a doAs() in the right spot.
[jira] [Assigned] (HDFS-2794) HA: Active NN may purge edit log files before standby NN has a chance to read them
[ https://issues.apache.org/jira/browse/HDFS-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-2794:

    Assignee: Todd Lipcon (was: Aaron T. Myers)

HA: Active NN may purge edit log files before standby NN has a chance to read them

Key: HDFS-2794
URL: https://issues.apache.org/jira/browse/HDFS-2794
Project: Hadoop HDFS
Issue Type: Sub-task
Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Aaron T. Myers
Assignee: Todd Lipcon

Given that the active NN is solely responsible for purging finalized edit log segments, and given that the active NN has no way of knowing when the standby reads edit logs, it's possible that the standby NN could fail to read all edits it needs before the active purges the files.
[jira] [Assigned] (HDFS-2874) HA: edit log should log to shared dirs before local dirs
[ https://issues.apache.org/jira/browse/HDFS-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-2874:

    Assignee: Todd Lipcon

HA: edit log should log to shared dirs before local dirs

Key: HDFS-2874
URL: https://issues.apache.org/jira/browse/HDFS-2874
Project: Hadoop HDFS
Issue Type: Sub-task
Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical

Currently, the NN logs its edits to each of its edits directories in sequence. This can produce the following bad sequence:
- NN accumulates 100 edits (tx 1-100) in the buffer. It writes and syncs to the local drive, then crashes.
- Failover occurs. SBN takes over at txid=1, since txid 1 never got written to the shared edits dir.
- First NN restarts. It reads up to txid 100 from its local directories. It is now ahead of the active NN, with inconsistent state.

The solution is to write to the shared edits dir, and sync that, before writing to any local drives.
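The ordering proposed in HDFS-2874 can be sketched as a sort that puts shared directories ahead of local ones before any write or sync happens. This is an illustration only, not the FSEditLog code; JournalOrdering, Journal, and the example paths are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: decides the order in which edits dirs are written. */
class JournalOrdering {

    /** An edits directory plus a flag saying whether it is the shared one. */
    static final class Journal {
        final String dir;
        final boolean shared;

        Journal(String dir, boolean shared) {
            this.dir = dir;
            this.shared = shared;
        }
    }

    /**
     * Returns a copy with shared journals first. List.sort is stable, so the
     * configured order among shared dirs and among local dirs is preserved.
     */
    static List<Journal> sharedFirst(List<Journal> journals) {
        List<Journal> ordered = new ArrayList<>(journals);
        ordered.sort(Comparator.comparingInt(j -> j.shared ? 0 : 1));
        return ordered;
    }
}
```

Writing and syncing in this order means a crash can leave the local dirs behind the shared dir, but never ahead of it, which is the property the bad sequence above violates.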
[jira] [Assigned] (HDFS-2185) HA: ZK-based FailoverController
[ https://issues.apache.org/jira/browse/HDFS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-2185:

    Assignee: Bikas Saha (was: Todd Lipcon)

HA: ZK-based FailoverController

Key: HDFS-2185
URL: https://issues.apache.org/jira/browse/HDFS-2185
Project: Hadoop HDFS
Issue Type: Sub-task
Components: ha
Affects Versions: HA branch (HDFS-1623)
Reporter: Eli Collins
Assignee: Bikas Saha

This jira is for a ZK-based FailoverController daemon. The FailoverController is a separate daemon from the NN that does the following:
* Initiates leader election (via ZK) when necessary
* Performs health monitoring (aka failure detection)
* Performs fail-over (standby to active and active to standby transitions)
* Heartbeats to ensure liveness

It should have the same/similar interface as the Linux HA RM to aid pluggability.
[jira] [Assigned] (HDFS-2579) Starting delegation token manager during safemode fails
[ https://issues.apache.org/jira/browse/HDFS-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HDFS-2579:

    Assignee: Todd Lipcon

Starting delegation token manager during safemode fails

Key: HDFS-2579
URL: https://issues.apache.org/jira/browse/HDFS-2579
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node, security
Affects Versions: 0.23.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon

I noticed this on the HA branch, but it seems to actually affect non-HA branch 0.23 if security is enabled. When the NN starts up, if security is enabled, we start the delegation token secret manager, which then tries to call {{logUpdateMasterKey}}. This fails because the edit logs may not be written while in safe-mode.

It seems to me that there's not any necessary reason that you have to make a new master key at startup, since you've loaded the old key when you load the FSImage. You'd only be lacking a DT master key on a fresh cluster, in which case we could have it generate one at format time.
[jira] [Assigned] (HDFS-2810) Leases not properly getting renewed by clients
[ https://issues.apache.org/jira/browse/HDFS-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2810: - Assignee: Todd Lipcon Leases not properly getting renewed by clients -- Key: HDFS-2810 URL: https://issues.apache.org/jira/browse/HDFS-2810 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client Affects Versions: 0.23.0, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical We've been testing HBase on clusters running trunk and seen an issue where they seem to lose their HDFS leases after a couple of hours of runtime. We don't quite have enough data to understand what's happening, but the NN is expiring them, claiming the hard lease period has elapsed. The clients report no error until their output stream gets killed underneath them.
[jira] [Assigned] (HDFS-2803) Adding logging to LeaseRenewer for better lease expiration triage.
[ https://issues.apache.org/jira/browse/HDFS-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2803: - Assignee: Jimmy Xiang Adding logging to LeaseRenewer for better lease expiration triage. -- Key: HDFS-2803 URL: https://issues.apache.org/jira/browse/HDFS-2803 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor Labels: newbie Attachments: hdfs-2803.txt It will be helpful to add some logging to LeaseRenewer when the daemon is terminated (Info level), and when the lease is renewed (Debug level). Without this logging, it is hard to tell whether a DFS client fails to renew its lease because it hangs or because the lease renewer daemon has somehow gone away.
[jira] [Assigned] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up
[ https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2795: - Assignee: Todd Lipcon (was: Aaron T. Myers) HA: Standby NN takes a long time to recover from a dead DN starting up -- Key: HDFS-2795 URL: https://issues.apache.org/jira/browse/HDFS-2795 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node, ha, name-node Affects Versions: HA branch (HDFS-1623) Reporter: Aaron T. Myers Assignee: Todd Lipcon Priority: Critical
To reproduce:
# Start an HA cluster with a DN.
# Write several blocks to the FS with replication 1.
# Shut down the DN.
# Wait for the NNs to declare the DN dead. All blocks will be under-replicated.
# Restart the DN.
Note that upon restarting the DN, the active NN will immediately get all block locations from the initial BR. The standby NN will not, and instead will slowly add block locations for a subset of the previously-missing blocks on every DN heartbeat.
[jira] [Assigned] (HDFS-2767) HA: ConfiguredFailoverProxyProvider should support NameNodeProtocol
[ https://issues.apache.org/jira/browse/HDFS-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2767: - Assignee: Todd Lipcon HA: ConfiguredFailoverProxyProvider should support NameNodeProtocol --- Key: HDFS-2767 URL: https://issues.apache.org/jira/browse/HDFS-2767 Project: Hadoop HDFS Issue Type: Sub-task Components: ha, hdfs client Affects Versions: HA branch (HDFS-1623) Reporter: Uma Maheswara Rao G Assignee: Todd Lipcon Priority: Blocker Presently ConfiguredFailoverProxyProvider supports ClientProtocol. It should also support NameNodeProtocol, because the Balancer uses NameNodeProtocol for getting blocks.
[jira] [Assigned] (HDFS-2724) NN web UI can throw NPE after startup, before standby state is entered
[ https://issues.apache.org/jira/browse/HDFS-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2724: - Assignee: Todd Lipcon (was: Eli Collins) NN web UI can throw NPE after startup, before standby state is entered -- Key: HDFS-2724 URL: https://issues.apache.org/jira/browse/HDFS-2724 Project: Hadoop HDFS Issue Type: Sub-task Components: ha, name-node Affects Versions: HA branch (HDFS-1623) Reporter: Aaron T. Myers Assignee: Todd Lipcon There's a brief period of time (a few seconds) after the NN web server has been initialized, but before the NN's HA state is initialized. If {{dfshealth.jsp}} is hit during this time, a {{NullPointerException}} will be thrown.
[jira] [Assigned] (HDFS-2733) Document HA configuration and CLI
[ https://issues.apache.org/jira/browse/HDFS-2733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2733: - Assignee: Todd Lipcon Document HA configuration and CLI - Key: HDFS-2733 URL: https://issues.apache.org/jira/browse/HDFS-2733 Project: Hadoop HDFS Issue Type: Sub-task Components: documentation, ha Affects Versions: HA branch (HDFS-1623) Reporter: Eli Collins Assignee: Todd Lipcon We need to document the configuration changes in HDFS-2231 and the new CLI introduced by HADOOP-7774.
[jira] [Assigned] (HDFS-2762) TestCheckpoint is timing out
[ https://issues.apache.org/jira/browse/HDFS-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2762: - Assignee: Uma Maheswara Rao G (was: Todd Lipcon) TestCheckpoint is timing out Key: HDFS-2762 URL: https://issues.apache.org/jira/browse/HDFS-2762 Project: Hadoop HDFS Issue Type: Sub-task Components: ha, name-node Affects Versions: HA branch (HDFS-1623) Reporter: Aaron T. Myers Assignee: Uma Maheswara Rao G TestCheckpoint is timing out on the HA branch, and has been for a few days.
[jira] [Assigned] (HDFS-2738) FSEditLog.selectinputStreams is reading through in-progress streams even when non-in-progress are requested
[ https://issues.apache.org/jira/browse/HDFS-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2738: - Assignee: Aaron T. Myers (was: Todd Lipcon) FSEditLog.selectinputStreams is reading through in-progress streams even when non-in-progress are requested --- Key: HDFS-2738 URL: https://issues.apache.org/jira/browse/HDFS-2738 Project: Hadoop HDFS Issue Type: Sub-task Components: ha, name-node Affects Versions: HA branch (HDFS-1623) Reporter: Todd Lipcon Assignee: Aaron T. Myers Priority: Critical The new code in HDFS-1580 is causing an issue with selectInputStreams in the HA context. When the active is writing to the shared edits, selectInputStreams is called on the standby. This ends up calling {{journalSet.getInputStream}} but doesn't pass the {{inProgressOk=false}} flag. So, {{getInputStream}} ends up reading and validating the in-progress stream unnecessarily. Since the validation results are no longer properly cached, {{findMaxTransaction}} also re-validates the in-progress stream, and then breaks the corruption check in this code. The end result is a lot of errors like:
2011-12-30 16:45:02,521 ERROR namenode.FileJournalManager (FileJournalManager.java:getNumberOfTransactions(266)) - Gap in transactions, max txnid is 579, 0 txns from 578
2011-12-30 16:45:02,521 INFO ha.EditLogTailer (EditLogTailer.java:run(163)) - Got error, will try again.
java.io.IOException: No non-corrupt logs for txid 578
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.getInputStream(JournalSet.java:229)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1081)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:115)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$0(EditLogTailer.java:100)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:154)
[jira] [Assigned] (HDFS-2693) Synchronization issues around state transition
[ https://issues.apache.org/jira/browse/HDFS-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2693: - Assignee: Todd Lipcon Synchronization issues around state transition -- Key: HDFS-2693 URL: https://issues.apache.org/jira/browse/HDFS-2693 Project: Hadoop HDFS Issue Type: Sub-task Components: ha, name-node Affects Versions: HA branch (HDFS-1623) Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Currently when the NN changes state, it does so without synchronization. In particular, the state transition function does: (1) leave old state (2) change state variable (3) enter new state This means that the NN is marked as active before it has actually transitioned to active mode and opened its edit logs. This gives a window where write transactions can come in and the {{checkOperation}} allows them, but then they fail because the edit log is not yet opened.
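The ordering problem described in HDFS-2693 can be illustrated with a minimal sketch. This is not the NameNode code and not necessarily the fix that was committed: the class `HaStateSketch`, its fields, and `tryWrite` are hypothetical stand-ins. The sketch assumes one possible remedy, namely running the whole transition under a write lock and publishing the new state only after it has been fully entered, while operations check the state under the matching read lock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the NN's HA state handling; not Hadoop code.
class HaStateSketch {
    enum State { STANDBY, ACTIVE }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private State state = State.STANDBY;
    private boolean editLogOpen = false;

    // Runs all three steps under the write lock, and only publishes the new
    // state after the new state has been fully entered.
    void transitionTo(State next) {
        lock.writeLock().lock();
        try {
            exitState(state);   // (1) leave old state
            enterState(next);   // (3) enter new state, e.g. open the edit log
            state = next;       // (2) change the state variable last
        } finally {
            lock.writeLock().unlock();
        }
    }

    private void exitState(State s) {
        if (s == State.ACTIVE) {
            editLogOpen = false;
        }
    }

    private void enterState(State s) {
        if (s == State.ACTIVE) {
            editLogOpen = true;
        }
    }

    // Analogue of checkOperation followed by a write. It holds the read lock,
    // so it can never observe a half-finished transition.
    boolean tryWrite() {
        lock.readLock().lock();
        try {
            if (state != State.ACTIVE) {
                return false; // rejected: not active
            }
            if (!editLogOpen) {
                throw new IllegalStateException("active but edit log not open");
            }
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        HaStateSketch nn = new HaStateSketch();
        System.out.println(nn.tryWrite()); // false: still standby
        nn.transitionTo(State.ACTIVE);
        System.out.println(nn.tryWrite()); // true: active with the log open
    }
}
```

With the unsynchronized ordering from the report, a writer arriving between steps (2) and (3) would pass the state check and then fail on the unopened log; in the sketch that window cannot be observed.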
[jira] [Assigned] (HDFS-1314) dfs.block.size accepts only absolute value
[ https://issues.apache.org/jira/browse/HDFS-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-1314: - Assignee: Sho Shimauchi (was: Anwar Abdus-Samad) Reassigning to Sho since there hasn't been any progress on this and he wanted to try it. dfs.block.size accepts only absolute value -- Key: HDFS-1314 URL: https://issues.apache.org/jira/browse/HDFS-1314 Project: Hadoop HDFS Issue Type: Bug Reporter: Karim Saadah Assignee: Sho Shimauchi Priority: Minor Labels: newbie Using dfs.block.size=8388608 works but dfs.block.size=8mb does not. Using dfs.block.size=8mb should throw some WARNING on NumberFormatException. (http://pastebin.corp.yahoo.com/56129)
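For illustration, here is a minimal sketch of the kind of parsing HDFS-1314 asks for, so that both 8388608 and 8mb yield the same byte count. The class `BlockSizeParser` and the method `parseBytes` are hypothetical names, not Hadoop APIs, and the accepted suffixes (k, m, g, t, with an optional trailing b, all binary multiples) are an assumption rather than something the issue specifies.

```java
import java.util.Locale;

// Hypothetical helper; not part of Hadoop's Configuration API.
class BlockSizeParser {
    // Parses "8388608", "8m", "8mb", "64K", and so on into a byte count.
    // Throws NumberFormatException when the numeric part is not a valid long.
    static long parseBytes(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.endsWith("b")) {
            v = v.substring(0, v.length() - 1); // optional trailing "b"
        }
        long multiplier = 1L;
        if (!v.isEmpty()) {
            // Position in "kmgt" gives the power of 1024: k=1, m=2, g=3, t=4.
            int unit = "kmgt".indexOf(v.charAt(v.length() - 1));
            if (unit >= 0) {
                multiplier = 1L << (10 * (unit + 1));
                v = v.substring(0, v.length() - 1);
            }
        }
        // multiplyExact throws ArithmeticException rather than overflowing silently.
        return Math.multiplyExact(Long.parseLong(v.trim()), multiplier);
    }

    public static void main(String[] args) {
        System.out.println(parseBytes("8388608")); // 8388608
        System.out.println(parseBytes("8mb"));     // 8388608
    }
}
```

A value such as "eight" still raises NumberFormatException, which a caller could catch and log as the warning the report requests.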
[jira] [Assigned] (HDFS-2626) HA: BPOfferService.verifyAndSetNamespaceInfo needs to be synchronized
[ https://issues.apache.org/jira/browse/HDFS-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2626: - Assignee: Todd Lipcon HA: BPOfferService.verifyAndSetNamespaceInfo needs to be synchronized - Key: HDFS-2626 URL: https://issues.apache.org/jira/browse/HDFS-2626 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hdfs-2626.txt When starting an HA blockpool with both namenodes up, I often see an NPE, referenced by one of the TODOs. The issue is the following interleaving:
- first BPActor registers, and sets bpNSInfo in BPOfferService. It then proceeds to initFsDataset which takes a little bit of time
- second BPActor registers, and sees bpNSInfo is non-null, then proceeds to heartbeat loop. Meanwhile BPActor 1 is still initting FSDataset
- second BPActor gets an NPE on first heartbeat since fsdataset is still null.
We just need to synchronize that function to fix the NPE.
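The interleaving in HDFS-2626 can be reproduced in miniature. The sketch below is not the BPOfferService code: `BlockPoolSketch`, its fields, and `runTwoActors` are hypothetical, and the 50 ms sleep merely stands in for a slow dataset initialization. It shows why marking the method `synchronized` is enough: the second caller cannot get past registration until the first has finished initializing the dataset.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical miniature of the race; not the real BPOfferService.
class BlockPoolSketch {
    private String nsInfo;   // stands in for bpNSInfo
    private Object dataset;  // stands in for the FSDataset

    // synchronized: a second caller waits here until the first caller has
    // finished both setting nsInfo and initializing the dataset.
    synchronized void verifyAndSetNamespaceInfo(String info) {
        if (nsInfo == null) {
            nsInfo = info;
            sleepQuietly(50);        // simulated slow dataset initialization
            dataset = new Object();
        } else if (!nsInfo.equals(info)) {
            throw new IllegalStateException("namespace info mismatch");
        }
    }

    synchronized boolean datasetReady() {
        return dataset != null;
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Starts two "actors" that register and then check the dataset, and
    // returns how many of them saw a null dataset after registering.
    static int runTwoActors() {
        BlockPoolSketch bp = new BlockPoolSketch();
        AtomicInteger sawNull = new AtomicInteger();
        Runnable actor = () -> {
            bp.verifyAndSetNamespaceInfo("ns-1");
            if (!bp.datasetReady()) {
                sawNull.incrementAndGet();
            }
        };
        Thread first = new Thread(actor);
        Thread second = new Thread(actor);
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sawNull.get();
    }

    public static void main(String[] args) {
        System.out.println("actors that saw a null dataset: " + runTwoActors());
    }
}
```

Removing `synchronized` from `verifyAndSetNamespaceInfo` reopens the window in which the second actor can observe a non-null `nsInfo` together with a null `dataset`.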
[jira] [Assigned] (HDFS-1972) HA: Datanode fencing mechanism
[ https://issues.apache.org/jira/browse/HDFS-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-1972: - Assignee: Todd Lipcon (was: Suresh Srinivas) Going to assign this to myself since I'm actively working on it. Suresh, if you had started with any design or code, feel free to post what you've got and I'll take it from there. HA: Datanode fencing mechanism -- Key: HDFS-1972 URL: https://issues.apache.org/jira/browse/HDFS-1972 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node, name-node Reporter: Suresh Srinivas Assignee: Todd Lipcon In a high availability setup, with an active and a standby namenode, there is a possibility of two namenodes sending commands to the datanode. The datanode must honor commands from only the active namenode and reject commands from the standby, to prevent corruption. This invariant must hold during failover and in other states such as split brain. This jira addresses issues related to this, the design of the solution, and its implementation.
[jira] [Assigned] (HDFS-2185) HA: ZK-based FailoverController
[ https://issues.apache.org/jira/browse/HDFS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2185: - Assignee: Todd Lipcon (was: Eli Collins) HA: ZK-based FailoverController --- Key: HDFS-2185 URL: https://issues.apache.org/jira/browse/HDFS-2185 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Eli Collins Assignee: Todd Lipcon This jira is for a ZK-based FailoverController daemon. The FailoverController is a separate daemon from the NN that does the following: * Initiates leader election (via ZK) when necessary * Performs health monitoring (aka failure detection) * Performs fail-over (standby to active and active to standby transitions) * Heartbeats to ensure the liveness It should have the same/similar interface as the Linux HA RM to aid pluggability.
[jira] [Assigned] (HDFS-2130) Switch default checksum to CRC32C
[ https://issues.apache.org/jira/browse/HDFS-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2130: - Assignee: Todd Lipcon Switch default checksum to CRC32C - Key: HDFS-2130 URL: https://issues.apache.org/jira/browse/HDFS-2130 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs client Reporter: Todd Lipcon Assignee: Todd Lipcon Once the other subtasks/parts of HDFS-2080 are complete, CRC32C will be a much more efficient checksum algorithm than CRC32. Hence we should change the default checksum to CRC32C. However, in order to continue to support append against blocks created with the old checksum, we will need to implement some kind of handshaking in the write pipeline.
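To make the compatibility concern in HDFS-2130 concrete, the sketch below computes both checksums over the same bytes. It uses the JDK's `java.util.zip.CRC32` and `java.util.zip.CRC32C` (the latter requires Java 9 or later) rather than Hadoop's own checksum code, and the class name `ChecksumSketch` is hypothetical. It says nothing about relative speed; it only shows that the two algorithms produce different values for identical data, which is why a writer appending to an existing block has to agree with the pipeline on which algorithm that block uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

// Hypothetical demo class; uses JDK checksums, not Hadoop's implementation.
class ChecksumSketch {
    // Computes the checksum of the whole array with the given algorithm.
    static long checksum(Checksum algorithm, byte[] data) {
        algorithm.reset();
        algorithm.update(data, 0, data.length);
        return algorithm.getValue();
    }

    public static void main(String[] args) {
        // "123456789" is the conventional check input for CRC algorithms.
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("CRC32  = %08x%n", checksum(new CRC32(), data));  // cbf43926
        System.out.printf("CRC32C = %08x%n", checksum(new CRC32C(), data)); // e3069283
    }
}
```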
[jira] [Assigned] (HDFS-2487) HA: Administrative CLI to control HA daemons
[ https://issues.apache.org/jira/browse/HDFS-2487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2487: - Assignee: Todd Lipcon (was: Aaron T. Myers) HA: Administrative CLI to control HA daemons Key: HDFS-2487 URL: https://issues.apache.org/jira/browse/HDFS-2487 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs client Affects Versions: HA branch (HDFS-1623) Reporter: Aaron T. Myers Assignee: Todd Lipcon We'll need to have some way of controlling the HA nodes while they're live, probably by adding some more commands to dfsadmin.
[jira] [Assigned] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock
[ https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned HDFS-2379: - Assignee: Todd Lipcon 0.20: Allow block reports to proceed without holding FSDataset lock --- Key: HDFS-2379 URL: https://issues.apache.org/jira/browse/HDFS-2379 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20.206.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt As disks are getting larger and more plentiful, we're seeing DNs with multiple millions of blocks on a single machine. When page cache space is tight, block reports can take multiple minutes to generate. Currently, during the scanning of the data directories to generate a report, the FSVolumeSet lock is held. This causes writes and reads to block, timeout, etc, causing big problems especially for clients like HBase. This JIRA is to explore some of the ideas originally discussed in HADOOP-4584 for the 0.20.20x series.