[jira] [Assigned] (HDFS-3265) PowerPc Build error.

2012-04-12 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-3265:
-

Assignee: Kumar Ravi

 PowerPc Build error.
 

 Key: HDFS-3265
 URL: https://issues.apache.org/jira/browse/HDFS-3265
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.2, 1.0.3, 2.0.0
 Environment: Linux RHEL 6.1 PowerPC + IBM JVM 6.0 SR10
Reporter: Kumar Ravi
Assignee: Kumar Ravi
  Labels: patch
 Attachments: HADOOP-8271.patch, HADOOP-8271.patch

   Original Estimate: 168h
  Remaining Estimate: 168h

 When attempting to build branch-1, the following error is seen and ant exits.
 [exec] configure: error: Unsupported CPU architecture powerpc64
 The following command was used to build hadoop-common
 ant -Dlibhdfs=true -Dcompile.native=true -Dfusedfs=true -Dcompile.c++=true 
 -Dforrest.home=$FORREST_HOME compile-core-native compile-c++ 
 compile-c++-examples task-controller tar record-parser compile-hdfs-classes 
 package -Djava5.home=/opt/ibm/ibm-java2-ppc64-50/ 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (HDFS-3243) TestParallelRead timing out on jenkins

2012-04-10 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-3243:
-

Assignee: Henry Robinson

 TestParallelRead timing out on jenkins
 --

 Key: HDFS-3243
 URL: https://issues.apache.org/jira/browse/HDFS-3243
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client, test
Reporter: Todd Lipcon
Assignee: Henry Robinson

 Trunk builds have been failing recently due to a TestParallelRead timeout. It 
 doesn't report in the Jenkins failure list because surefire handles timeouts 
 really poorly.





[jira] [Assigned] (HDFS-3159) Document NN auto-failover setup and configuration

2012-04-09 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-3159:
-

Assignee: Todd Lipcon

 Document NN auto-failover setup and configuration
 -

 Key: HDFS-3159
 URL: https://issues.apache.org/jira/browse/HDFS-3159
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: auto-failover, documentation, ha
Affects Versions: Auto failover (HDFS-3042)
Reporter: Todd Lipcon
Assignee: Todd Lipcon

 We should document how to configure, set up, and monitor an automatic 
 failover setup. This will require adding the new configs to the *-default.xml 
 and adding prose to the apt docs as well.





[jira] [Assigned] (HDFS-3207) Mechanism for HA failover to ignore fencing errors

2012-04-05 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-3207:
-

Assignee: Todd Lipcon

 Mechanism for HA failover to ignore fencing errors
 --

 Key: HDFS-3207
 URL: https://issues.apache.org/jira/browse/HDFS-3207
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: ha
Affects Versions: HA branch (HDFS-1623)
Reporter: Philip Zeyliger
Assignee: Todd Lipcon

 If an administrator wants to ignore fencing (perhaps because he knows that 
 the other namenode has been taken out of commission for sure, but the fencing 
 is ssh-based), there should be a flag to the failover command to allow this.
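
 A minimal sketch of what such a flag could look like. The Fencer interface, 
 the mayActivateTarget method and the ignoreFencingErrors parameter are 
 hypothetical names used for illustration, not the haadmin API. The only point 
 is that a fencing failure stops the failover unless the administrator has 
 explicitly opted out.

```java
/** Hypothetical illustration, not Hadoop code. */
class FailoverFlagSketch {
  /** Stand-in for a fencing method; returns true if the old active was fenced. */
  interface Fencer {
    boolean tryFence();
  }

  /**
   * Decides whether the failover may go ahead and activate the target.
   * Fencing is always attempted; its failure is tolerated only when the
   * administrator passed the (hypothetical) ignore flag.
   */
  static boolean mayActivateTarget(Fencer fencer, boolean ignoreFencingErrors) {
    boolean fenced = fencer.tryFence();
    return fenced || ignoreFencingErrors;
  }

  public static void main(String[] args) {
    Fencer unreachable = () -> false; // e.g. ssh to a powered-off host fails
    System.out.println("default:   " + mayActivateTarget(unreachable, false));
    System.out.println("with flag: " + mayActivateTarget(unreachable, true));
  }
}
```

 Attempting the fence even when the flag is set keeps the safe path as the 
 default and makes the override an explicit decision by the operator.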





[jira] [Assigned] (HDFS-3102) Add CLI tool to initialize the shared-edits dir

2012-04-02 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-3102:
-

Assignee: Aaron T. Myers

 Add CLI tool to initialize the shared-edits dir
 ---

 Key: HDFS-3102
 URL: https://issues.apache.org/jira/browse/HDFS-3102
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ha, name-node
Affects Versions: 0.24.0, 0.23.3
Reporter: Todd Lipcon
Assignee: Aaron T. Myers

 Currently, in order to convert a non-HA NN to HA, you need to initialize the 
 shared edits dir. This can be done manually by copying directories around 
 with cp. It would be preferable to add a namenode -initializeSharedEdits 
 command to achieve the same effect.





[jira] [Assigned] (HDFS-3072) haadmin should have configurable timeouts for failover commands

2012-03-20 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-3072:
-

Assignee: Todd Lipcon

 haadmin should have configurable timeouts for failover commands
 ---

 Key: HDFS-3072
 URL: https://issues.apache.org/jira/browse/HDFS-3072
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ha
Affects Versions: 0.24.0
Reporter: Philip Zeyliger
Assignee: Todd Lipcon

 The HAAdmin failover should time out reasonably aggressively and go on to 
 the fencing strategies if it's dealing with a mostly dead active namenode.  
 Currently it uses what's probably the default, which is to say no timeout 
 whatsoever.
 {code}
   /**
    * Return a proxy to the specified target service.
    */
   protected HAServiceProtocol getProtocol(String serviceId)
       throws IOException {
     String serviceAddr = getServiceAddr(serviceId);
     InetSocketAddress addr = NetUtils.createSocketAddr(serviceAddr);
     return (HAServiceProtocol)RPC.getProxy(
         HAServiceProtocol.class, HAServiceProtocol.versionID,
         addr, getConf());
   }
 {code}
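
 To illustrate the intended behaviour, here is a rough sketch of bounding a 
 call with a deadline so that the caller can move on to fencing. BoundedCall 
 and completesWithin are hypothetical names, not part of HAAdmin, and the 
 sketch uses only java.util.concurrent rather than any Hadoop RPC timeout 
 setting.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not Hadoop code: runs a call and gives up after a deadline. */
class BoundedCall {
  /**
   * Returns true if the call completed normally within timeoutMs,
   * false if it timed out, failed, or the caller was interrupted.
   */
  static boolean completesWithin(Callable<?> call, long timeoutMs) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<?> result = executor.submit(call);
      result.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException | ExecutionException e) {
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    } finally {
      executor.shutdownNow(); // interrupt a call that is still hanging
    }
  }

  public static void main(String[] args) {
    boolean healthy = completesWithin(() -> "ok", 1000);
    boolean hung = completesWithin(() -> { Thread.sleep(5000); return "late"; }, 100);
    System.out.println("healthy call completed: " + healthy);
    System.out.println("hung call completed:    " + hung);
  }
}
```

 A caller would treat a false result from the graceful transition request as 
 the signal to proceed to the configured fencing methods.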





[jira] [Assigned] (HDFS-3084) FenceMethod.tryFence() and ShellCommandFencer should pass namenodeId as well as host:port

2012-03-19 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-3084:
-

Assignee: Todd Lipcon

 FenceMethod.tryFence() and ShellCommandFencer should pass namenodeId as well 
 as host:port
 -

 Key: HDFS-3084
 URL: https://issues.apache.org/jira/browse/HDFS-3084
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ha
Affects Versions: 0.24.0, 0.23.3
Reporter: Philip Zeyliger
Assignee: Todd Lipcon

 The FenceMethod interface passes along the host:port of the NN that needs to 
 be fenced.  That's great for the common case.  However, it's likely necessary 
 to have extra configuration parameters for fencing, and these are typically 
 keyed off the nameserviceId.namenodeId (if for nothing else, for consistency 
 with all the other parameters that are keyed off of nameserviceId.namenodeId).  
 This can be derived from the host:port, but doing so is inconvenient 
 and requires iterating through all the configs.
 The shell interface exhibits the same issue: host:port is great for most 
 fencers, but if you need extra configs (like the host:port of the power 
 supply unit), those are harder to pipe through without the namenodeId.
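
 As a sketch of why the IDs help: if the fencer is handed the nameserviceId 
 and namenodeId along with host:port, a per-namenode setting becomes a direct 
 lookup. FenceTargetSketch, lookup and the key example.fence.pdu.address are 
 made up for illustration; a plain Map stands in for the Hadoop configuration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration, not the FenceMethod API. */
class FenceTargetSketch {
  final String hostPort;
  final String nameserviceId;
  final String namenodeId;

  FenceTargetSketch(String hostPort, String nameserviceId, String namenodeId) {
    this.hostPort = hostPort;
    this.nameserviceId = nameserviceId;
    this.namenodeId = namenodeId;
  }

  /** Looks up baseKey.nameserviceId.namenodeId, falling back to the bare baseKey. */
  String lookup(Map<String, String> conf, String baseKey) {
    String specific = conf.get(baseKey + "." + nameserviceId + "." + namenodeId);
    return specific != null ? specific : conf.get(baseKey);
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    // Made-up key and value, used only to show the suffix convention.
    conf.put("example.fence.pdu.address.ns1.nn2", "pdu-b.example.com:23");
    FenceTargetSketch target =
        new FenceTargetSketch("nn2.example.com:8020", "ns1", "nn2");
    System.out.println(target.lookup(conf, "example.fence.pdu.address"));
  }
}
```

 Without the IDs, the same lookup needs a reverse search over every configured 
 namenode address to find the one matching host:port.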





[jira] [Assigned] (HDFS-3083) HA+security: failed to run a mapred job from yarn after a manual failover

2012-03-16 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-3083:
-

Assignee: Aaron T. Myers

 HA+security: failed to run a mapred job from yarn after a manual failover
 -

 Key: HDFS-3083
 URL: https://issues.apache.org/jira/browse/HDFS-3083
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, security
Affects Versions: 0.24.0, 0.23.3
Reporter: Mingjie Lai
Assignee: Aaron T. Myers
Priority: Critical
 Fix For: 0.24.0, 0.23.3


 Steps to reproduce:
 - turn on HA and security
 - run a mapred job and wait for it to finish
 - fail over to another namenode
 - run the mapred job again; it fails. 
 Checking the job delegation token shows that it still indicates the original 
 active namenode, which causes the NM to fail to obtain a DT for the new 
 active NN. (?) 
 {code}
 $ hdfs dfs -cat 
 hdfs://ns1:8020/tmp/hadoop-yarn/staging/yarn/.staging/job_1331619043691_0001/appTokens
 HDTS
  ha-hdfs:ns1@(yarn/nn1.hadoop.local@HADOOP.LOCALDOMAINyarn�6
 �L��6.�ЛFs��r�%�B�'��{pR�HDFS_DELEGATION_TOKEN
 ha-hdfs:ns
 {code}
 Exceptions:
 {code}
 12/03/13 06:19:44 INFO mapred.ResourceMgrDelegate: Submitted application 
 application_1331619043691_0002 to ResourceManager at 
 nn1.hadoop.local/10.177.23.38:7090
 12/03/13 06:19:45 INFO mapreduce.Job: The url to track the job: 
 http://nn1.hadoop.local:7050/proxy/application_1331619043691_0002/
 12/03/13 06:19:45 INFO mapreduce.Job: Running job: job_1331619043691_0002
 12/03/13 06:19:47 INFO mapreduce.Job: Job job_1331619043691_0002 running in 
 uber mode : false
 12/03/13 06:19:47 INFO mapreduce.Job:  map 0% reduce 0%
 12/03/13 06:19:47 INFO mapreduce.Job: Job job_1331619043691_0002 failed with 
 state FAILED due to: Application application_1331619043691_0002 failed 1 
 times due to AM Container for appattempt_1331619043691_0002_01 exited 
 with  exitCode: -1000 due to: RemoteTrace: 
 org.apache.hadoop.security.token.SecretManager$InvalidToken: token 
 (HDFS_DELEGATION_TOKEN token 40 for yarn) can't be found in cache
   at org.apache.hadoop.ipc.Client.call(Client.java:1159)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:188)
   at $Proxy28.getFileInfo(Unknown Source)
   at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:622)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
   at $Proxy29.getFileInfo(Unknown Source)
   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1260)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:718)
   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:88)
   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
  at LocalTrace: 
   org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
 token (HDFS_DELEGATION_TOKEN token 40 for yarn) can't be found in cache
   at 
 

[jira] [Assigned] (HDFS-3081) SshFenceByTcpPort uses netcat incorrectly

2012-03-12 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-3081:
-

Assignee: Todd Lipcon

 SshFenceByTcpPort uses netcat incorrectly
 -

 Key: HDFS-3081
 URL: https://issues.apache.org/jira/browse/HDFS-3081
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 0.24.0
Reporter: Philip Zeyliger
Assignee: Todd Lipcon

 SshFenceByTcpPort currently assumes that the NN is listening on localhost.  
 Typical setups have the namenode listening just on the hostname of the 
 namenode, which would lead nc -z to not catch it.
 Here's an example in which the NN is running, listening on 8020, but doesn't 
 respond to localhost 8020.
 {noformat}
 [root@xxx ~]# lsof -P -p 5286 | grep -i listen
 java    5286 root  110u  IPv4 1772357       TCP xxx:8020 (LISTEN)
 java    5286 root  121u  IPv4 1772397       TCP xxx:50070 (LISTEN)
 [root@xxx ~]# nc -z localhost 8020
 [root@xxx ~]# nc -z xxx 8020
 Connection to xxx 8020 port [tcp/intu-ec-svcdisc] succeeded!
 {noformat}
 Here's the likely offending code:
 {code}
 LOG.info(
     "Indeterminate response from trying to kill service.  " +
     "Verifying whether it is running using nc...");
 rc = execCommand(session, "nc -z localhost 8020");
 {code}
 Naively, we could point netcat at the correct hostname (since the NN ought 
 to be listening on the hostname it's configured as), or just use fuser.  
 fuser catches ports regardless of which IPs they're bound to:
 {noformat}
 [root@xxx ~]# fuser 1234/tcp
 1234/tcp: 6766  6768
 [root@xxx ~]# jobs
 [1]-  Running nc -l localhost 1234 
 [2]+  Running nc -l rhel56-18.ent.cloudera.com 1234 
 [root@xxx ~]# sudo lsof -P | grep -i LISTEN | grep -i 1234
 nc      6766  root    3u  IPv4 2563626       TCP localhost:1234 (LISTEN)
 nc      6768  root    3u  IPv4 2563671       TCP xxx:1234 (LISTEN)
 {noformat}
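
 For comparison, a small sketch of the "probe the configured hostname" idea 
 in plain Java. PortProbe and isListening are hypothetical names and this is 
 not how SshFenceByTcpPort is implemented (the fencer runs a command on the 
 remote host over ssh); the sketch only shows that the host being probed has 
 to be the address the NN is bound to, not a hard-coded localhost.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical illustration, not Hadoop code. */
class PortProbe {
  /**
   * Returns true if a TCP connection to host:port succeeds within timeoutMs.
   * Connection refused, timeouts and unresolvable hosts all yield false.
   */
  static boolean isListening(String host, int port, int timeoutMs) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), timeoutMs);
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Pass the hostname the service is configured with; "localhost" is only a default.
    String host = args.length > 0 ? args[0] : "localhost";
    int port = args.length > 1 ? Integer.parseInt(args[1]) : 8020;
    System.out.println(host + ":" + port + " listening: " + isListening(host, port, 2000));
  }
}
```

 Like nc -z, this only sees sockets bound to the probed address, which is why 
 fuser, which works from the port number alone, may be the more robust check.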





[jira] [Assigned] (HDFS-3062) Fail to submit mapred job on a secured-HA-HDFS: logic URI cannot be picked up by job submission.

2012-03-11 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-3062:
-

Assignee: Mingjie Lai  (was: Todd Lipcon)

 Fail to submit mapred job on a secured-HA-HDFS: logic URI cannot be picked up 
 by job submission.
 

 Key: HDFS-3062
 URL: https://issues.apache.org/jira/browse/HDFS-3062
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, security
Affects Versions: 0.24.0
Reporter: Mingjie Lai
Assignee: Mingjie Lai
Priority: Critical
 Fix For: 0.24.0

 Attachments: HDFS-3062-trunk.patch


 When testing the combination of NN HA + security + yarn, I found that the 
 mapred job submission cannot pick up the logical URI of a nameservice. 
 I have the logical URI configured in core-site.xml:
 {code}
 <property>
  <name>fs.defaultFS</name>
  <value>hdfs://ns1</value>
 </property>
 {code}
 HDFS client can work with the HA deployment/configs:
 {code}
 [root@nn1 hadoop]# hdfs dfs -ls /
 Found 6 items
 drwxr-xr-x   - hbase  hadoop  0 2012-03-07 20:42 /hbase
 drwxrwxrwx   - yarn   hadoop  0 2012-03-07 20:42 /logs
 drwxr-xr-x   - mapred hadoop  0 2012-03-07 20:42 /mapred
 drwxr-xr-x   - mapred hadoop  0 2012-03-07 20:42 /mr-history
 drwxrwxrwt   - hdfs   hadoop  0 2012-03-07 21:57 /tmp
 drwxr-xr-x   - hdfs   hadoop  0 2012-03-07 20:42 /user
 {code}
 but cannot submit a mapred job with security turned on
 {code}
 [root@nn1 hadoop]# /usr/lib/hadoop/bin/yarn --config ./conf jar 
 share/hadoop/mapreduce/hadoop-mapreduce-examples-0.24.0-SNAPSHOT.jar 
 randomwriter out
 Running 0 maps.
 Job started: Wed Mar 07 23:28:23 UTC 2012
 java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1
   at 
 org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:431)
   at 
 org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:312)
   at 
 org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:217)
   at 
 org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:119)
   at 
 org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
   at 
 org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
   at 
 org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
   at 
 org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:411)
   at 
 org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:326)
   at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1221)
   at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1218)
 
 {code}0.24
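
 The trace suggests the token service name is being built by resolving "ns1" 
 as if it were a real host. A rough sketch of the alternative follows, 
 assuming the set of logical nameservice names is known to the client. 
 TokenServiceSketch and serviceName are hypothetical, and the "ha-hdfs:" prefix 
 simply mirrors the string visible in the appTokens dump quoted in the 
 HDFS-3083 message in this archive; none of this is claimed to be the actual 
 patch.

```java
import java.net.URI;
import java.util.Collections;
import java.util.Set;

/** Hypothetical illustration, not Hadoop code. */
class TokenServiceSketch {
  /**
   * Builds a token service name without DNS lookups: logical nameservices get
   * a symbolic name, anything else is treated as a real host:port.
   */
  static String serviceName(URI fsUri, Set<String> logicalNameservices, int defaultPort) {
    String host = fsUri.getHost();
    if (logicalNameservices.contains(host)) {
      return "ha-hdfs:" + host; // symbolic; never resolved as a hostname
    }
    int port = fsUri.getPort() == -1 ? defaultPort : fsUri.getPort();
    return host + ":" + port;
  }

  public static void main(String[] args) {
    Set<String> nameservices = Collections.singleton("ns1");
    System.out.println(serviceName(URI.create("hdfs://ns1"), nameservices, 8020));
    System.out.println(serviceName(URI.create("hdfs://nn1.hadoop.local"), nameservices, 8020));
  }
}
```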





[jira] [Assigned] (HDFS-3071) haadmin failover command does not provide enough detail for when target NN is not ready to be active

2012-03-09 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-3071:
-

Assignee: Todd Lipcon

 haadmin failover command does not provide enough detail for when target NN is 
 not ready to be active
 

 Key: HDFS-3071
 URL: https://issues.apache.org/jira/browse/HDFS-3071
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ha
Affects Versions: 0.24.0
Reporter: Philip Zeyliger
Assignee: Todd Lipcon

 When running the failover command, you can get an error message like the 
 following:
 {quote}
 $ hdfs --config $(pwd) haadmin -failover namenode2 namenode1
 Failover failed: xxx.yyy/1.2.3.4:8020 is not ready to become active
 {quote}
 Unfortunately, the error message doesn't describe why that node isn't ready 
 to be active.  In my case, the target namenode's logs don't indicate anything 
 either. It turned out that the issue was "Safe mode is ON. Resources are low 
 on NN. Safe mode must be turned off manually.", but ideally the user would be 
 told that at the time of the failover.





[jira] [Assigned] (HDFS-2185) HA: ZK-based FailoverController

2012-03-02 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2185:
-

Assignee: Todd Lipcon  (was: Bikas Saha)

 HA: ZK-based FailoverController
 ---

 Key: HDFS-2185
 URL: https://issues.apache.org/jira/browse/HDFS-2185
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha
Affects Versions: HA branch (HDFS-1623)
Reporter: Eli Collins
Assignee: Todd Lipcon

 This jira is for a ZK-based FailoverController daemon. The FailoverController 
 is a separate daemon from the NN that does the following:
 * Initiates leader election (via ZK) when necessary
 * Performs health monitoring (aka failure detection)
 * Performs fail-over (standby to active and active to standby transitions)
 * Heartbeats to ensure the liveness
 It should have the same/similar interface as the Linux HA RM to aid 
 pluggability.





[jira] [Assigned] (HDFS-3038) Add FSEditLog.metrics to findbugs exclude list

2012-03-01 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-3038:
-

Assignee: Todd Lipcon

 Add FSEditLog.metrics to findbugs exclude list
 --

 Key: HDFS-3038
 URL: https://issues.apache.org/jira/browse/HDFS-3038
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.24.0, 0.23.3
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Trivial
 Attachments: hdfs-3038.txt


 This fixes a findbugs error on trunk: the field is only reset by tests, so 
 there is no need to worry about its synchronization.





[jira] [Assigned] (HDFS-2731) HA: Autopopulate standby name dirs if they're empty

2012-03-01 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2731:
-

Assignee: Todd Lipcon  (was: Eli Collins)

 HA: Autopopulate standby name dirs if they're empty
 ---

 Key: HDFS-2731
 URL: https://issues.apache.org/jira/browse/HDFS-2731
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha
Affects Versions: HA branch (HDFS-1623)
Reporter: Eli Collins
Assignee: Todd Lipcon

 To set up an SBN we currently format the primary then manually copy the name 
 dirs to the SBN. The SBN should do this automatically. Specifically, on NN 
 startup, if HA with a shared edits dir is configured and populated, and the 
 SBN has empty name dirs, it should download the image and log from the 
 primary (as an optimization it could copy the logs from the shared dir). If 
 the other NN is still in standby then it should fail to start as it does 
 currently.
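
 The startup rule described above can be sketched as a small decision 
 function. All names here (BootstrapDecisionSketch, Action, decide) are 
 hypothetical, and the case of empty name dirs with an unpopulated shared 
 edits dir is not covered by the description, so treating it as a startup 
 failure is an assumption of this sketch.

```java
/** Hypothetical illustration, not Hadoop code. */
class BootstrapDecisionSketch {
  enum Action { START_NORMALLY, DOWNLOAD_FROM_PRIMARY, FAIL_TO_START }

  /**
   * Chooses what a starting NN should do, following the rule in the issue text.
   * The (empty dirs, shared edits not populated) branch is an assumption.
   */
  static Action decide(boolean nameDirsEmpty,
                       boolean sharedEditsPopulated,
                       boolean otherNnActive) {
    if (!nameDirsEmpty) {
      return Action.START_NORMALLY; // nothing to bootstrap
    }
    if (sharedEditsPopulated && otherNnActive) {
      return Action.DOWNLOAD_FROM_PRIMARY; // fetch image and log from the primary
    }
    return Action.FAIL_TO_START; // other NN still in standby, or nothing to copy
  }

  public static void main(String[] args) {
    System.out.println(decide(true, true, true));
    System.out.println(decide(true, true, false));
    System.out.println(decide(false, true, true));
  }
}
```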





[jira] [Assigned] (HDFS-2948) HA: NN throws NPE during shutdown if it fails to startup

2012-02-14 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2948:
-

Assignee: Todd Lipcon

 HA: NN throws NPE during shutdown if it fails to startup
 

 Key: HDFS-2948
 URL: https://issues.apache.org/jira/browse/HDFS-2948
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Minor
 Attachments: hdfs-2948.txt


 Last night's nightly build had a bunch of NPEs thrown in NameNode.stop. Not 
 sure which patch introduced the issue, but the problem is that 
 NameNode.stop() is called if an exception is thrown during startup. If the 
 exception is thrown before the namesystem is created, then 
 NameNode.namesystem is null, and {{namesystem.stop}} throws NPE.
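
 A minimal sketch of the failure mode and the obvious guard. 
 NameNodeStopSketch and its nested Namesystem interface are stand-ins written 
 for this example, not the real classes, and the attached patch may fix the 
 problem differently.

```java
/** Hypothetical stand-in for the NameNode shutdown path, not Hadoop code. */
class NameNodeStopSketch {
  /** Stand-in for the namesystem; only the stop() call matters here. */
  interface Namesystem {
    void stop();
  }

  // Stays null if startup throws before the namesystem is created.
  private Namesystem namesystem;

  void setNamesystem(Namesystem namesystem) {
    this.namesystem = namesystem;
  }

  /** Safe to call after a failed startup: skips components that were never created. */
  void stop() {
    if (namesystem != null) {
      namesystem.stop();
    }
  }

  public static void main(String[] args) {
    NameNodeStopSketch failedStartup = new NameNodeStopSketch();
    failedStartup.stop(); // no NPE even though namesystem was never set
    System.out.println("stop() after failed startup completed");
  }
}
```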





[jira] [Assigned] (HDFS-2920) HA: fix remaining TODO items

2012-02-09 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2920:
-

Assignee: Todd Lipcon

 HA: fix remaining TODO items
 

 Key: HDFS-2920
 URL: https://issues.apache.org/jira/browse/HDFS-2920
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha
Reporter: Eli Collins
Assignee: Todd Lipcon

 There are a number of TODO(HA) and TODO:HA comments we need to fix or 
 remove.





[jira] [Assigned] (HDFS-2878) TestBlockRecovery does not compile

2012-02-09 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2878:
-

Assignee: Todd Lipcon

 TestBlockRecovery does not compile
 --

 Key: HDFS-2878
 URL: https://issues.apache.org/jira/browse/HDFS-2878
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Affects Versions: 0.23.1
Reporter: Eli Collins
Assignee: Todd Lipcon
Priority: Blocker
 Attachments: hdfs-2878.txt


 Looks like HDFS-2563 introduced a compilation error in TestBlockRecovery. We 
 didn't catch this because of HDFS-2876.





[jira] [Assigned] (HDFS-2935) Shared edits dir property should be suffixed with nameservice and namenodeID

2012-02-09 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2935:
-

Assignee: Todd Lipcon

 Shared edits dir property should be suffixed with nameservice and namenodeID
 

 Key: HDFS-2935
 URL: https://issues.apache.org/jira/browse/HDFS-2935
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Vinithra Varadharajan
Assignee: Todd Lipcon

 Similar to the NameNode's name dirs, we should also be able to specify the 
 shared edits dir as dfs.namenode.shared.edits.dir.nameserviceId.nnId.





[jira] [Assigned] (HDFS-2924) Standby checkpointing fails to authenticate in secure cluster

2012-02-08 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2924:
-

Assignee: Todd Lipcon

 Standby checkpointing fails to authenticate in secure cluster
 -

 Key: HDFS-2924
 URL: https://issues.apache.org/jira/browse/HDFS-2924
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node, security
Affects Versions: HA branch (HDFS-1623)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical

 When running HA on a secure cluster, the SBN checkpointing process doesn't 
 seem to pick up the keytab-based credentials for its RPC connection to the 
 active. I think we're just missing a doAs() in the right spot.





[jira] [Assigned] (HDFS-2794) HA: Active NN may purge edit log files before standby NN has a chance to read them

2012-02-03 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2794:
-

Assignee: Todd Lipcon  (was: Aaron T. Myers)

 HA: Active NN may purge edit log files before standby NN has a chance to read 
 them
 --

 Key: HDFS-2794
 URL: https://issues.apache.org/jira/browse/HDFS-2794
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Aaron T. Myers
Assignee: Todd Lipcon

 Given that the active NN is solely responsible for purging finalized edit log 
 segments, and given that the active NN has no way of knowing when the standby 
 reads edit logs, it's possible that the standby NN could fail to read all 
 edits it needs before the active purges the files.
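
 One possible mitigation, sketched here purely as an assumption and not as 
 the fix chosen for this issue, is for the active NN to keep a margin of 
 extra transactions beyond what its own checkpoint needs. The class, method 
 and parameter names below are hypothetical.

```java
/** Hypothetical illustration, not Hadoop code. */
class EditLogRetentionSketch {
  /**
   * Returns the txid below which finalized edit segments may be purged.
   * Keeping extraEditsRetained transactions behind the last checkpoint gives
   * a lagging standby a window in which the edits it needs still exist.
   */
  static long purgeBelowTxId(long lastCheckpointTxId, long extraEditsRetained) {
    return Math.max(0L, lastCheckpointTxId - extraEditsRetained);
  }

  public static void main(String[] args) {
    System.out.println("purge below txid " + purgeBelowTxId(1000L, 300L));
  }
}
```

 A margin only narrows the window; it does not tell the active NN what the 
 standby has actually read, which is the underlying gap this issue describes.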





[jira] [Assigned] (HDFS-2874) HA: edit log should log to shared dirs before local dirs

2012-02-01 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2874:
-

Assignee: Todd Lipcon

 HA: edit log should log to shared dirs before local dirs
 

 Key: HDFS-2874
 URL: https://issues.apache.org/jira/browse/HDFS-2874
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical

 Currently, the NN logs its edits to each of its edits directories in 
 sequence. This can produce the following bad sequence:
 - NN accumulates 100 edits (tx 1-100) in the buffer. Writes and syncs to 
 local drive, then crashes
 - Failover occurs. SBN takes over at txid=1, since txid 1 never got written 
 to the shared edits dir.
 - First NN restarts. It reads up to txid 100 from its local directories. It 
 is now ahead of the active NN with inconsistent state.
 The solution is to write to the shared edits dir, and sync that, before 
 writing to any local drives.
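
The proposed ordering can be sketched as follows. This is a minimal illustration, not the NN's actual journaling code: the Journal interface, isShared() and OrderedEditWriter are hypothetical names. The point it shows is that a batch reaches a local directory only after every shared directory has written and synced it, so a crash can never leave the local copy ahead of the shared one.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch of shared-before-local edit flushing. */
final class OrderedEditWriter {
    private OrderedEditWriter() {}

    /** Hypothetical stand-in for one edits directory. */
    interface Journal {
        boolean isShared();
        void writeAndSync(long firstTxId, long lastTxId) throws IOException;
    }

    static void flush(List<Journal> journals, long firstTxId, long lastTxId)
            throws IOException {
        // Pass 1: shared directories. A failure here aborts before any local write.
        for (Journal j : journals) {
            if (j.isShared()) {
                j.writeAndSync(firstTxId, lastTxId);
            }
        }
        // Pass 2: local directories, only after the shared copy is durable.
        for (Journal j : journals) {
            if (!j.isShared()) {
                j.writeAndSync(firstTxId, lastTxId);
            }
        }
    }
}
```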





[jira] [Assigned] (HDFS-2185) HA: ZK-based FailoverController

2012-01-24 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2185:
-

Assignee: Bikas Saha  (was: Todd Lipcon)

 HA: ZK-based FailoverController
 ---

 Key: HDFS-2185
 URL: https://issues.apache.org/jira/browse/HDFS-2185
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha
Affects Versions: HA branch (HDFS-1623)
Reporter: Eli Collins
Assignee: Bikas Saha

 This jira is for a ZK-based FailoverController daemon. The FailoverController 
 is a separate daemon from the NN that does the following:
 * Initiates leader election (via ZK) when necessary
 * Performs health monitoring (aka failure detection)
 * Performs fail-over (standby to active and active to standby transitions)
 * Heartbeats to ensure liveness
 It should have the same/similar interface as the Linux HA RM to aid 
 pluggability.





[jira] [Assigned] (HDFS-2579) Starting delegation token manager during safemode fails

2012-01-22 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2579:
-

Assignee: Todd Lipcon

 Starting delegation token manager during safemode fails
 ---

 Key: HDFS-2579
 URL: https://issues.apache.org/jira/browse/HDFS-2579
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node, security
Affects Versions: 0.23.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon

 I noticed this on the HA branch, but it seems to actually affect non-HA 
 branch 0.23 if security is enabled. When the NN starts up, if security is 
 enabled, we start the delegation token secret manager, which then tries to 
 call {{logUpdateMasterKey}}. This fails because the edit logs may not be 
 written while in safe-mode.
 It seems to me that there's no real need to create a new master key at 
 startup, since the old key has already been loaded from the FSImage. The only 
 case that lacks a DT master key is a fresh cluster, in which case we could 
 generate one at format time.





[jira] [Assigned] (HDFS-2810) Leases not properly getting renewed by clients

2012-01-19 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2810:
-

Assignee: Todd Lipcon

 Leases not properly getting renewed by clients
 --

 Key: HDFS-2810
 URL: https://issues.apache.org/jira/browse/HDFS-2810
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 0.23.0, 0.24.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical

 We've been testing HBase on clusters running trunk and seen an issue where 
 they seem to lose their HDFS leases after a couple of hours of runtime. We 
 don't quite have enough data to understand what's happening, but the NN is 
 expiring them, claiming the hard lease period has elapsed. The clients report 
 no error until their output stream gets killed underneath them.





[jira] [Assigned] (HDFS-2803) Adding logging to LeaseRenewer for better lease expiration triage.

2012-01-18 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2803:
-

Assignee: Jimmy Xiang

 Adding logging to LeaseRenewer for better lease expiration triage.
 --

 Key: HDFS-2803
 URL: https://issues.apache.org/jira/browse/HDFS-2803
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
  Labels: newbie
 Attachments: hdfs-2803.txt


 It would be helpful to add some logging to LeaseRenewer when the daemon is 
 terminated (Info level) and when the lease is renewed (Debug level). Without 
 this logging, it is hard to know whether a DFS client fails to renew the lease 
 because it hangs or because the lease renewer daemon is gone somehow.
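
The two log points could look roughly like the sketch below. It is a self-contained illustration that uses java.util.logging rather than Hadoop's own logging setup, so Level.FINE stands in for Debug. The class LoggingLeaseRenewer, the renewAction callback and the fixed round count are all hypothetical and do not reflect the real LeaseRenewer.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch of the requested log points; not the real LeaseRenewer. */
final class LoggingLeaseRenewer implements Runnable {
    private static final Logger LOG = Logger.getLogger("LoggingLeaseRenewer");

    private final Runnable renewAction; // assumed callback that renews the lease
    private final int rounds;           // assumed bound so the sketch terminates

    LoggingLeaseRenewer(Runnable renewAction, int rounds) {
        this.renewAction = renewAction;
        this.rounds = rounds;
    }

    @Override
    public void run() {
        try {
            for (int i = 0; i < rounds; i++) {
                renewAction.run();
                // Debug-level record for every successful renewal.
                LOG.log(Level.FINE, "Lease renewed (round {0})", i);
            }
        } finally {
            // Info-level record whenever the daemon exits, normally or not.
            LOG.info("Lease renewer daemon terminated");
        }
    }
}
```

Logging the termination in a finally block means the Info record is also emitted when the renewal callback throws, which is the case that is hardest to triage otherwise.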





[jira] [Assigned] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up

2012-01-16 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2795:
-

Assignee: Todd Lipcon  (was: Aaron T. Myers)

 HA: Standby NN takes a long time to recover from a dead DN starting up
 --

 Key: HDFS-2795
 URL: https://issues.apache.org/jira/browse/HDFS-2795
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: data-node, ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Aaron T. Myers
Assignee: Todd Lipcon
Priority: Critical

 To reproduce:
 # Start an HA cluster with a DN.
 # Write several blocks to the FS with replication 1.
 # Shut down the DN.
 # Wait for the NNs to declare the DN dead. All blocks will be 
 under-replicated.
 # Restart the DN.
 Note that upon restarting the DN, the active NN will immediately get all 
 block locations from the initial BR. The standby NN will not, and instead 
 will slowly add block locations for a subset of the previously-missing blocks 
 on every DN heartbeat.





[jira] [Assigned] (HDFS-2767) HA: ConfiguredFailoverProxyProvider should support NameNodeProtocol

2012-01-10 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2767:
-

Assignee: Todd Lipcon

 HA: ConfiguredFailoverProxyProvider should support NameNodeProtocol
 ---

 Key: HDFS-2767
 URL: https://issues.apache.org/jira/browse/HDFS-2767
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, hdfs client
Affects Versions: HA branch (HDFS-1623)
Reporter: Uma Maheswara Rao G
Assignee: Todd Lipcon
Priority: Blocker

 Presently ConfiguredFailoverProxyProvider supports ClientProtocol.
 It should support NameNodeProtocol also, because Balancer uses 
 NameNodeProtocol for getting blocks.





[jira] [Assigned] (HDFS-2724) NN web UI can throw NPE after startup, before standby state is entered

2012-01-08 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2724:
-

Assignee: Todd Lipcon  (was: Eli Collins)

 NN web UI can throw NPE after startup, before standby state is entered
 --

 Key: HDFS-2724
 URL: https://issues.apache.org/jira/browse/HDFS-2724
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Aaron T. Myers
Assignee: Todd Lipcon

 There's a brief period of time (a few seconds) after the NN web server has 
 been initialized, but before the NN's HA state is initialized. If 
 {{dfshealth.jsp}} is hit during this time, a {{NullPointerException}} will be 
 thrown.





[jira] [Assigned] (HDFS-2733) Document HA configuration and CLI

2012-01-08 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2733:
-

Assignee: Todd Lipcon

 Document HA configuration and CLI
 -

 Key: HDFS-2733
 URL: https://issues.apache.org/jira/browse/HDFS-2733
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: documentation, ha
Affects Versions: HA branch (HDFS-1623)
Reporter: Eli Collins
Assignee: Todd Lipcon

 We need to document the configuration changes in HDFS-2231 and the new CLI 
 introduced by HADOOP-7774.





[jira] [Assigned] (HDFS-2762) TestCheckpoint is timing out

2012-01-06 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2762:
-

Assignee: Uma Maheswara Rao G  (was: Todd Lipcon)

 TestCheckpoint is timing out
 

 Key: HDFS-2762
 URL: https://issues.apache.org/jira/browse/HDFS-2762
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Aaron T. Myers
Assignee: Uma Maheswara Rao G

 TestCheckpoint is timing out on the HA branch, and has been for a few days.





[jira] [Assigned] (HDFS-2738) FSEditLog.selectinputStreams is reading through in-progress streams even when non-in-progress are requested

2012-01-03 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2738:
-

Assignee: Aaron T. Myers  (was: Todd Lipcon)

 FSEditLog.selectinputStreams is reading through in-progress streams even when 
 non-in-progress are requested
 ---

 Key: HDFS-2738
 URL: https://issues.apache.org/jira/browse/HDFS-2738
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Todd Lipcon
Assignee: Aaron T. Myers
Priority: Critical

 The new code in HDFS-1580 is causing an issue with selectInputStreams in the 
 HA context. When the active is writing to the shared edits, 
 selectInputStreams is called on the standby. This ends up calling 
 {{journalSet.getInputStream}} but doesn't pass the {{inProgressOk=false}} 
 flag. So, {{getInputStream}} ends up reading and validating the in-progress 
 stream unnecessarily. Since the validation results are no longer properly 
 cached, {{findMaxTransaction}} also re-validates the in-progress stream, and 
 then breaks the corruption check in this code. The end result is a lot of 
 errors like:
 2011-12-30 16:45:02,521 ERROR namenode.FileJournalManager 
 (FileJournalManager.java:getNumberOfTransactions(266)) - Gap in transactions, 
 max txnid is 579, 0 txns from 578
 2011-12-30 16:45:02,521 INFO  ha.EditLogTailer (EditLogTailer.java:run(163)) 
 - Got error, will try again.
 java.io.IOException: No non-corrupt logs for txid 578
   at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.getInputStream(JournalSet.java:229)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1081)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:115)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$0(EditLogTailer.java:100)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:154)
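
The intended effect of honoring the flag can be sketched as a filter that drops in-progress segments before anything opens or validates them. This is an illustration only: Segment, SegmentSelector and the use of -1 as the last txid of an in-progress segment are hypothetical and do not mirror the real JournalSet or FileJournalManager classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of segment selection that honors inProgressOk. */
final class SegmentSelector {
    private SegmentSelector() {}

    static final class Segment {
        final long firstTxId;
        final long lastTxId;      // -1 while the segment is still in progress
        final boolean inProgress;

        Segment(long firstTxId, long lastTxId, boolean inProgress) {
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
            this.inProgress = inProgress;
        }
    }

    static List<Segment> select(List<Segment> all, long fromTxId, boolean inProgressOk) {
        List<Segment> selected = new ArrayList<Segment>();
        for (Segment s : all) {
            if (s.inProgress && !inProgressOk) {
                continue; // skipped outright, so it is never read or validated
            }
            if (!s.inProgress && s.lastTxId < fromTxId) {
                continue; // finalized segment that ends before the requested txid
            }
            selected.add(s);
        }
        return selected;
    }
}
```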





[jira] [Assigned] (HDFS-2693) Synchronization issues around state transition

2011-12-15 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2693:
-

Assignee: Todd Lipcon

 Synchronization issues around state transition
 --

 Key: HDFS-2693
 URL: https://issues.apache.org/jira/browse/HDFS-2693
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical

 Currently when the NN changes state, it does so without synchronization. In 
 particular, the state transition function does:
 (1) leave old state
 (2) change state variable
 (3) enter new state
 This means that the NN is marked as active before it has actually 
 transitioned to active mode and opened its edit logs. This gives a window 
 where write transactions can come in and the {{checkOperation}} allows them, 
 but then they fail because the edit log is not yet opened.
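
One way to close this window, sketched below under stated assumptions, is to perform the whole transition under a write lock and to publish the new state only after it has been fully entered, while operation checks take the matching read lock. The issue text does not specify the fix; HaStateHolder, tryWrite and the editLogOpen flag are hypothetical names used for illustration.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch of a synchronized HA state transition. */
final class HaStateHolder {
    enum State { STANDBY, ACTIVE }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private State state = State.STANDBY;
    private boolean editLogOpen = false; // stands in for the real edit log

    void transitionTo(State next) {
        lock.writeLock().lock();
        try {
            exit(state);   // (1) leave old state
            enter(next);   // (3) enter new state, e.g. open the edit log
            state = next;  // (2) publish the state variable last
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Returns true if a write may proceed; callers never see a half-entered state. */
    boolean tryWrite() {
        lock.readLock().lock();
        try {
            if (state != State.ACTIVE) {
                return false;
            }
            if (!editLogOpen) {
                throw new IllegalStateException("active without an open edit log");
            }
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }

    private void exit(State s) {
        if (s == State.ACTIVE) {
            editLogOpen = false;
        }
    }

    private void enter(State s) {
        if (s == State.ACTIVE) {
            editLogOpen = true;
        }
    }
}
```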





[jira] [Assigned] (HDFS-1314) dfs.block.size accepts only absolute value

2011-12-10 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-1314:
-

Assignee: Sho Shimauchi  (was: Anwar Abdus-Samad)

Reassigning to Sho since there hasn't been any progress on this and he wanted 
to try it.

 dfs.block.size accepts only absolute value
 --

 Key: HDFS-1314
 URL: https://issues.apache.org/jira/browse/HDFS-1314
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Karim Saadah
Assignee: Sho Shimauchi
Priority: Minor
  Labels: newbie

 Using dfs.block.size=8388608 works 
 but dfs.block.size=8mb does not.
 Using dfs.block.size=8mb should at least log a WARNING about the NumberFormatException.
 (http://pastebin.corp.yahoo.com/56129)
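
For illustration, a size parser that accepts both forms might look like the sketch below. It is a standalone example, not Hadoop's configuration code: BlockSizeParser is a hypothetical name, and the accepted suffixes (k, m, g, t, optionally followed by b, case-insensitive, as powers of 1024) are an assumption about what a value like 8mb should mean.

```java
import java.util.Locale;

/** Hypothetical helper: parses sizes such as "8388608" or "8mb" into bytes. */
final class BlockSizeParser {
    private BlockSizeParser() {}

    static long parse(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.endsWith("b")) {
            v = v.substring(0, v.length() - 1); // "8mb" -> "8m"
        }
        if (v.isEmpty()) {
            throw new NumberFormatException("empty size: " + value);
        }
        long multiplier = 1L;
        // Position in "kmgt" selects 1024^1 .. 1024^4; -1 means no unit suffix.
        int unit = "kmgt".indexOf(v.charAt(v.length() - 1));
        if (unit >= 0) {
            multiplier = 1L << (10 * (unit + 1));
            v = v.substring(0, v.length() - 1);
        }
        // Long.parseLong throws NumberFormatException on anything non-numeric;
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(Long.parseLong(v.trim()), multiplier);
    }
}
```

Both BlockSizeParser.parse("8388608") and BlockSizeParser.parse("8mb") yield 8388608, while an unknown suffix such as "8xb" still fails with a NumberFormatException that the caller can log.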





[jira] [Assigned] (HDFS-2626) HA: BPOfferService.verifyAndSetNamespaceInfo needs to be synchronized

2011-12-02 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2626:
-

Assignee: Todd Lipcon

 HA: BPOfferService.verifyAndSetNamespaceInfo needs to be synchronized
 -

 Key: HDFS-2626
 URL: https://issues.apache.org/jira/browse/HDFS-2626
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Attachments: hdfs-2626.txt


 When starting an HA blockpool with both namenodes up, I often see an NPE, 
 referenced by one of the TODOs. The issue is the following interleaving:
 - first BPActor registers, and sets bpNSInfo in BPOfferService. It then 
 proceeds to initFsDataset which takes a little bit of time
 - second BPActor registers, and sees bpNSInfo is non-null, then proceeds to 
 heartbeat loop. Meanwhile, the first BPActor is still initializing FSDataset
 - second BPActor gets an NPE on first heartbeat since fsdataset is still null.
 We just need to synchronize that function to fix the NPE.
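
The effect of synchronizing can be shown with a simplified model. This is a sketch, not the real BPOfferService: BlockPoolState, the String namespace info and the Object dataset are hypothetical stand-ins, and the sleep only imitates the slow dataset initialization. Because the second caller blocks on the monitor until the first has finished, it can no longer observe a non-null namespace info together with a null dataset.

```java
/** Hypothetical, simplified model of the synchronized registration step. */
final class BlockPoolState {
    private String nsInfo;   // stands in for bpNSInfo
    private Object dataset;  // stands in for fsdataset

    synchronized void verifyAndSetNamespaceInfo(String reported) {
        if (nsInfo == null) {
            nsInfo = reported;
            dataset = initDataset(); // slow; other actors wait on the monitor
        } else if (!nsInfo.equals(reported)) {
            throw new IllegalStateException("namespace mismatch: " + reported);
        }
    }

    synchronized Object dataset() {
        return dataset;
    }

    private Object initDataset() {
        try {
            Thread.sleep(50); // imitates the time initFsDataset takes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new Object();
    }
}
```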





[jira] [Assigned] (HDFS-1972) HA: Datanode fencing mechanism

2011-11-30 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-1972:
-

Assignee: Todd Lipcon  (was: Suresh Srinivas)

Going to assign this to myself since I'm actively working on it. Suresh, if you 
had started with any design or code, feel free to post what you've got and I'll 
take it from there.

 HA: Datanode fencing mechanism
 --

 Key: HDFS-1972
 URL: https://issues.apache.org/jira/browse/HDFS-1972
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: data-node, name-node
Reporter: Suresh Srinivas
Assignee: Todd Lipcon

 In a high availability setup with an active and a standby namenode, two 
 namenodes may send commands to the same datanode. The datanode 
 must honor commands from only the active namenode and reject commands 
 from the standby, to prevent corruption. This invariant must hold 
 during failover and in other states such as split brain. This jira addresses 
 the related issues, the design of the solution, and its implementation.





[jira] [Assigned] (HDFS-2185) HA: ZK-based FailoverController

2011-10-31 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2185:
-

Assignee: Todd Lipcon  (was: Eli Collins)

 HA: ZK-based FailoverController
 ---

 Key: HDFS-2185
 URL: https://issues.apache.org/jira/browse/HDFS-2185
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Eli Collins
Assignee: Todd Lipcon

 This jira is for a ZK-based FailoverController daemon. The FailoverController 
 is a separate daemon from the NN that does the following:
 * Initiates leader election (via ZK) when necessary
 * Performs health monitoring (aka failure detection)
 * Performs fail-over (standby to active and active to standby transitions)
 * Heartbeats to ensure liveness
 It should have the same/similar interface as the Linux HA RM to aid 
 pluggability.





[jira] [Assigned] (HDFS-2130) Switch default checksum to CRC32C

2011-10-31 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2130:
-

Assignee: Todd Lipcon

 Switch default checksum to CRC32C
 -

 Key: HDFS-2130
 URL: https://issues.apache.org/jira/browse/HDFS-2130
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs client
Reporter: Todd Lipcon
Assignee: Todd Lipcon

 Once the other subtasks/parts of HDFS-2080 are complete, CRC32C will be a 
 much more efficient checksum algorithm than CRC32. Hence we should change the 
 default checksum to CRC32C.
 However, in order to continue to support append against blocks created with 
 the old checksum, we will need to implement some kind of handshaking in the 
 write pipeline.





[jira] [Assigned] (HDFS-2487) HA: Administrative CLI to control HA daemons

2011-10-26 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2487:
-

Assignee: Todd Lipcon  (was: Aaron T. Myers)

 HA: Administrative CLI to control HA daemons
 

 Key: HDFS-2487
 URL: https://issues.apache.org/jira/browse/HDFS-2487
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs client
Affects Versions: HA branch (HDFS-1623)
Reporter: Aaron T. Myers
Assignee: Todd Lipcon

 We'll need to have some way of controlling the HA nodes while they're live, 
 probably by adding some more commands to dfsadmin.





[jira] [Assigned] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-17 Thread Todd Lipcon (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned HDFS-2379:
-

Assignee: Todd Lipcon

 0.20: Allow block reports to proceed without holding FSDataset lock
 ---

 Key: HDFS-2379
 URL: https://issues.apache.org/jira/browse/HDFS-2379
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.206.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
 Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
 hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt


 As disks are getting larger and more plentiful, we're seeing DNs with 
 multiple millions of blocks on a single machine. When page cache space is 
 tight, block reports can take multiple minutes to generate. Currently, during 
 the scanning of the data directories to generate a report, the FSVolumeSet 
 lock is held. This causes writes and reads to block, timeout, etc, causing 
 big problems especially for clients like HBase.
 This JIRA is to explore some of the ideas originally discussed in HADOOP-4584 
 for the 0.20.20x series.
