RE: Permission Denied
It looks like / is owned by hadoop:supergroup and the perms are 755. You could pre-create /accumulo and chown it appropriately (a sketch of those commands is after the quoted message below), or set the perms on / to 775. Init is trying to create /accumulo in HDFS as the accumulo user and your perms don't allow it. Do you have instance.volumes set in accumulo-site.xml?

-------- Original message --------
From: David Patterson patt...@gmail.com
Date: 03/01/2015 3:36 PM (GMT-05:00)
To: user@hadoop.apache.org
Subject: Permission Denied

I'm trying to create an Accumulo/Hadoop/Zookeeper configuration on a single (Ubuntu) machine, with Hadoop 2.6.0, Zookeeper 3.4.6 and Accumulo 1.6.1. I've got 3 userids for these components that are in the same group, and no other users are in that group.

I have zookeeper running, and hadoop as well. Hadoop's core-site.xml file has hadoop.tmp.dir set to /app/hadoop/tmp. The /app/hadoop/tmp directory is owned by the hadoop user and has permissions that allow other members of the group to write (drwxrwxr-x).

When I try to initialize Accumulo with bin/accumulo init, I get:

  FATAL: Failed to initialize filesystem.
  org.apache.hadoop.security.AccessControlException: Permission denied: user=accumulo, access=WRITE, inode=/:hadoop:supergroup:drwxr-xr-x

So, my main question is: which directory do I need to give group-write permission so the accumulo user can write what it needs to initialize?

The second problem is that the Accumulo init reports:

  [Configuration.deprecation] INFO : fs.default.name is deprecated. Instead use fs.defaultFS.

However, the hadoop core-site.xml file contains:

  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>

Is there somewhere else that this value (fs.default.name) is specified? Could it be due to Accumulo having a default value and not getting the override from hadoop because of the problem listed above?

Thanks

Dave Patterson
patt...@gmail.com
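For reference, a minimal sketch of the pre-create approach (assuming the default /accumulo volume and that the hadoop user is the HDFS superuser; run these as that user):

  hdfs dfs -mkdir /accumulo
  hdfs dfs -chown accumulo /accumulo

After that, bin/accumulo init can create everything it needs under /accumulo without loosening the permissions on /. If instance.volumes points somewhere else, pre-create and chown that path instead.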
RE: missing data blocks after active name node crashes
I believe there was an issue fixed in 2.5 or 2.6 where the standby NN would not process block reports from the DNs while it was dealing with the checkpoint process. The missing blocks will get reported eventually (a quick way to verify this is sketched after the quoted thread below).

-------- Original message --------
From: Chen Song chen.song...@gmail.com
Date: 02/10/2015 2:44 PM (GMT-05:00)
To: user@hadoop.apache.org, Ravi Prakash ravi...@ymail.com
Subject: Re: missing data blocks after active name node crashes

Thanks for the reply, Ravi.

In my case, what I see constantly is that there are always missing blocks every time the active name node crashes. The active name node crashes because of a timeout on the journal nodes. Could this be a specific case which could lead to missing blocks?

Chen

On Tue, Feb 10, 2015 at 2:20 PM, Ravi Prakash ravi...@ymail.com wrote:

Hi Chen! From my understanding, every operation on the Namenode is logged (and flushed) to disk / QJM / shared storage. This includes the addBlock operation. So when a client requests to write a new block, the metadata is logged by the active NN, so even if it crashes later on, the new active NN would still see the creation of the block.

HTH
Ravi

On Tuesday, February 10, 2015 9:38 AM, Chen Song chen.song...@gmail.com wrote:

When the active name node crashes, it seems there is always a chance that the data blocks in flight will be missing. My understanding is that when the active name node crashes, the metadata of data blocks in transition, which exists in the active name node's memory, is not successfully captured by the journal nodes and is thus not available on the standby name node when it is promoted to active by zkfc. Is my understanding correct? Any way to mitigate this problem or race condition?

--
Chen Song
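A quick way to check whether the blocks are actually gone or just not yet reported to the newly active NN (a sketch; run it once right after the failover and again after the standby has had time to process block reports):

  hdfs fsck / -list-corruptfileblocks

If the list shrinks to empty on the second run, you are seeing the block-report lag described above rather than real data loss.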
Client usage with multiple clusters
I'm having an issue in client code where there are multiple clusters with HA namenodes involved. Example setup using Hadoop 2.3.0:

Cluster A has the following properties defined in core-site.xml, hdfs-site.xml, etc.:

  dfs.nameservices=clusterA
  dfs.ha.namenodes.clusterA=nn1,nn2
  dfs.namenode.rpc-address.clusterA.nn1=
  dfs.namenode.http-address.clusterA.nn1=
  dfs.namenode.rpc-address.clusterA.nn2=
  dfs.namenode.http-address.clusterA.nn2=
  dfs.client.failover.proxy.provider.clusterA=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

Cluster B has similar properties defined in its core-site.xml, hdfs-site.xml, etc.

Now, I want to be able to distcp from clusterA to clusterB. Regardless of which cluster I am executing this from, neither has all of the information. Looking at DFSClient and DataNode:

- If I put both clusterA and clusterB into dfs.nameservices, then the datanodes will try to federate the blocks from both nameservices.
- If I don't put both clusterA and clusterB into dfs.nameservices, then the client won't know how to resolve both namenodes for the nameservices in the distcp command.

I'm wondering if I am missing a property or something that will allow me to define both nameservices on both clusters and have the datanodes for the cluster *not* try to federate. Looking at DataNode, it appears that it tries to connect to all namenodes defined, and the first one that sets the clusterid wins. It seems that there should be a dfs.datanode.clusterid property that the datanode uses. This seems to line up with the 'namenode -format -clusterid cluster' command when you have multiple nameservices.

Am I missing something in the configuration that will allow me to do what I want? To get distcp to work I had to create a 3rd set of configuration files just for the client to use (a sketch of such a client-only configuration is below this message).
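For reference, a minimal sketch of what that client-only hdfs-site.xml can look like. The host names and port 8020 below are made-up placeholders; the point is that only the client reads this file, so the datanodes never see both nameservices and never try to federate:

  <property><name>dfs.nameservices</name><value>clusterA,clusterB</value></property>

  <property><name>dfs.ha.namenodes.clusterA</name><value>nn1,nn2</value></property>
  <property><name>dfs.namenode.rpc-address.clusterA.nn1</name><value>a-nn1.example.com:8020</value></property>
  <property><name>dfs.namenode.rpc-address.clusterA.nn2</name><value>a-nn2.example.com:8020</value></property>
  <property><name>dfs.client.failover.proxy.provider.clusterA</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>

  <property><name>dfs.ha.namenodes.clusterB</name><value>nn1,nn2</value></property>
  <property><name>dfs.namenode.rpc-address.clusterB.nn1</name><value>b-nn1.example.com:8020</value></property>
  <property><name>dfs.namenode.rpc-address.clusterB.nn2</name><value>b-nn2.example.com:8020</value></property>
  <property><name>dfs.client.failover.proxy.provider.clusterB</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>

With that on the client's classpath, distcp can then be run against the logical nameservice URIs, e.g.:

  hadoop distcp hdfs://clusterA/src/path hdfs://clusterB/dest/path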
RE: Which Hadoop 2.x .jars are necessary for Apache Commons VFS HDFS access?
Hi Roger,

I wrote the HDFS provider for Commons VFS. I went back and looked at the source and tests, and I don't see anything wrong with what you are doing. I did develop it against Hadoop 1.1.2 at the time, so there might be an issue that is not accounted for with Hadoop 2. It was also not tested with security turned on. Are you using security?

Dave

From: roger.whitc...@actian.com
To: user@hadoop.apache.org
Subject: Which Hadoop 2.x .jars are necessary for Apache Commons VFS HDFS access?
Date: Fri, 11 Apr 2014 20:20:06 +0000

Hi, I'm fairly new to Hadoop, but not to Apache, and I'm having a newbie kind of issue browsing HDFS files. I have written an Apache Commons VFS (Virtual File System) browser for the Apache Pivot GUI framework (I'm the PMC Chair for Pivot: full disclosure). And now I'm trying to get this browser to work with HDFS to do HDFS browsing from our application. I'm running into a problem, which seems sort of basic, so I thought I'd ask here...

So, I downloaded Hadoop 2.3.0 from one of the mirrors, and was able to track down sort of the minimum set of .jars necessary to at least (try to) connect using Commons VFS 2.1:

  commons-collections-3.2.1.jar
  commons-configuration-1.6.jar
  commons-lang-2.6.jar
  commons-vfs2-2.1-SNAPSHOT.jar
  guava-11.0.2.jar
  hadoop-auth-2.3.0.jar
  hadoop-common-2.3.0.jar
  log4j-1.2.17.jar
  slf4j-api-1.7.5.jar
  slf4j-log4j12-1.7.5.jar

What's happening now is that I instantiated the HdfsProvider this way:

  private static DefaultFileSystemManager manager = null;

  static {
      manager = new DefaultFileSystemManager();
      try {
          manager.setFilesCache(new DefaultFilesCache());
          manager.addProvider("hdfs", new HdfsFileProvider());
          manager.setFileContentInfoFactory(new FileContentInfoFilenameFactory());
          manager.setFilesCache(new SoftRefFilesCache());
          manager.setReplicator(new DefaultFileReplicator());
          manager.setCacheStrategy(CacheStrategy.ON_RESOLVE);
          manager.init();
      } catch (final FileSystemException e) {
          throw new RuntimeException(Intl.getString("object#manager.setupError"), e);
      }
  }

Then, I try to browse into an HDFS system this way:

  String url = String.format("hdfs://%1$s:%2$d/%3$s", "hadoop-master", 50070, hdfsPath);
  return manager.resolveFile(url);

Note: the client is running on Windows 7 (but could be any system that runs Java), and the target has been one of several Hadoop clusters on Ubuntu VMs (basically the same thing happens no matter which Hadoop installation I try to hit). So I'm guessing the problem is in my client configuration.

This attempt to basically just connect to HDFS results in a bunch of error messages in the log file, which looks like it is trying to do user validation on the local machine instead of against the Hadoop (remote) cluster:

  Apr 11,2014 18:27:38.640 GMT T[AWT-EventQueue-0](26) DEBUG FileObjectManager: Trying to resolve file reference 'hdfs://hadoop-master:50070/'
  Apr 11,2014 18:27:38.953 GMT T[AWT-EventQueue-0](26) INFO org.apache.hadoop.conf.Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
  Apr 11,2014 18:27:39.078 GMT T[AWT-EventQueue-0](26) DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, value=[Rate of successful kerberos logins and latency (milliseconds)], about=, type=DEFAULT, always=false, sampleName=Ops)
  Apr 11,2014 18:27:39.094 GMT T[AWT-EventQueue-0](26) DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, value=[Rate of failed kerberos logins and latency (milliseconds)], about=, type=DEFAULT, always=false, sampleName=Ops)
  Apr 11,2014 18:27:39.094 GMT T[AWT-EventQueue-0](26) DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, value=[GetGroups], about=, type=DEFAULT, always=false, sampleName=Ops)
  Apr 11,2014 18:27:39.094 GMT T[AWT-EventQueue-0](26) DEBUG MetricsSystemImpl: UgiMetrics, User and group related metrics
  Apr 11,2014 18:27:39.344 GMT T[AWT-EventQueue-0](26) DEBUG Groups: Creating new Groups object
  Apr 11,2014 18:27:39.344 GMT T[AWT-EventQueue-0](26) DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
  Apr 11,2014 18:27:39.360 GMT T[AWT-EventQueue-0](26) DEBUG NativeCodeLoader: Failed to load native-hadoop with error:
RE: Which Hadoop 2.x .jars are necessary for Apache Commons VFS HDFS access?
Also, make sure that the jars on the classpath actually contain the HDFS file system. I'm looking at:

  No FileSystem for scheme: hdfs

which is an indicator for this condition (see the note after the quoted message below).

Dave

From: dlmar...@hotmail.com
To: user@hadoop.apache.org
Subject: RE: Which Hadoop 2.x .jars are necessary for Apache Commons VFS HDFS access?
Date: Fri, 11 Apr 2014 23:48:48 +0000

Hi Roger,

I wrote the HDFS provider for Commons VFS. I went back and looked at the source and tests, and I don't see anything wrong with what you are doing. I did develop it against Hadoop 1.1.2 at the time, so there might be an issue that is not accounted for with Hadoop 2. It was also not tested with security turned on. Are you using security?

Dave

From: roger.whitc...@actian.com
To: user@hadoop.apache.org
Subject: Which Hadoop 2.x .jars are necessary for Apache Commons VFS HDFS access?
Date: Fri, 11 Apr 2014 20:20:06 +0000

Hi, I'm fairly new to Hadoop, but not to Apache, and I'm having a newbie kind of issue browsing HDFS files. I have written an Apache Commons VFS (Virtual File System) browser for the Apache Pivot GUI framework (I'm the PMC Chair for Pivot: full disclosure). And now I'm trying to get this browser to work with HDFS to do HDFS browsing from our application. I'm running into a problem, which seems sort of basic, so I thought I'd ask here...

So, I downloaded Hadoop 2.3.0 from one of the mirrors, and was able to track down sort of the minimum set of .jars necessary to at least (try to) connect using Commons VFS 2.1:

  commons-collections-3.2.1.jar
  commons-configuration-1.6.jar
  commons-lang-2.6.jar
  commons-vfs2-2.1-SNAPSHOT.jar
  guava-11.0.2.jar
  hadoop-auth-2.3.0.jar
  hadoop-common-2.3.0.jar
  log4j-1.2.17.jar
  slf4j-api-1.7.5.jar
  slf4j-log4j12-1.7.5.jar

What's happening now is that I instantiated the HdfsProvider this way:

  private static DefaultFileSystemManager manager = null;

  static {
      manager = new DefaultFileSystemManager();
      try {
          manager.setFilesCache(new DefaultFilesCache());
          manager.addProvider("hdfs", new HdfsFileProvider());
          manager.setFileContentInfoFactory(new FileContentInfoFilenameFactory());
          manager.setFilesCache(new SoftRefFilesCache());
          manager.setReplicator(new DefaultFileReplicator());
          manager.setCacheStrategy(CacheStrategy.ON_RESOLVE);
          manager.init();
      } catch (final FileSystemException e) {
          throw new RuntimeException(Intl.getString("object#manager.setupError"), e);
      }
  }

Then, I try to browse into an HDFS system this way:

  String url = String.format("hdfs://%1$s:%2$d/%3$s", "hadoop-master", 50070, hdfsPath);
  return manager.resolveFile(url);

Note: the client is running on Windows 7 (but could be any system that runs Java), and the target has been one of several Hadoop clusters on Ubuntu VMs (basically the same thing happens no matter which Hadoop installation I try to hit). So I'm guessing the problem is in my client configuration.

This attempt to basically just connect to HDFS results in a bunch of error messages in the log file, which looks like it is trying to do user validation on the local machine instead of against the Hadoop (remote) cluster:

  Apr 11,2014 18:27:38.640 GMT T[AWT-EventQueue-0](26) DEBUG FileObjectManager: Trying to resolve file reference 'hdfs://hadoop-master:50070/'
  Apr 11,2014 18:27:38.953 GMT T[AWT-EventQueue-0](26) INFO org.apache.hadoop.conf.Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
  Apr 11,2014 18:27:39.078 GMT T[AWT-EventQueue-0](26) DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, value=[Rate of successful kerberos logins and latency (milliseconds)], about=, type=DEFAULT, always=false, sampleName=Ops)
  Apr 11,2014 18:27:39.094 GMT T[AWT-EventQueue-0](26) DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, value=[Rate of failed kerberos logins and latency (milliseconds)], about=, type=DEFAULT, always=false, sampleName=Ops)
  Apr 11,2014 18:27:39.094 GMT T[AWT-EventQueue-0](26) DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, value=[GetGroups], about=, type=DEFAULT, always=false, sampleName=Ops)
  Apr 11,2014 18:27:39.094 GMT T[AWT-EventQueue-0](26) DEBUG MetricsSystemImpl: UgiMetrics, User and group
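For what it's worth, the "No FileSystem for scheme: hdfs" symptom with a hand-assembled classpath like the one above usually means the hdfs scheme implementation itself is missing: hadoop-common provides the local and generic file systems, while org.apache.hadoop.hdfs.DistributedFileSystem ships in the separate HDFS jar. A sketch of the likely addition (the jar name is assumed to match the 2.3.0 versions already listed, and it may pull in a few transitive dependencies of its own):

  hadoop-hdfs-2.3.0.jar

If the scheme is still not picked up, the implementation can also be pinned explicitly in a client-side core-site.xml on the classpath:

  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>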
RE: HA NN Failover question
I think I found the issue. The ZKFC on the standby NN server tried, and failed, to connect to the standby NN when I shut down the network on the active NN server. I'm getting an exception from the HealthMonitor in the ZKFC log:

  WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at host/ip:port.
  INFO org.apache.hadoop.ipc.Client: Retrying connect to server host/ip:port. Already tried 0 time(s); retry policy is (the default)

Is it significant that it thinks the address is host/ip, instead of just the host or the ip?

From: azury...@gmail.com
Subject: Re: HA NN Failover question
Date: Sat, 15 Mar 2014 11:35:20 +0800
To: user@hadoop.apache.org

I suppose NN2 is standby; please check that ZKFC2 is alive before you stop the network on nn1.

On March 15, 2014, at 10:53, dlmarion dlmar...@hotmail.com wrote:

Apache Hadoop 2.3.0

-------- Original message --------
From: Azuryy
Date: 03/14/2014 10:45 PM (GMT-05:00)
To: user@hadoop.apache.org
Subject: Re: HA NN Failover question

Which Hadoop version did you use?

On March 15, 2014, at 9:29, dlmarion dlmar...@hotmail.com wrote:

Server 1: NN1 and ZKFC1
Server 2: NN2 and ZKFC2
Server 3: Journal1 and ZK1
Server 4: Journal2 and ZK2
Server 5: Journal3 and ZK3
Server 6+: Datanodes

All in the same rack. I would expect the ZKFC on the active name node server to lose its lock and the other ZKFC to tell the standby namenode that it should become active (I'm assuming that's how it works).

- Dave

From: Juan Carlos [mailto:juc...@gmail.com]
Sent: Friday, March 14, 2014 9:12 PM
To: user@hadoop.apache.org
Subject: Re: HA NN Failover question

Hi Dave, how many zookeeper servers do you have, and where are they?

Juan Carlos Fernández Rodríguez

On 15/03/2014, at 01:21, dlmarion dlmar...@hotmail.com wrote:

I was doing some testing with HA NN today. I set up two NNs with automatic failover (ZKFC) using sshfence. I tested that it's working on both NNs by doing 'kill -9 pid' on the active NN. When I did this on the active node, the standby would become the active and everything seemed to work.

Next, I logged onto the active NN and did a 'service network stop' to simulate a NIC/network failure. The standby did not become the active in this scenario. In fact, it remained in standby mode and complained in the log that it could not communicate with (what was) the active NN. I was unable to find anything relevant via searches in Google and Jira. Does anyone have experience successfully testing this? I'm hoping that it is just a configuration problem. FWIW, when the network was restarted on the active NN, it failed over almost immediately.

Thanks,
Dave
RE: HA NN Failover question
Found this: http://grokbase.com/t/cloudera/cdh-user/12anhyr8ht/cdh4-failover-controllers

Then I configured dfs.ha.fencing.methods to contain both sshfence and shell(/bin/true). Note that the docs for core-default.xml say that the value is a list; I tried a comma with no luck, and had to look in the source to find that it's separated by a newline. Adding shell(/bin/true) allowed the failover to work successfully (a sketch of the resulting property is after the quoted thread below).

From: dlmar...@hotmail.com
To: user@hadoop.apache.org
Subject: RE: HA NN Failover question
Date: Tue, 18 Mar 2014 14:51:25 +0000

I think I found the issue. The ZKFC on the standby NN server tried, and failed, to connect to the standby NN when I shut down the network on the active NN server. I'm getting an exception from the HealthMonitor in the ZKFC log:

  WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at host/ip:port.
  INFO org.apache.hadoop.ipc.Client: Retrying connect to server host/ip:port. Already tried 0 time(s); retry policy is (the default)

Is it significant that it thinks the address is host/ip, instead of just the host or the ip?

From: azury...@gmail.com
Subject: Re: HA NN Failover question
Date: Sat, 15 Mar 2014 11:35:20 +0800
To: user@hadoop.apache.org

I suppose NN2 is standby; please check that ZKFC2 is alive before you stop the network on nn1.

On March 15, 2014, at 10:53, dlmarion dlmar...@hotmail.com wrote:

Apache Hadoop 2.3.0

-------- Original message --------
From: Azuryy
Date: 03/14/2014 10:45 PM (GMT-05:00)
To: user@hadoop.apache.org
Subject: Re: HA NN Failover question

Which Hadoop version did you use?

On March 15, 2014, at 9:29, dlmarion dlmar...@hotmail.com wrote:

Server 1: NN1 and ZKFC1
Server 2: NN2 and ZKFC2
Server 3: Journal1 and ZK1
Server 4: Journal2 and ZK2
Server 5: Journal3 and ZK3
Server 6+: Datanodes

All in the same rack. I would expect the ZKFC on the active name node server to lose its lock and the other ZKFC to tell the standby namenode that it should become active (I'm assuming that's how it works).

- Dave

From: Juan Carlos [mailto:juc...@gmail.com]
Sent: Friday, March 14, 2014 9:12 PM
To: user@hadoop.apache.org
Subject: Re: HA NN Failover question

Hi Dave, how many zookeeper servers do you have, and where are they?

Juan Carlos Fernández Rodríguez

On 15/03/2014, at 01:21, dlmarion dlmar...@hotmail.com wrote:

I was doing some testing with HA NN today. I set up two NNs with automatic failover (ZKFC) using sshfence. I tested that it's working on both NNs by doing 'kill -9 pid' on the active NN. When I did this on the active node, the standby would become the active and everything seemed to work.

Next, I logged onto the active NN and did a 'service network stop' to simulate a NIC/network failure. The standby did not become the active in this scenario. In fact, it remained in standby mode and complained in the log that it could not communicate with (what was) the active NN. I was unable to find anything relevant via searches in Google and Jira. Does anyone have experience successfully testing this? I'm hoping that it is just a configuration problem. FWIW, when the network was restarted on the active NN, it failed over almost immediately.

Thanks,
Dave
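For reference, a minimal sketch of the resulting fencing configuration in hdfs-site.xml (note the newline, not a comma, between the two methods; the shell(/bin/true) entry makes fencing succeed when sshfence cannot reach the dead node, which is exactly the network-down case above):

  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence
shell(/bin/true)</value>
  </property>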