Recently we had a namenode with a failed edits directory, and a failover occurred. Things appeared to be functioning properly at first, but later we ran into HDFS issues.
Looking at the namenode logs, we saw:

2016-06-01 20:38:18,771 ERROR org.apache.hadoop.net.ScriptBasedMapping: Script /etc/hadoop/conf/getRackID.sh returned 0 values when 1 were expected.
2016-06-01 20:38:18,771 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 10.51.28.100:42826 Call#484441029 Retry#0
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.sortLocatedBlocks(DatanodeManager.java:359)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1774)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:527)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:85)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:356)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

So we could see that our rack awareness script was not returning a value for some inputs. We then modified the script to log the arguments it was being called with, and found a list of IPs: some belong to hosts running services like Oozie, and some belong to our gateway servers.
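For reference, the debugging change we made looked roughly like the following. This is a hypothetical stand-in for our site-specific getRackID.sh (the log path and rack value are illustrative, not our real configuration): it records every argument the namenode passes in, and always emits one value per argument so the namenode never sees "0 values when 1 were expected".

```shell
#!/bin/bash
# Illustrative sketch, not our real getRackID.sh: log the caller's
# arguments so we can see which IPs the namenode asks about.
LOG="${TMPDIR:-/tmp}/rack-script-args.log"

log_callers() {
  echo "called with: $*" >> "$LOG"
}

log_callers "$@"

# The real rack lookup would go here; for the sketch we just emit one
# value per argument, which is what the namenode requires.
for ip in "$@"; do
  echo "/default-rack"
done
```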
However, none of these IPs are the datanodes themselves. The symptoms of this issue were that the namenode itself couldn't cat files on the system, or make requests to move files on the history server, etc. From my understanding of rack awareness, we only need to provide a rack ID for hosts that are datanodes. However, all our datanodes were already listed in the script, and the requested IPs were from non-datanodes. The solution was to have the rack awareness script return a default rack ID for IPs it does not recognize. This is not well explained in the rack awareness docs, and the problem effectively caused a DoS on our Hadoop services. But I want to know why the rack awareness script is getting called with IPs of non-datanodes by our Hadoop namenode. Is this a design feature of the YARN libraries? Why do non-datanode IPs need a rack ID?
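The fix can be sketched like this (a minimal illustration, not our production script; the IPs and rack names are made up): map known datanode IPs to racks, and fall back to a default rack for anything else, so every argument always gets exactly one answer.

```shell
#!/bin/bash
# Minimal sketch of a topology script with a default rack for unknown
# IPs. The addresses and rack names below are hypothetical examples.
declare -A RACKS=(
  [10.51.28.10]="/rack1"
  [10.51.28.11]="/rack1"
  [10.51.28.20]="/rack2"
)
DEFAULT_RACK="/default-rack"

# Emit one rack per requested host; unknown hosts get the default rack
# instead of nothing, which is what triggered the NullPointerException.
resolve_racks() {
  local host
  for host in "$@"; do
    echo "${RACKS[$host]:-$DEFAULT_RACK}"
  done
}

resolve_racks "$@"
```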