Recently one of our NameNodes had a failed edits directory, and
there was a failover. Things appeared to be functioning properly at
first, but later we ran into HDFS issues.

Looking at the NameNode logs, we saw:

2016-06-01 20:38:18,771 ERROR
org.apache.hadoop.net.ScriptBasedMapping: Script
/etc/hadoop/conf/getRackID.sh returned 0 values when 1 were expected.
2016-06-01 20:38:18,771 WARN org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 8020, call
org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from
10.51.28.100:42826 Call#484441029 Retry#0
java.lang.NullPointerException
  at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.sortLocatedBlocks(DatanodeManager.java:359)
  at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1774)
  at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:527)
  at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:85)
  at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:356)
  at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
  at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

So we could see that our rack awareness script was returning no
value, which is what led to the NullPointerException in
sortLocatedBlocks. We then modified the script to log the arguments
it was being called with. We found a list of IPs: some belonged to
hosts running services like Oozie, others to our gateway server.
However, none of these IPs were the datanodes themselves.
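For illustration, here is a minimal sketch of that kind of debugging
change (the log path and the placeholder lookup are made up, not our
real script):

    #!/bin/bash
    # getRackID.sh (debug version): record every argument the NameNode
    # passes in before doing the normal rack lookup.
    echo "$(date '+%F %T') called with: $*" >> /tmp/getRackID-args.log

    # The normal IP -> rack ID lookup would follow here; a placeholder:
    for host in "$@"; do
      echo "/rack-a"
    done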

The symptoms of this issue were that we could not cat files on HDFS
even from the NameNode host itself, requests to move files on the
history server failed, and so on.
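For example, even a plain read issued from the NameNode host failed
(hypothetical path; the NPE above surfaced to the client as a remote
exception):

    hdfs dfs -cat /user/example/some-file.txt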

From my understanding of rack awareness, we only need to provide a
rack ID for hosts that are datanodes. All our datanodes were listed
in the script's mapping, yet the IPs being passed in belonged to
non-datanodes.
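One way to see the contract (assuming standard ScriptBasedMapping
behavior) is to run the script by hand the way the NameNode does: it
passes one or more IPs or hostnames as arguments and expects exactly
one rack ID line per argument, which is what the "returned 0 values
when 1 were expected" error refers to. The second IP below is made up
for illustration:

    # The NameNode may batch several addresses into a single call.
    /etc/hadoop/conf/getRackID.sh 10.51.28.100 10.51.30.5
    # Expected output: one rack ID per argument, e.g.
    # /rack-a
    # /rack-b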

The solution was to have the rack awareness script return a default
rack ID for any IP it does not recognize. This is not made clear in
the rack awareness docs, and the failure effectively caused a denial
of service on our Hadoop services.
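A minimal sketch of the fixed script (the rack names and address
patterns are placeholders for our real mapping):

    #!/bin/bash
    # getRackID.sh (fixed): print exactly one rack ID per argument,
    # falling back to a default rack for any address we don't
    # recognize, instead of printing nothing.
    DEFAULT_RACK="/default-rack"
    for host in "$@"; do
      case "$host" in
        10.51.40.*) echo "/rack-a" ;;        # placeholder datanode subnet
        10.51.41.*) echo "/rack-b" ;;        # placeholder datanode subnet
        *)          echo "$DEFAULT_RACK" ;;  # unknown host: default rack
      esac
    done

/default-rack is the value Hadoop itself uses when no topology script
is configured, so it is a natural fallback.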

But I want to know why the rack awareness script is being called
with IPs of non-datanodes by our Hadoop NameNode. Is this a design
feature of the YARN libraries? Why do non-datanode IPs need a rack
ID?
