LINTE created HDFS-8897:
---------------------------

             Summary: Loadbalancer
                 Key: HDFS-8897
                 URL: https://issues.apache.org/jira/browse/HDFS-8897
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: balancer & mover
    Affects Versions: 2.7.1
         Environment: Centos 6.6
            Reporter: LINTE

When the balancer is launched, it should test whether a /system/balancer.id file already exists in HDFS.
Even when the file doesn't exist, the balancer refuses to run:

15/08/14 16:35:12 INFO balancer.Balancer: namenodes  = [hdfs://sandbox/, hdfs://sandbox]
15/08/14 16:35:12 INFO balancer.Balancer: parameters = Balancer.Parameters[BalancingPolicy.Node, threshold=10.0, max idle iteration = 5, number of nodes to be excluded = 0, number of nodes to be included = 0]
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
15/08/14 16:35:14 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
15/08/14 16:35:14 INFO block.BlockTokenSecretManager: Setting block keys
15/08/14 16:35:14 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
15/08/14 16:35:14 INFO block.BlockTokenSecretManager: Setting block keys
15/08/14 16:35:14 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
15/08/14 16:35:14 INFO block.BlockTokenSecretManager: Setting block keys
15/08/14 16:35:14 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
java.io.IOException: Another Balancer is running..  Exiting ...
Aug 14, 2015 4:35:14 PM  Balancing took 2.408 seconds

The audit log recorded while trying to run the balancer shows that the balancer creates /system/balancer.id and then deletes it on exit:

2015-08-14 16:37:45,844 INFO FSNamesystem.audit: allowed=true   ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS)   ip=/x.x.x.x   cmd=getfileinfo   src=/system/balancer.id   dst=null   perm=null   proto=rpc
2015-08-14 16:37:45,900 INFO FSNamesystem.audit: allowed=true   ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS)   ip=/x.x.x.x   cmd=create   src=/system/balancer.id   dst=null   perm=hdfs:hadoop:rw-r-----   proto=rpc
2015-08-14 16:37:45,919 INFO FSNamesystem.audit: allowed=true   ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS)   ip=/x.x.x.x   cmd=getfileinfo   src=/system/balancer.id   dst=null   perm=null   proto=rpc
2015-08-14 16:37:46,090 INFO FSNamesystem.audit: allowed=true   ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS)   ip=/x.x.x.x   cmd=getfileinfo   src=/system/balancer.id   dst=null   perm=null   proto=rpc
2015-08-14 16:37:46,112 INFO FSNamesystem.audit: allowed=true   ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS)   ip=/x.x.x.x   cmd=getfileinfo   src=/system/balancer.id   dst=null   perm=null   proto=rpc
2015-08-14 16:37:46,117 INFO FSNamesystem.audit: allowed=true   ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS)   ip=/x.x.x.x   cmd=delete   src=/system/balancer.id   dst=null   perm=null   proto=rpc

The error seems to be located in org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java.

The function checkAndMarkRunning returns null even if /system/balancer.id doesn't exist before entering the function; if the file does exist, it is deleted and the balancer exits with the same error.

----
  private OutputStream checkAndMarkRunning() throws IOException {
    try {
      if (fs.exists(idPath)) {
        // try appending to it so that it will fail fast if another balancer is
        // running.
        IOUtils.closeStream(fs.append(idPath));
        fs.delete(idPath, true);
      }
      final FSDataOutputStream fsout = fs.create(idPath, false);
      // mark balancer idPath to be deleted during filesystem closure
      fs.deleteOnExit(idPath);
      if (write2IdFile) {
        fsout.writeBytes(InetAddress.getLocalHost().getHostName());
        fsout.hflush();
      }
      return fsout;
    } catch(RemoteException e) {
      if(AlreadyBeingCreatedException.class.getName().equals(e.getClassName())){
        return null;
      } else {
        throw e;
      }
    }
  }
----

Regards



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)