Did you set replication to 1? The following error message indicates that the default replication is set to 1:

    could only be replicated to 0 nodes, instead of 1

In that case, losing a datanode would mean blocks will be lost.
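To check, the relevant knob is dfs.replication. A minimal sketch of what to look for and how to fix it, assuming a stock Hadoop 1.x setup (the value 3 and the /hbase path are illustrative, not taken from your cluster):

    <!-- hdfs-site.xml: replication factor applied to newly created files -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>  <!-- a value of 1 means one lost datanode loses blocks -->
    </property>

    # Existing files keep the factor they were written with,
    # so re-replicate them explicitly:
    hadoop fs -setrep -R 3 /hbase

Raising dfs.replication only affects files created afterwards, which is why the -setrep pass over the existing HBase data is needed as well.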
Enis


On Fri, Apr 25, 2014 at 1:32 AM, Álvaro Recuero <algar...@gmail.com> wrote:
> Data nodes are fine. Actually, the region server on that serverxxxxx is the
> only one dead afterwards. The datanode is up, and HDFS is reporting a
> healthy status. Interesting that that is even possible.
>
> I have consistently come across the problem again while testing a new
> HBase cluster, so yes, I would bet the problem is in HDFS somehow.
> Probably something is missing, yes.
>
> 2014-04-24 17:59:30,003 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block null bad datanode[0] nodes == null
> 2014-04-24 17:59:30,003 WARN org.apache.hadoop.hdfs.DFSClient: Could not
> get block locations. Source file
> "/hbase/.logs/serverxxxxx,1398350408274/serverxxxxx%2C60020%2C1398350408274.1398350409004"
> - Aborting...
> 2014-04-24 17:59:30,003 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog:
> syncer encountered error, will retry. txid=1
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
> /hbase/.logs/serverxxxxx,60020,1398350408274/serverxxxxx%2C60020%2C1398350408274.1398350409004
> could only be replicated to 0 nodes, instead of 1
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:416)
>
>
> On 5 April 2014 21:58, Álvaro Recuero <algar...@gmail.com> wrote:
>
> > Yes, Esteban, I have checked the health of the datanodes from the master
> > in the Hadoop console. Nothing seems wrong enough to cause this, even
> > though one datanode is apparently lost along with the RS in the process
> > of inserting 50 million updates... the other 11 are there, up and
> > running, so it should pick up the next one and that is it (as long as
> > it is replicating as it should through the HDFS pipelining process). I
> > thought of HBase write-key hotspotting or some problem in the Hadoop
> > namenode, so I am checking that now...
> >
> > I will keep investigating and let you know. In fact, my first thought
> > was the same as yours too, but ./hadoop fsck / is showing that all
> > "active" nodes are healthy and no file-system-level inconsistencies are
> > detected (the first thing I checked before sending the post). Of
> > course, running the HBase hbck consistency check from the command line
> > behaves differently: it misses the mentioned RS and throws the
> > corresponding exception in the log... that is a weird one then... I
> > might check the namenode before I get back to you on this. I can't
> > think of anything else as of now. Space is not unlimited, yet
> > sufficient on each of the 12 datanodes, though it is getting close to
> > its limit on the node with the mentioned dead RS; so yes, writes are
> > not very balanced yet, but that is definitely not the issue as I
> > understand it.
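As an aside, the two checks described above map to commands along these lines (a sketch; exact output varies by version, and the fsck flags here are the common Hadoop 1.x ones):

    # HDFS view: block-level health, per-file replication, missing/corrupt blocks
    hadoop fsck / -files -blocks -locations

    # HBase view: region consistency and assignment; this is the check that
    # noticed the missing region server
    hbase hbck

A HEALTHY fsck alongside hbck complaints fits the symptoms in this thread: the blocks already on disk are fine, but new blocks (the WAL and recovered.edits files) cannot find a datanode to land on.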
> >
> > On 5 April 2014 19:16, Esteban Gutierrez <este...@cloudera.com> wrote:
> >
> >> Álvaro,
> >>
> >> Have you checked the health of HDFS? Maybe your cluster ran out of
> >> space, or you don't have datanodes running.
> >>
> >> Esteban
> >>
> >> > On Apr 5, 2014, at 10:11, haosdent <haosd...@gmail.com> wrote:
> >> >
> >> > From the log information, it seems you lost blocks.
> >> > On 2014-4-6 at 12:38 AM, "Álvaro Recuero" <algar...@gmail.com> wrote:
> >> >
> >> >> Has anyone come across this before? There is still space on the RS,
> >> >> and this is not a problem of datanode availability, as I can
> >> >> confirm. Cheers.
> >> >>
> >> >> 2014-04-05 09:55:19,210 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter:
> >> >> using new createWriter -- HADOOP-6840
> >> >> 2014-04-05 09:55:19,211 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter:
> >> >> Path=hdfs://taurus-5.lyon.grid5000.fr:9000/hbase/usertable/fc55e2d2d4bcec49d6fedf5a469353b9/recovered.edits/0000000000002550928.temp,
> >> >> syncFs=true, hflush=false, compression=false
> >> >> 2014-04-05 09:55:19,211 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer
> >> >> path=hdfs://taurus-5.lyon.grid5000.fr:9000/hbase/usertable/fc55e2d2d4bcec49d6fedf5a469353b9/recovered.edits/0000000000002550928.temp
> >> >> region=fc55e2d2d4bcec49d6fedf5a469353b9
> >> >> 2014-04-05 09:55:19,233 DEBUG org.apache.hadoop.hbase.regionserver.SplitLogWorker:
> >> >> tasks arrived or departed
> >> >> 2014-04-05 09:55:19,233 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
> >> >> Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
> >> >> /hbase/usertable/237859a0b1e47c86c25a6123506ccb2a/recovered.edits/0000000000002550921.temp
> >> >> could only be replicated to 0 nodes, instead of 1
> >> >>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558)
> >> >>         at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
> >> >>         at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >>         at java.lang.reflect.Method.invoke(Method.java:616)
> >> >>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
> >> >>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
> >> >>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >>         at javax.security.auth.Subject.doAs(Subject.java:416)
> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> >> >>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
> >> >>
> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1070)
> >> >>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
> >> >>         at sun.proxy.$Proxy9.addBlock(Unknown Source)
> >> >>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> >>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >>         at java.lang.reflect.Method.invoke(Method.java:616)
> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> >> >>         at sun.proxy.$Proxy9.addBlock(Unknown Source)
> >> >>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3510)
> >> >>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3373)
> >> >>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2589)
> >> >>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2829)
> >> >>
> >> >> 2014-04-05 09:55:19,233 WARN org.apache.hadoop.hdfs.DFSClient: Error
> >> >> Recovery for block null bad datanode[0] nodes == null
> >> >> 2014-04-05 09:55:19,233 WARN org.apache.hadoop.hdfs.DFSClient: Could
> >> >> not get block locations. Source file
> >> >> "/hbase/usertable/237859a0b1e47c86c25a6123506ccb2a/recovered.edits/0000000000002550921.temp"
> >> >> - Aborting...
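For completeness: "could only be replicated to 0 nodes" is raised by the namenode when no registered datanode is eligible to receive the new block, whether because none are live or because the live ones have no usable space left. A quick way to see the namenode's view, assuming Hadoop 1.x command names (later releases spell it hdfs dfsadmin):

    # Live/dead datanode counts plus configured, used, and remaining
    # capacity per node, as the namenode sees them
    hadoop dfsadmin -report

    # A node that reports as live but with near-zero "DFS Remaining" is
    # excluded from write pipelines even while fsck still reports HEALTHY

Given the note above that space was getting close to its limit on the node with the dead region server, the remaining-capacity figures are the first thing worth checking.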