Yes Esteban, I have checked the health of the datanodes from the master in the Hadoop console. Nothing looks wrong enough to cause this, even though one datanode was apparently lost along with the RS while inserting 50 million updates. The other 11 are up and running, so the cluster should pick up from there and carry on (as long as it is replicating as it should through the HDFS pipelining process). I also thought of HBase write-key hotspotting or some problem in the Hadoop namenode, so I am checking that now.
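For reference, the checks I'm running look roughly like this (a sketch; the exact binaries and paths depend on how HADOOP_HOME/HBASE_HOME are laid out on the install, and on this Hadoop 1.x-era release the `dfsadmin` subcommand still lives under `hadoop` rather than `hdfs`):

```shell
# Ask the namenode for live/dead datanodes and per-node capacity/usage.
./hadoop dfsadmin -report

# File-system-level consistency check: missing, corrupt, or
# under-replicated blocks across /.
./hadoop fsck / -blocks -locations

# HBase-level consistency check: region assignments vs. META and
# what is actually on HDFS.
./hbase hbck
```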
I will keep investigating and let you know. In fact, my first thought was the same as yours, but ./hadoop fsck / shows that all "active" nodes are healthy and detects no file-system-level inconsistencies (the first thing I checked before sending the post). Running the HBase hbck consistency check from the command line behaves differently, though: it reports the mentioned RS as missing and throws the corresponding exception in the log, which is the weird part. I might check the namenode before I get back to you on this; I can't think of anything else for now. Space is not unlimited, but it is sufficient on each of the 12 datanodes. It was getting close to its limit on the mentioned dead RS, so writes are not very well balanced yet, but as I understand it that is definitely not the issue.

On 5 April 2014 19:16, Esteban Gutierrez <este...@cloudera.com> wrote:

> Álvaro,
>
> Have you checked the health of HDFS? Maybe your cluster ran out of
> space or you don't have data nodes running.
>
> Esteban
>
>
> On Apr 5, 2014, at 10:11, haosdent <haosd...@gmail.com> wrote:
> >
> > From the log information, it seems you lost blocks.
> > On 2014-4-6, 00:38, "Álvaro Recuero" <algar...@gmail.com> wrote:
> >
> >> Has anyone come across this before? There is still space in the RS,
> >> and this is not a problem of datanode availability, as I can confirm.
> >> cheers
> >>
> >> 2014-04-05 09:55:19,210 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: using new createWriter -- HADOOP-6840
> >> 2014-04-05 09:55:19,211 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://taurus-5.lyon.grid5000.fr:9000/hbase/usertable/fc55e2d2d4bcec49d6fedf5a469353b9/recovered.edits/0000000000002550928.temp, syncFs=true, hflush=false, compression=false
> >> 2014-04-05 09:55:19,211 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer path=hdfs://taurus-5.lyon.grid5000.fr:9000/hbase/usertable/fc55e2d2d4bcec49d6fedf5a469353b9/recovered.edits/0000000000002550928.temp region=fc55e2d2d4bcec49d6fedf5a469353b9
> >> 2014-04-05 09:55:19,233 DEBUG org.apache.hadoop.hbase.regionserver.SplitLogWorker: tasks arrived or departed
> >> 2014-04-05 09:55:19,233 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /hbase/usertable/237859a0b1e47c86c25a6123506ccb2a/recovered.edits/0000000000002550921.temp could only be replicated to 0 nodes, instead of 1
> >>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558)
> >>         at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
> >>         at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>         at java.lang.reflect.Method.invoke(Method.java:616)
> >>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
> >>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
> >>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
> >>         at java.security.AccessController.doPrivileged(Native Method)
> >>         at javax.security.auth.Subject.doAs(Subject.java:416)
> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> >>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
> >>
> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1070)
> >>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
> >>         at sun.proxy.$Proxy9.addBlock(Unknown Source)
> >>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>         at java.lang.reflect.Method.invoke(Method.java:616)
> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> >>         at sun.proxy.$Proxy9.addBlock(Unknown Source)
> >>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3510)
> >>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3373)
> >>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2589)
> >>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2829)
> >>
> >> 2014-04-05 09:55:19,233 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
> >> 2014-04-05 09:55:19,233 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/usertable/237859a0b1e47c86c25a6123506ccb2a/recovered.edits/0000000000002550921.temp" - Aborting...
> >>