Hey Dan:

On Tue, Jun 1, 2010 at 2:57 AM, Dan Harvey <[email protected]> wrote:
> In what cases would a datanode failure (for example running out of
> memory in our case) cause HBase data loss?

We should just move past the damaged DN on to the other replicas, but there
are probably places where we can get hung up. Out of interest, are you
running with hdfs-630 in place?

> Would it mostly only cause data loss to the meta regions or does it
> also cause problems with the actual region files?
>

HDFS files that had their blocks located on the damaged DN would be
susceptible (meta files are just like any other).

St.Ack

>> On Mon, May 24, 2010 at 2:39 PM, Dan Harvey <[email protected]> wrote:
>>> Hi,
>>>
>>> Sorry for the multiple e-mails, it seems gmail didn't send my whole
>>> message last time! Anyway, here it goes again...
>>>
>>> Whilst loading data via a mapreduce job into HBase I have started getting
>>> this error:
>>>
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>>> contact region server Some server, retryOnlyOne=true, index=0,
>>> islastrow=false, tries=9, numtries=10, i=0, listsize=19,
>>> region=source_documents,ipubmed\x219915054,1274525958679 for region
>>> source_documents,ipubmed\x219915054,1274525958679, row 'u1012913162',
>>> but failed after 10 attempts.
>>> Exceptions:
>>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1166)
>>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
>>>   at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
>>>
>>> In the master there are the following three regions:
>>>
>>> source_documents,ipubmed\x219859228,1274701893687  hadoop1  1825870642  ipubmed\x219859228  ipubmed\x219915054
>>> source_documents,ipubmed\x219915054,1274525958679  hadoop4  193393334   ipubmed\x219915054  u102193588
>>> source_documents,u102193588,1274486550122          hadoop4  2141795358  u102193588          u105043522
>>>
>>> and on one of our 5 nodes I found a region which starts with
>>>
>>> ipubmed\x219915054 and ends with u102002564
>>>
>>> and on another I found the other half of the split, which starts with
>>>
>>> u102002564 and ends with u102193588
>>>
>>> So it seems that the middle region listed in the master was split, but
>>> the split never reached the master.
>>>
>>> We've had a few problems over the last few days with hdfs nodes failing
>>> due to lack of memory. That has now been fixed, but it could have been a
>>> cause of this problem.
>>>
>>> In what ways can a split fail to be received by the master, and how long
>>> would it take for hbase to fix this? I've read that it will periodically
>>> scan the META table to find problems like this, but not how often. It has
>>> been about 12h here and our cluster doesn't appear to have fixed this
>>> missing split. Is there a way to force the master to rescan the META
>>> table? Will it fix problems like this given time?
>>>
>>> Thanks,
>>>
>>> --
>>> Dan Harvey | Datamining Engineer
>>> www.mendeley.com/profiles/dan-harvey
>>>
>>> Mendeley Limited | London, UK | www.mendeley.com
>>> Registered in England and Wales | Company Number 6419015
>>>
>>
>
> --
> Dan Harvey | Datamining Engineer
> www.mendeley.com/profiles/dan-harvey
>
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company Number 6419015
>
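P.S. For what it's worth, one way to see which HDFS files actually had blocks
on the damaged DN is Hadoop's fsck tool. A rough sketch, assuming the default
HBase root dir of /hbase (adjust the path to your hbase.rootdir):

    hadoop fsck /hbase -files -blocks -locations

Any files reported as CORRUPT or with MISSING blocks in that output are the
ones at risk; files whose blocks still have healthy replicas elsewhere should
be served fine once the client moves past the dead node.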
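P.P.S. To see what the master currently has registered for the table (and spot
the hole where the unreported split daughter should be), you can scan the
catalog table from the HBase shell. A sketch assuming a 0.20-era shell, where
the catalog table is named '.META.':

    hbase shell
    hbase> scan '.META.', {COLUMNS => ['info:regioninfo', 'info:server']}

Comparing the start/end keys in info:regioninfo for source_documents against
the regions you found on disk should show exactly which daughter regions are
missing from META.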
