On Wed, Jun 2, 2010 at 2:13 AM, Dan Harvey <[email protected]> wrote:
> Yes, we're running Cloudera CDH2, which I've just checked includes a
> back-ported hdfs-630 patch.
>
> I guess a lot of these issues will be gone once Hadoop 0.21 is out and
> HBase can take advantage of the new features.
>
That's the hope. A bunch of fixes have gone in for HBase-provoked HDFS
issues in 0.20. Look out for the append branch in HDFS 0.20 coming soon
(it'll be here: http://svn.apache.org/viewvc/hadoop/common/branches/).
It'll be a 0.20 branch with support for append (hdfs-200, hdfs-142,
etc.) and other fixes needed by HBase. That's what the next major HBase
release will ship against (CDH3 will include this stuff and then some,
if I understand Todd+crew's plans correctly).

Good on you Dan,
St.Ack

> Thanks,
>
> On 2 June 2010 01:10, Stack <[email protected]> wrote:
>> Hey Dan:
>>
>> On Tue, Jun 1, 2010 at 2:57 AM, Dan Harvey <[email protected]> wrote:
>>> In what cases would a datanode failure (for example, running out of
>>> memory in our case) cause HBase data loss?
>>
>> We should just move past the damaged DN on to the other replicas, but
>> there are probably places where we can get hung up. Out of interest,
>> are you running with hdfs-630 in place?
>>
>>> Would it mostly only cause data loss to the meta regions, or does it
>>> also cause problems with the actual region files?
>>>
>>
>> HDFS files that had their blocks located on the damaged DN would be
>> susceptible (meta files are just like any other).
>>
>> St.Ack
>>
>>>> On Mon, May 24, 2010 at 2:39 PM, Dan Harvey <[email protected]>
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> Sorry for the multiple e-mails; it seems Gmail didn't send my
>>>>> whole message last time! Anyway, here it goes again...
>>>>>
>>>>> Whilst loading data via a MapReduce job into HBase, I have started
>>>>> getting this error:
>>>>>
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>>>>> contact region server Some server, retryOnlyOne=true, index=0,
>>>>> islastrow=false, tries=9, numtries=10, i=0, listsize=19,
>>>>> region=source_documents,ipubmed\x219915054,1274525958679 for region
>>>>> source_documents,ipubmed\x219915054,1274525958679, row 'u1012913162',
>>>>> but failed after 10 attempts.
>>>>> Exceptions:
>>>>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1166)
>>>>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
>>>>> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
>>>>>
>>>>> The master lists the following three regions:
>>>>>
>>>>> source_documents,ipubmed\x219859228,1274701893687  hadoop1
>>>>>   1825870642  ipubmed\x219859228  ipubmed\x219915054
>>>>> source_documents,ipubmed\x219915054,1274525958679  hadoop4
>>>>>   193393334   ipubmed\x219915054  u102193588
>>>>> source_documents,u102193588,1274486550122  hadoop4
>>>>>   2141795358  u102193588  u105043522
>>>>>
>>>>> On one of our 5 nodes I found a region which starts with
>>>>>
>>>>> ipubmed\x219915054 and ends with u102002564
>>>>>
>>>>> and on another I found the other half of the split, which starts
>>>>> with
>>>>>
>>>>> u102002564 and ends with u102193588
>>>>>
>>>>> So it seems the middle region listed on the master was split, but
>>>>> the split failed to reach the master.
>>>>>
>>>>> We've had a few problems over the last few days with HDFS nodes
>>>>> failing due to lack of memory. That has now been fixed, but it
>>>>> could have been a cause of this problem.
>>>>>
>>>>> In what ways can a split fail to be received by the master, and
>>>>> how long would it take HBase to fix this? I've read that it
>>>>> periodically scans the META table to find problems like this, but
>>>>> not how often.
>>>>> It has been about 12 hours here and our cluster doesn't appear to
>>>>> have fixed this missing split. Is there a way to force the master
>>>>> to rescan the META table? Will it fix problems like this given
>>>>> time?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> Dan Harvey | Datamining Engineer
>>>>> www.mendeley.com/profiles/dan-harvey
>>>>>
>>>>> Mendeley Limited | London, UK | www.mendeley.com
>>>>> Registered in England and Wales | Company Number 6419015
>>>>>
>>>>
>>>
>>> --
>>> Dan Harvey | Datamining Engineer
>>> www.mendeley.com/profiles/dan-harvey
>>>
>>> Mendeley Limited | London, UK | www.mendeley.com
>>> Registered in England and Wales | Company Number 6419015
>>>
>>
>
> --
> Dan Harvey | Datamining Engineer
> www.mendeley.com/profiles/dan-harvey
>
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company Number 6419015
>
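
For anyone else chasing this: there is no switch I know of in 0.20 to
force the master to rescan .META. on demand, but you can confirm a hole
like the one above yourself by scanning .META. and checking that,
within each table, one region's end key matches the next region's start
key. Below is a minimal sketch against the 0.20-era Java client; the
MetaHoleCheck class name is mine, and the exact HConstants/Writables
calls are from memory, so treat it as a starting point rather than a
verified tool.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Writables;

public class MetaHoleCheck {
  public static void main(String[] args) throws Exception {
    HTable meta = new HTable(new HBaseConfiguration(),
        HConstants.META_TABLE_NAME);
    Scan scan = new Scan();
    scan.addFamily(HConstants.CATALOG_FAMILY);
    ResultScanner scanner = meta.getScanner(scan);
    String prevTable = null;
    byte[] prevEndKey = null;
    try {
      for (Result r : scanner) {
        byte[] bytes = r.getValue(HConstants.CATALOG_FAMILY,
            HConstants.REGIONINFO_QUALIFIER);
        if (bytes == null) continue;
        HRegionInfo info = Writables.getHRegionInfo(bytes);
        // Skip split parents lingering in .META.; only the live
        // regions should form a contiguous key range per table.
        if (info.isOffline() || info.isSplit()) continue;
        String table = info.getTableDesc().getNameAsString();
        if (table.equals(prevTable) && prevEndKey != null
            && !Bytes.equals(prevEndKey, info.getStartKey())) {
          System.out.println("Hole in " + table + " between "
              + Bytes.toStringBinary(prevEndKey) + " and "
              + Bytes.toStringBinary(info.getStartKey()));
        }
        prevTable = table;
        prevEndKey = info.getEndKey();
      }
    } finally {
      scanner.close();
    }
  }
}

In a half-finished split the parent row in .META. is usually already
marked offline, so a scan like this should report the gap (here,
between ipubmed\x219915054 and u102193588) that the daughter regions
were meant to fill.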

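On the RetriesExhaustedException itself: the numtries=10 in the message
is the default value of hbase.client.retries.number, so while the
cluster is flaky you can give a load job more headroom by raising the
retry count and the pause between attempts on the client side. A
sketch, assuming the stock 0.20 client properties (the PatientLoader
name and the chosen values are mine):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class PatientLoader {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // Default is 10 retries (hence "numtries=10" above); give the
    // client more headroom while regionservers are flapping.
    conf.setInt("hbase.client.retries.number", 20);
    // Milliseconds to sleep between attempts (default 1000).
    conf.setLong("hbase.client.pause", 2000);
    HTable table = new HTable(conf, Bytes.toBytes("source_documents"));
    table.setAutoFlush(false); // buffer puts; flushCommits() sends them
    // ... do puts here, then:
    table.flushCommits();
  }
}

This only buys time, though: puts aimed at a region that no server is
carrying will keep failing until the split's .META. entries are
repaired.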