Thanks for your response. I checked the namenode logs and found the following:
2013-08-28 15:25:24,025 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: recover lease [Lease. Holder: DFSClient_hb_rs_smartdeals-hbase14-snc1.snc1,60020,1377700014053_-346895658_25, pendingcreates: 1], src=/hbase/.logs/smartdeals-hbase14-snc1.snc1,60020,1377700014053-splitting/smartdeals-hbase14-snc1.snc1%2C60020%2C1377700014053.1377700015413 from client DFSClient_hb_rs_smartdeals-hbase14-snc1.snc1,60020,1377700014053_-346895658_25
2013-08-28 15:25:24,025 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_hb_rs_smartdeals-hbase14-snc1.snc1,60020,1377700014053_-346895658_25, pendingcreates: 1], src=/hbase/.logs/smartdeals-hbase14-snc1.snc1,60020,1377700014053-splitting/smartdeals-hbase14-snc1.snc1%2C60020%2C1377700014053.1377700015413
2013-08-28 15:25:24,025 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file closed.

There are LeaseException errors on the namenode as well: http://pastebin.com/4feVcL1F

Not sure why it's happening. I don't think I am hitting any timeouts, as my jobs fail within a couple of minutes, while all my timeouts are 10 minutes+. Not sure why the above would happen.

Ameya

On Wed, Aug 28, 2013 at 9:00 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> From the log you posted on pastebin, I see the following.
> Can you check the namenode log to see what went wrong?
>
> 1. Caused by:
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
> /hbase/.logs/smartdeals-hbase14-snc1.snc1,60020,1376944419197/smartdeals-hbase14-snc1.snc1%2C60020%2C1376944419197.1377699297514
> File does not exist. [Lease.
> Holder:
> DFSClient_hb_rs_smartdeals-hbase14-snc1.snc1,60020,1376944419197_-413917755_25,
> pendingcreates: 1]
>
> On Wed, Aug 28, 2013 at 8:00 AM, Ameya Kanitkar <am...@groupon.com> wrote:
>
> > Hi All,
> >
> > We have a very heavy MapReduce job that goes over an entire HBase table
> > with 1TB+ of data and exports all of it (similar to the Export job, but
> > with some additional custom code built in) to HDFS.
> >
> > However, this job is not very stable, and oftentimes we get the
> > following error and the job fails:
> >
> > org.apache.hadoop.hbase.regionserver.LeaseException:
> > org.apache.hadoop.hbase.regionserver.LeaseException: lease
> > '-4456594242606811626' does not exist
> >     at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2429)
> >     at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1400)
> >
> >     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> >     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >     at java.lang.reflect.Constructor.newInstance(Constructor.
> >
> > Here are more detailed logs on the RS: http://pastebin.com/xaHF4ksb
> >
> > We have changed the following settings in HBase to counter this
> > problem, but the issue persists:
> >
> > <property>
> >   <!-- Loaded from hbase-site.xml -->
> >   <name>hbase.regionserver.lease.period</name>
> >   <value>900000</value>
> > </property>
> >
> > <property>
> >   <!-- Loaded from hbase-site.xml -->
> >   <name>hbase.rpc.timeout</name>
> >   <value>900000</value>
> > </property>
> >
> > We also reduced the number of mappers per RS to less than the available
> > CPUs on the box.
> >
> > We also observed that once the problem happens, it happens multiple
> > times on the same RS. All other regions are unaffected. But a different
> > RS observes this problem on different days. There is no particular
> > region causing this either.
> >
> > We are running 0.94.2 with cdh4.2.0.
> >
> > Any ideas?
> >
> > Ameya
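For what it's worth: a scanner LeaseException like the one above typically means the client spent longer than hbase.regionserver.lease.period between two successive scanner next() calls, so the region server expired the scanner lease. One thing worth checking, purely as a guess from the symptoms described (raising the lease period did not help, and the failures cluster on one RS at a time), is the scan caching used by the export job. If caching is high, each next() RPC returns a large batch that takes a long time to process before the next RPC renews the lease. A sketch of the relevant client-side setting (the property name is the standard 0.94-era one; the value of 100 is an illustrative starting point, not a tested recommendation):

```xml
<!-- hbase-site.xml (client side) -- a sketch, not tuned for any
     particular cluster. Lower caching means each scanner next() RPC
     fetches fewer rows, so the region server's scanner lease is
     renewed more frequently during a heavy full-table scan. -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>100</value>
</property>
```

The same value can also be set per-job on the Scan object passed to TableMapReduceUtil, which avoids changing cluster-wide defaults.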