Thanks for your response.

I checked the namenode logs and found the following:

2013-08-28 15:25:24,025 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: recover lease [Lease.  Holder: DFSClient_hb_rs_smartdeals-hbase14-snc1.snc1,60020,1377700014053_-346895658_25, pendingcreates: 1], src=/hbase/.logs/smartdeals-hbase14-snc1.snc1,60020,1377700014053-splitting/smartdeals-hbase14-snc1.snc1%2C60020%2C1377700014053.1377700015413 from client DFSClient_hb_rs_smartdeals-hbase14-snc1.snc1,60020,1377700014053_-346895658_25
2013-08-28 15:25:24,025 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease.  Holder: DFSClient_hb_rs_smartdeals-hbase14-snc1.snc1,60020,1377700014053_-346895658_25, pendingcreates: 1], src=/hbase/.logs/smartdeals-hbase14-snc1.snc1,60020,1377700014053-splitting/smartdeals-hbase14-snc1.snc1%2C60020%2C1377700014053.1377700015413
2013-08-28 15:25:24,025 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file closed.

There are LeaseException errors on the namenode as well: http://pastebin.com/4feVcL1F
Not sure why that is happening.

I do not think I am running into any timeouts, as my jobs fail within a
couple of minutes, while all my timeouts are 10+ minutes. Not sure why
the above would happen.
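
If it is the scanner lease on the RS expiring, one rough way to confirm
would be to log the gap between consecutive map() calls, since with scan
caching the regionserver only sees a next() call once per client-side
cache refill. A sketch, assuming a standard TableMapper-based export
(the class name and threshold here are illustrative, not our actual code):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

// Hypothetical diagnostic mapper: flags any gap between map() calls that
// could approach the 900000 ms scanner lease period.
public class GapLoggingMapper
    extends TableMapper<ImmutableBytesWritable, Result> {

  private long lastCall = -1L;

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    long now = System.currentTimeMillis();
    if (lastCall > 0 && now - lastCall > 60000L) {
      System.err.println("Gap between map() calls: " + (now - lastCall) + " ms");
    }
    lastCall = now;
    context.write(row, value);  // stand-in for the real export logic
  }
}

If those gaps stay in the seconds, the RS-side scanner lease should never
expire, which would point somewhere else.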

Ameya



On Wed, Aug 28, 2013 at 9:00 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> From the log you posted on pastebin, I see the following.
> Can you check the namenode log to see what went wrong?
>
> Caused by: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/smartdeals-hbase14-snc1.snc1,60020,1376944419197/smartdeals-hbase14-snc1.snc1%2C60020%2C1376944419197.1377699297514 File does not exist. [Lease.  Holder: DFSClient_hb_rs_smartdeals-hbase14-snc1.snc1,60020,1376944419197_-413917755_25, pendingcreates: 1]
>
>
>
> On Wed, Aug 28, 2013 at 8:00 AM, Ameya Kanitkar <am...@groupon.com> wrote:
>
> > Hi All,
> >
> > We have a very heavy MapReduce job that goes over an entire table with
> > 1TB+ of data in HBase and exports all data (similar to the Export job,
> > but with some additional custom code built in) to HDFS.
> >
> > However, this job is not very stable, and oftentimes we get the
> > following error and the job fails:
> >
> > org.apache.hadoop.hbase.regionserver.LeaseException: org.apache.hadoop.hbase.regionserver.LeaseException: lease '-4456594242606811626' does not exist
> >         at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2429)
> >         at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1400)
> >
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >         at java.lang.reflect.Constructor.newInstance(Constructor.
> >
> > Here are more detailed logs on the RS: http://pastebin.com/xaHF4ksb
> >
> > We have changed the following settings in HBase to counter this
> > problem, but the issue persists:
> >
> > <property>
> > <!-- Loaded from hbase-site.xml -->
> > <name>hbase.regionserver.lease.period</name>
> > <value>900000</value>
> > </property>
> >
> > <property>
> > <!-- Loaded from hbase-site.xml -->
> > <name>hbase.rpc.timeout</name>
> > <value>900000</value>
> > </property>
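> >
> > For what it's worth, here is a minimal sketch of the scan setup in the
> > job driver (assuming the job uses TableMapReduceUtil the way the stock
> > Export does; the job, table, and mapper names are illustrative, not
> > our actual code). Lowering scan caching makes each next() RPC cover
> > fewer rows, so the scanner lease gets renewed more often:
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> > import org.apache.hadoop.hbase.client.Result;
> > import org.apache.hadoop.hbase.client.Scan;
> > import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> > import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
> > import org.apache.hadoop.mapreduce.Job;
> >
> > Configuration conf = HBaseConfiguration.create();
> > Job job = new Job(conf, "heavy-export");  // illustrative job name
> > Scan scan = new Scan();
> > scan.setCaching(100);        // fewer rows per next() RPC => lease renewed sooner
> > scan.setCacheBlocks(false);  // a full scan should not churn the block cache
> > TableMapReduceUtil.initTableMapperJob("my_table", scan,
> >     MyExportMapper.class, ImmutableBytesWritable.class, Result.class, job);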
> >
> >
> > We also reduced the number of mappers per RS to below the number of
> > available CPUs on the box.
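> >
> > (Assuming MRv1 with TaskTrackers colocated with the RSs, that cap is a
> > per-node map slot limit along these lines; the value is illustrative:)
> >
> > <property>
> > <name>mapred.tasktracker.map.tasks.maximum</name>
> > <value>8</value>
> > </property>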
> >
> > We also observed that once the problem happens, it happens multiple
> > times on the same RS. All other regions are unaffected, but a
> > different RS observes the problem on different days. There is no
> > particular region causing this either.
> >
> > We are running HBase 0.94.2 on CDH 4.2.0.
> >
> > Any ideas?
> >
> >
> > Ameya
> >
>
