Any ideas? Anyone?
On Wed, Aug 28, 2013 at 9:36 AM, Ameya Kanitkar <am...@groupon.com> wrote:

> Thanks for your response.
>
> I checked the namenode logs and found the following:
>
> 2013-08-28 15:25:24,025 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> recoverLease: recover lease [Lease. Holder:
> DFSClient_hb_rs_smartdeals-hbase14-snc1.snc1,60020,1377700014053_-346895658_25,
> pendingcreates: 1],
> src=/hbase/.logs/smartdeals-hbase14-snc1.snc1,60020,1377700014053-splitting/smartdeals-hbase14-snc1.snc1%2C60020%2C1377700014053.1377700015413
> from client
> DFSClient_hb_rs_smartdeals-hbase14-snc1.snc1,60020,1377700014053_-346895658_25
> 2013-08-28 15:25:24,025 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> Recovering lease=[Lease. Holder:
> DFSClient_hb_rs_smartdeals-hbase14-snc1.snc1,60020,1377700014053_-346895658_25,
> pendingcreates: 1],
> src=/hbase/.logs/smartdeals-hbase14-snc1.snc1,60020,1377700014053-splitting/smartdeals-hbase14-snc1.snc1%2C60020%2C1377700014053.1377700015413
> 2013-08-28 15:25:24,025 WARN org.apache.hadoop.hdfs.StateChange: BLOCK*
> internalReleaseLease: All existing blocks are COMPLETE, lease removed, file closed.
>
> There are LeaseException errors on the namenode as well:
> http://pastebin.com/4feVcL1F. Not sure why it's happening.
>
> I do not think I am hitting any timeouts, as my jobs fail within a couple
> of minutes, while all my timeouts are 10+ minutes. Not sure why the above
> would happen.
>
> Ameya
>
> On Wed, Aug 28, 2013 at 9:00 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> From the log you posted on pastebin, I see the following.
>> Can you check the namenode log to see what went wrong?
>>
>> Caused by: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
>> No lease on
>> /hbase/.logs/smartdeals-hbase14-snc1.snc1,60020,1376944419197/smartdeals-hbase14-snc1.snc1%2C60020%2C1376944419197.1377699297514
>> File does not exist. [Lease. Holder:
>> DFSClient_hb_rs_smartdeals-hbase14-snc1.snc1,60020,1376944419197_-413917755_25,
>> pendingcreates: 1]
>>
>> On Wed, Aug 28, 2013 at 8:00 AM, Ameya Kanitkar <am...@groupon.com> wrote:
>>
>> > Hi All,
>> >
>> > We have a very heavy MapReduce job that goes over an entire HBase table
>> > with over 1 TB of data and exports all of it (similar to the Export job,
>> > but with some additional custom code built in) to HDFS.
>> >
>> > However, this job is not very stable, and often we get the following
>> > error and the job fails:
>> >
>> > org.apache.hadoop.hbase.regionserver.LeaseException:
>> > org.apache.hadoop.hbase.regionserver.LeaseException: lease
>> > '-4456594242606811626' does not exist
>> >     at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
>> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2429)
>> >     at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >     at java.lang.reflect.Method.invoke(Method.java:597)
>> >     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1400)
>> >
>> >     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> >     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>> >     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>> >     at java.lang.reflect.Constructor.newInstance(Constructor.
>> >
>> > Here are more detailed logs from the RS: http://pastebin.com/xaHF4ksb
>> >
>> > We have changed the following settings in HBase to counter this problem,
>> > but the issue persists:
>> >
>> > <property>
>> >   <!-- Loaded from hbase-site.xml -->
>> >   <name>hbase.regionserver.lease.period</name>
>> >   <value>900000</value>
>> > </property>
>> >
>> > <property>
>> >   <!-- Loaded from hbase-site.xml -->
>> >   <name>hbase.rpc.timeout</name>
>> >   <value>900000</value>
>> > </property>
>> >
>> > We also reduced the number of mappers per RS to below the number of
>> > available CPUs on the box.
>> >
>> > We also observed that once the problem happens, it happens multiple
>> > times on the same RS. All other regions are unaffected. But a different
>> > RS observes this problem on different days, and there is no particular
>> > region causing it either.
>> >
>> > We are running 0.94.2 with CDH 4.2.0.
>> >
>> > Any ideas?
>> >
>> > Ameya
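For context on the lease settings discussed above: a scanner lease on the region server is typically kept alive only by the client's next() calls, so it can expire when a mapper spends longer processing one batch of rows than the configured lease period. A minimal sketch of that arithmetic, assuming (this framing is mine, not from the thread) the lease timer restarts each time a batch is returned and the client does not call next() again until every row in the batch is processed:

```python
# Sketch: will a scanner lease expire between two next() RPCs?
# Assumption (not from the thread): the client processes a full batch of
# `caching` rows between RPCs, and the region server's lease timer runs
# for that whole interval.

def batch_processing_seconds(caching, seconds_per_row):
    """Client-side time spent before the next RPC: rows per batch x cost per row."""
    return caching * seconds_per_row

def lease_expires(caching, seconds_per_row, lease_period_seconds):
    """True if the gap between next() calls exceeds the lease period."""
    return batch_processing_seconds(caching, seconds_per_row) > lease_period_seconds

# With the thread's 900000 ms (900 s) hbase.regionserver.lease.period:
LEASE_SECONDS = 900
print(lease_expires(1000, 0.5, LEASE_SECONDS))  # 500 s per batch  -> False
print(lease_expires(1000, 1.2, LEASE_SECONDS))  # 1200 s per batch -> True
```

This is why lowering scan caching (fewer rows fetched per RPC) is a common mitigation alongside raising the lease period; note, though, that the reporter says his jobs fail within a couple of minutes, well under the 900 s lease, so simple slow-batch expiry may not be the cause here.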