I think it is not directly caused by the throttle. On the 2nd run with the non-throttle jar, the LeaseExpiredException shows up again (for the big file). So it does seem like exportSnapshot is not reliable for big files.

The weird thing is that when I replace the jar and restart the cluster, the first run on the big table always succeeds, but the later runs always fail with these LeaseExpiredExceptions. The smaller table has no problem no matter how many times I re-run.

Thanks
Tian-Ying
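To make the slow-write hypothesis discussed further down in this thread concrete, here is a minimal sketch of a bandwidth-throttled copy loop. It is an illustration only, not the actual ExportSnapshot/HBASE-11083 code; the class, method, and parameter names are made up, and the 60-second figure is the default HDFS lease soft limit. The point is that if the sleep computed between two write() calls grows very large (a units bug, or a stuck mapper), the writer stops touching its open file for long stretches, which is the failure mode Matteo suspects below.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/**
 * Minimal sketch (not the real ExportSnapshot code) of a bandwidth-throttled
 * copy loop. If the computed sleep between two write() calls grows past the
 * HDFS lease soft limit (60 seconds by default), the writer can lose its lease
 * and later namenode calls such as addBlock()/complete() fail with
 * LeaseExpiredException.
 */
public class ThrottledCopySketch {

  // bandwidthBytesPerSec is an assumed parameter, e.g. 200L * 1024 * 1024 for "200M".
  static void copy(InputStream in, OutputStream out, long bandwidthBytesPerSec)
      throws IOException, InterruptedException {
    byte[] buf = new byte[64 * 1024];
    long copied = 0;
    long start = System.currentTimeMillis();
    int n;
    while ((n = in.read(buf)) > 0) {
      out.write(buf, 0, n);
      copied += n;
      // Elapsed time we *should* have spent if writing exactly at the cap.
      long expectedMs = copied * 1000L / bandwidthBytesPerSec;
      long actualMs = System.currentTimeMillis() - start;
      long sleepMs = expectedMs - actualMs;
      // A units mistake (MB vs bytes) or a slow/stuck machine can make sleepMs
      // far larger than intended, so the open file sees no writes for minutes.
      if (sleepMs > 0) {
        Thread.sleep(sleepMs);
      }
    }
  }
}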
On Wed, Apr 30, 2014 at 2:24 PM, Tianying Chang <[email protected]> wrote:

> Ted,
>
> It seems it is due to HBASE-11083 (throttle bandwidth during snapshot
> export) <https://issues.apache.org/jira/browse/HBASE-11083>. After I revert
> it, the job succeeds again. It seems that even when I set the throttle
> bandwidth high, like 200M, iftop shows a much lower value. Maybe the
> throttle is sleeping longer than it is supposed to? But I am not clear why
> a slow copy job can cause LeaseExpiredException. Any idea?
>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /hbase/.archive/rich_pin_data_v1/b50ab10bb4812acc2e9fa6c564c9adef/d/bac3c661a897466aaf1706a9e1bd9e9a
> File does not exist. Holder DFSClient_NONMAPREDUCE_-2096088484_1 does not
> have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:2454)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2431)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:536)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:335)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$
>
> Thanks
> Tian-Ying
>
> On Wed, Apr 30, 2014 at 1:25 PM, Ted Yu <[email protected]> wrote:
>
>> Tianying:
>> Have you checked the audit log on the namenode for a deletion event
>> corresponding to the files involved in the LeaseExpiredException?
>>
>> Cheers
>>
>> On Wed, Apr 30, 2014 at 10:44 AM, Tianying Chang <[email protected]> wrote:
>>
>> > This time the re-run passed (although with many failed/retried tasks) with
>> > my throttle bandwidth set to 200M (although per iftop, it never gets close
>> > to that number). Is there a way to increase the lease expiry time for low
>> > throttle bandwidth for an individual export job?
>> >
>> > Thanks
>> > Tian-Ying
>> >
>> > On Wed, Apr 30, 2014 at 10:17 AM, Tianying Chang <[email protected]> wrote:
>> >
>> > > Yes, I am using the bandwidth throttle feature. The export job for this
>> > > table actually succeeded on its first run. When I rerun it (for my
>> > > robustness testing) it never seems to pass. I am wondering if it has some
>> > > weird state (I did clean up the target cluster and even removed the
>> > > /hbase/.archive/rich_pin_data_v1 folder).
>> > >
>> > > It seems that even if I set the throttle value really large, it still
>> > > fails. And I think even after I replace the jar back with the one without
>> > > the throttle, it still fails on re-run.
>> > >
>> > > Is there some way that I can increase the lease to be very large to test
>> > > it out?
>> > >
>> > > On Wed, Apr 30, 2014 at 10:02 AM, Matteo Bertozzi <[email protected]> wrote:
>> > >
>> > >> The file is the file in export, so you are creating that file.
>> > >> Do you have the bandwidth throttle on?
>> > >>
>> > >> I'm thinking that the file is slow writing: e.g. write(few bytes), wait,
>> > >> write(few bytes), and on the wait your lease expires.
>> > >> Or something like that can happen if your MR job is stuck in some way
>> > >> (slow machine or similar) and it is not writing within the lease timeout.
>> > >>
>> > >> Matteo
>> > >>
>> > >> On Wed, Apr 30, 2014 at 9:53 AM, Tianying Chang <[email protected]> wrote:
>> > >>
>> > >> > We are using Hadoop 2.0.0-cdh4.2.0 and HBase 0.94.7. We also backported
>> > >> > several snapshot-related JIRAs, e.g. HBASE-10111 (verify snapshot) and
>> > >> > HBASE-11083 (bandwidth throttle in exportSnapshot).
>> > >> >
>> > >> > I found that when the LeaseExpiredException was first reported, that file
>> > >> > was indeed not there, and the map task retried. I verified a couple of
>> > >> > minutes later that the HFile does exist under /.archive. But the retried
>> > >> > map task still complains about the same file-not-exist error...
>> > >> >
>> > >> > I will check the namenode log for the LeaseExpiredException.
>> > >> >
>> > >> > Thanks
>> > >> > Tian-Ying
>> > >> >
>> > >> > On Wed, Apr 30, 2014 at 9:33 AM, Ted Yu <[email protected]> wrote:
>> > >> >
>> > >> > > Can you give us the hbase and hadoop releases you're using?
>> > >> > >
>> > >> > > Can you check the namenode log around the time the LeaseExpiredException
>> > >> > > was encountered?
>> > >> > >
>> > >> > > Cheers
>> > >> > >
>> > >> > > On Wed, Apr 30, 2014 at 9:20 AM, Tianying Chang <[email protected]> wrote:
>> > >> > >
>> > >> > > > Hi,
>> > >> > > >
>> > >> > > > When I export a large table with 460+ regions, I see the exportSnapshot
>> > >> > > > job fail sometimes (not all the time). The error from the map task is
>> > >> > > > below, but I verified the file highlighted below, and it does exist.
>> > >> > > > Smaller tables always seem to pass. Any idea? Is it because it is too
>> > >> > > > big and gets a session timeout?
>> > >> > > >
>> > >> > > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> > >> > > > No lease on
>> > >> > > > /hbase/.archive/rich_pin_data_v1/7713d5331180cb610834ba1c4ebbb9b3/d/eef3642f49244547bb6606d4d0f15f1f
>> > >> > > > File does not exist. Holder DFSClient_NONMAPREDUCE_279781617_1 does
>> > >> > > > not have any open files.
>> > >> > > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
>> > >> > > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
>> > >> > > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
>> > >> > > >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
>> > >> > > >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>> > >> > > >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>> > >> > > >     at org.apache.hadoop.ipc.ProtobufR
>> > >> > > >
>> > >> > > > Thanks
>> > >> > > > Tian-Ying
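Ted's suggestion in this thread (check the namenode audit log for a deletion event on the file named in the LeaseExpiredException) can be scripted. Below is a minimal sketch; the audit log location and the key=value line format (cmd=, src=) are assumptions that vary by distribution, and the HFile name is taken from the exception quoted above.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/**
 * Minimal sketch: scan an HDFS namenode audit log for delete/rename events
 * touching the HFile named in the LeaseExpiredException. The log path and
 * line format are assumptions; adjust for your installation.
 */
public class AuditLogGrep {
  public static void main(String[] args) throws IOException {
    String auditLog = "/var/log/hadoop-hdfs/hdfs-audit.log";   // assumed location
    String hfile = "bac3c661a897466aaf1706a9e1bd9e9a";         // file from the exception above
    try (BufferedReader r = new BufferedReader(new FileReader(auditLog))) {
      String line;
      while ((line = r.readLine()) != null) {
        // Keep only delete/rename events that mention the HFile in question.
        if (line.contains(hfile)
            && (line.contains("cmd=delete") || line.contains("cmd=rename"))) {
          System.out.println(line);
        }
      }
    }
  }
}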

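Since the re-runs discussed in this thread leave state under /hbase/.archive on the target cluster, a quick programmatic check (and, if desired, cleanup) of the leftover archive directory before re-running can rule out stale state. This is a minimal sketch using the Hadoop FileSystem API; the destination namenode URI is an assumption, the archive path matches the one mentioned in the thread, and deleting it is optional rather than something the thread concluded is necessary.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal sketch: before re-running exportSnapshot, list (and optionally
 * remove) the leftover archive directory for the table on the destination
 * cluster. The namenode URI is an assumption.
 */
public class ArchiveCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed destination namenode; replace with the real target cluster URI.
    FileSystem fs = FileSystem.get(java.net.URI.create("hdfs://dest-namenode:8020"), conf);

    Path archiveDir = new Path("/hbase/.archive/rich_pin_data_v1");
    if (fs.exists(archiveDir)) {
      for (FileStatus s : fs.listStatus(archiveDir)) {
        System.out.println(s.getPath() + " " + s.getLen());
      }
      // Uncomment to clear the stale state before the next export run.
      // fs.delete(archiveDir, true);
    } else {
      System.out.println("No leftover archive directory: " + archiveDir);
    }
  }
}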