[ https://issues.apache.org/jira/browse/HBASE-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535633#comment-13535633 ]

ramkrishna.s.vasudevan commented on HBASE-7385:
-----------------------------------------------

Nice one. :)
                
> Do not abort regionserver if StoreFlusher.flushCache() fails
> ------------------------------------------------------------
>
>                 Key: HBASE-7385
>                 URL: https://issues.apache.org/jira/browse/HBASE-7385
>             Project: HBase
>          Issue Type: Improvement
>          Components: io, regionserver
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>
> A rare NN failover may cause an RS abort, in the following sequence of events: 
>  - RS tries to flush the memstore
>  - RS creates a file, starts a block, and acquires a lease
>  - The block is complete and the lease removed, but the NN is killed before it sends the RPC response back to the client
>  - The new NN comes up; the client retries the block-complete call, and the new NN throws a lease-expired exception since the block is already complete
>  - RS receives the exception and aborts
> This is actually a NN+DFSClient issue: the dfs client on the RS never receives the rpc response for the block close, and on retrying against the new NN it gets the exception because the file is already closed. However, although this is DFS-client specific, we can also make the RS more resilient by not aborting it upon an exception from flushCache(). We can change StoreFlusher so that: 
>  - StoreFlusher.prepare() becomes idempotent (and so does Memstore.snapshot())
>  - StoreFlusher.flushCache() throws an IOException upon a DFS exception, but we catch the IOException and just abort the flush request (not the RS), as sketched below
>  - StoreFlusher.commit() still causes an RS abort on exception. This is also debatable: if dfs is alive and we can undo the flush changes, then we should not abort. 
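> For illustration, here is a minimal sketch of the intended flushCache() handling; the abort() method and the FlushHelper wrapper are assumptions made for the sketch, not the existing StoreFlusher API: 
> {code}
> import java.io.IOException;
>
> // Sketch only: abort() and FlushHelper are illustrative assumptions,
> // not the actual StoreFlusher API.
> interface StoreFlusher {
>   void prepare();                        // proposed: idempotent across retries
>   void flushCache() throws IOException;  // throws on DFS errors instead of aborting the RS
>   void commit() throws IOException;      // an exception here still aborts the RS
>   void abort();                          // assumed: drop the snapshot, keep the memstore edits
> }
>
> class FlushHelper {
>   /** Returns false (flush request dropped) instead of aborting the RS on a DFS error. */
>   static boolean flushOneStore(StoreFlusher flusher) throws IOException {
>     flusher.prepare();                   // safe to call again if a prior attempt failed
>     try {
>       flusher.flushCache();
>     } catch (IOException ioe) {
>       flusher.abort();                   // cancel only this flush, not the regionserver
>       return false;                      // the caller may retry the flush later
>     }
>     flusher.commit();                    // failures here still propagate and abort the RS
>     return true;
>   }
> }
> {code}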
> logs: 
> {code}
> org.apache.hadoop.hbase.DroppedSnapshotException: region: loadtest_ha,e6666658,1355820729877.298bcbd550b80507a379fe67eefbe5ea.
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1485)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1364)
>       at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:896)
>       at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:845)
>       at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:119)
>       at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>       at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /apps/hbase/data/loadtest_ha/298bcbd550b80507a379fe67eefbe5ea/.tmp/5cf8951ee12449ce8e4e6dd0bf1645c2 File is not open for writing. [Lease.  Holder: DFSClient_hb_rs_XXX,60020,1355813552066_203591774_25, pendingcreates: 1]
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1724)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1707)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1762)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1750)
>       at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:779)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1107)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
>       at $Proxy10.complete(Unknown Source)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
>       at $Proxy10.complete(Unknown Source)
>       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:4087)
>       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3988)
>       at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
>       at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
>       at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:255)
>       at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.close(HFileWriterV2.java:432)
>       at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:1214)
>       at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:762)
>       at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:674)
>       at org.apache.hadoop.hbase.regionserver.Store.access$400(Store.java:109)
>       at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:2286)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1460)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
