[ https://issues.apache.org/jira/browse/HBASE-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536304#comment-13536304 ]

Enis Soztutar commented on HBASE-7385:
--------------------------------------

bq. How will Memstore.snapshot() become idempotent?
snapshot() already checks whether a previous snapshot is lying around; we can 
just remove the warning log. If the flush is aborted, the memstore snapshot 
lives on, and a new snapshot cannot be taken until this one has been flushed. 
We might have to adjust the memstore size() calculations.
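
For illustration, a minimal sketch of what an idempotent snapshot() could look like (field and type names such as kvset, snapshot, and KeyValueSkipListSet approximate the current Memstore; this is a sketch of the idea, not a patch):

{code}
// Sketch only: keep a leftover snapshot instead of warning about it.
void snapshot() {
  this.lock.writeLock().lock();
  try {
    if (!this.snapshot.isEmpty()) {
      // A previous snapshot survived an aborted flush. Do nothing; a new
      // snapshot cannot be taken until this one has been flushed.
      return;
    }
    if (!this.kvset.isEmpty()) {
      this.snapshot = this.kvset;
      this.kvset = new KeyValueSkipListSet(this.comparator);
      // Memstore size() accounting may need adjustment here, since the
      // snapshot can now outlive a failed flush attempt.
    }
  } finally {
    this.lock.writeLock().unlock();
  }
}
{code}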

                
> Do not abort regionserver if StoreFlusher.flushCache() fails
> ------------------------------------------------------------
>
>                 Key: HBASE-7385
>                 URL: https://issues.apache.org/jira/browse/HBASE-7385
>             Project: HBase
>          Issue Type: Improvement
>          Components: io, regionserver
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>
> A rare NN failover may cause an RS abort, in the following sequence of events: 
>  - RS tries to flush the memstore
>  - Creates a file, starts a block, and acquires a lease
>  - The block is complete and the lease removed, but the NN is killed before it sends the RPC response back to the client
>  - The new NN comes up, the client retries the block complete, and the new NN throws LeaseExpiredException since the block was already complete
>  - RS receives the exception, and aborts
> This is actually a NN+DFSClient issue: the DFS client on the RS does not receive the RPC response about the block close, and upon retry against the new NN it gets the exception, since the file was already closed. However, although this is DFS-client specific, we can also make the RS more resilient by not aborting it upon an exception from flushCache(). We can change StoreFlusher so that (see the sketch after this list): 
>  - StoreFlusher.prepare() becomes idempotent (and so does Memstore.snapshot())
>  - StoreFlusher.flushCache() throws an IOException upon a DFS exception, but we catch the IOException and just abort the flush request (not the RS)
>  - StoreFlusher.commit() still causes an RS abort on exception. This is also debatable: if DFS is alive and we can undo the flush changes, then we should not abort. 
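> For illustration, a minimal sketch of the catch-and-abort flow described above (names approximate HRegion.internalFlushcache() and the StoreFlusher interface; a sketch of the idea, not the actual patch):
> {code}
> // Sketch only: an IOException from flushCache() aborts this flush request,
> // not the regionserver. The retained memstore snapshot is picked up by a
> // later flush, relying on prepare()/snapshot() being idempotent.
> boolean flushed = false;
> try {
>   storeFlusher.prepare();            // idempotent: reuses a leftover snapshot
>   storeFlusher.flushCache(status);   // may throw IOException on DFS errors
>   flushed = true;
> } catch (IOException ioe) {
>   LOG.warn("Flush failed; giving up on this flush request. The memstore " +
>       "snapshot is retained and will be retried by the next flush.", ioe);
> }
> if (flushed) {
>   // An exception from commit() still aborts the RS (debatable, see above).
>   storeFlusher.commit(status);
> }
> {code}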
> logs: 
> {code}
> org.apache.hadoop.hbase.DroppedSnapshotException: region: loadtest_ha,e6666658,1355820729877.298bcbd550b80507a379fe67eefbe5ea.
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1485)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1364)
>       at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:896)
>       at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:845)
>       at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:119)
>       at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>       at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /apps/hbase/data/loadtest_ha/298bcbd550b80507a379fe67eefbe5ea/.tmp/5cf8951ee12449ce8e4e6dd0bf1645c2 File is not open for writing. [Lease.  Holder: DFSClient_hb_rs_XXX,60020,1355813552066_203591774_25, pendingcreates: 1]
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1724)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1707)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1762)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1750)
>       at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:779)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1107)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
>       at $Proxy10.complete(Unknown Source)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
>       at $Proxy10.complete(Unknown Source)
>       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:4087)
>       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3988)
>       at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
>       at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
>       at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:255)
>       at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.close(HFileWriterV2.java:432)
>       at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:1214)
>       at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:762)
>       at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:674)
>       at org.apache.hadoop.hbase.regionserver.Store.access$400(Store.java:109)
>       at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:2286)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1460)
> {code}
