[ 
https://issues.apache.org/jira/browse/HBASE-10136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849592#comment-13849592
 ] 

Matteo Bertozzi commented on HBASE-10136:
-----------------------------------------

[~sershe] we're not talking about snapshots here. Currently, snapshots are built 
to fail if a region is moving or is down, and this is by design. If you want to 
talk about how to fix this, open another jira.

The problem here is the TableEventHandler and the point at which the table lock
is released. For example, if you call modifyTable() twice, or a split runs
concurrently with modifyTable(), you don't get the behavior we expect from the
table lock, which is that an operation on the table is blocked until the other
one has completed.
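
To make the expected behavior concrete, here is a minimal sketch in plain Java. It is not the HBase TableEventHandler or table lock code; the class TableLockSketch, its submit() method, and the operation names are hypothetical, and the sleep only stands in for the asynchronous part of an operation (e.g. regions reopening). The assumption it illustrates is that the lock is released when the asynchronous work completes, not when the code that scheduled it returns.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Hypothetical illustration only; none of these names come from HBase.
final class TableLockSketch {
    // A Semaphore stands in for an exclusive table lock because the release
    // may happen on a different thread than the acquire.
    private final Semaphore tableLock = new Semaphore(1);
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final List<String> events =
            Collections.synchronizedList(new ArrayList<>());

    // Takes the lock, schedules the asynchronous work, and returns at once.
    // The lock is released only in the completion callback, so a second
    // submit() blocks until the first operation has fully finished.
    CompletableFuture<Void> submit(String name, long workMillis) {
        tableLock.acquireUninterruptibly();
        events.add(name + " start");
        return CompletableFuture.runAsync(() -> {
            sleepQuietly(workMillis); // simulated asynchronous completion
            events.add(name + " end");
        }, pool).whenComplete((result, error) -> tableLock.release());
    }

    // Runs two operations back to back and returns the recorded event order.
    List<String> runTwoOperations() {
        CompletableFuture<Void> first = submit("modifyTable-1", 200);
        CompletableFuture<Void> second = submit("modifyTable-2", 50);
        CompletableFuture.allOf(first, second).join();
        pool.shutdown();
        return new ArrayList<>(events);
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(new TableLockSketch().runTwoOperations());
    }
}
```

With the release in the completion callback, the second operation's "start" is recorded only after the first operation's "end". Releasing the lock right after scheduling the work instead would let the two operations overlap, which is the behavior described above.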

The other problem I'm pointing out, not completely related, is that because 
of this async completion the client call is not synchronous.

> Alter table conflicts with concurrent snapshot attempt on that table
> --------------------------------------------------------------------
>
>                 Key: HBASE-10136
>                 URL: https://issues.apache.org/jira/browse/HBASE-10136
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 0.96.0, 0.98.1, 0.99.0
>            Reporter: Aleksandr Shulman
>            Assignee: Matteo Bertozzi
>              Labels: online_schema_change
>
> Expected behavior:
> A user can issue a request for a snapshot of a table while that table is 
> undergoing an online schema change and expect that snapshot request to 
> complete correctly. Also, the same is true if a user issues an online schema 
> change request while a snapshot attempt is ongoing.
> Observed behavior:
> Snapshot attempts time out when there is an ongoing online schema change 
> because the region is closed and opened during the snapshot. 
> As a side-note, I would expect that the attempt should fail quickly as 
> opposed to timing out. 
> Further, what I have seen is that subsequent attempts to snapshot the table 
> fail because of some state/cleanup issues. This is also concerning.
> Immediate error:
> {code}type=FLUSH }' is still in progress!
> 2013-12-11 15:58:32,883 DEBUG [Thread-385] client.HBaseAdmin(2696): (#11) 
> Sleeping: 10000ms while waiting for snapshot completion.
> 2013-12-11 15:58:42,884 DEBUG [Thread-385] client.HBaseAdmin(2704): Getting 
> current status of snapshot from master...
> 2013-12-11 15:58:42,887 DEBUG [FifoRpcScheduler.handler1-thread-3] 
> master.HMaster(2891): Checking to see if snapshot from request:{ ss=snapshot0 
> table=changeSchemaDuringSnapshot1386806258640 type=FLUSH } is done
> 2013-12-11 15:58:42,887 DEBUG [FifoRpcScheduler.handler1-thread-3] 
> snapshot.SnapshotManager(374): Snapshoting '{ ss=snapshot0 
> table=changeSchemaDuringSnapshot1386806258640 type=FLUSH }' is still in 
> progress!
> Snapshot failure occurred
> org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 
> 'snapshot0' wasn't completed in expectedTime:60000 ms
>       at 
> org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2713)
>       at 
> org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2638)
>       at 
> org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2602)
>       at 
> org.apache.hadoop.hbase.client.TestAdmin$BackgroundSnapshotThread.run(TestAdmin.java:1974){code}
> Likely root cause of error:
> {code}Exception in SnapshotSubprocedurePool
> java.util.concurrent.ExecutionException: 
> org.apache.hadoop.hbase.NotServingRegionException: 
> changeSchemaDuringSnapshot1386806258640,77777777,1386806258720.ea776db51749e39c956d771a7d17a0f3.
>  is closing
>       at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>       at 
> org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:314)
>       at 
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
>       at 
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
>       at 
> org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
>       at 
> org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:1)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>       at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.hbase.NotServingRegionException: 
> changeSchemaDuringSnapshot1386806258640,77777777,1386806258720.ea776db51749e39c956d771a7d17a0f3.
>  is closing
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5327)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5289)
>       at 
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:79)
>       at 
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:1)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>       ... 5 more{code}



