[ https://issues.apache.org/jira/browse/HBASE-10136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849412#comment-13849412 ]

Jonathan Hsieh commented on HBASE-10136:
----------------------------------------

I agree that we need rules -- the invariant I think we need here is: if an 
operation starts with a region in the open state and is supposed to complete 
with the region in the open state, then that region (or a suitable replacement) 
must be open when the operation completes.

Currently I only see open/close/open conflicts (splits and alters, and likely 
merges).  Can we get away with just "fixing" those three operations so that 
their respective table locks are held until the opens complete?  Is the wait 
until handler completion needed for any other operations?
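To make the proposal concrete, here is a minimal single-threaded sketch, not the actual HBase implementation -- the class and method names (TableLockSketch, alterTable, snapshot) and the region counter are all illustrative. The point it shows: the alter path holds the exclusive table lock until the region opens complete, so a snapshot taking the shared lock can never observe the table with regions mid-close.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of "hold the table lock until the opens complete".
// None of these names correspond to real HBase classes.
public class TableLockSketch {
    static final ReentrantReadWriteLock tableLock = new ReentrantReadWriteLock();
    static int openRegions = 3;  // pretend the table has 3 regions

    // Alter takes the exclusive lock and only releases it after every
    // closed region has been reopened.
    static void alterTable() {
        tableLock.writeLock().lock();
        try {
            openRegions = 0;  // regions closed for the schema change
            openRegions = 3;  // in real code: block until open handlers finish
        } finally {
            tableLock.writeLock().unlock();  // released only after opens complete
        }
    }

    // Snapshot takes the shared lock, so under this scheme it can never
    // see the table in a partially open state.
    static boolean snapshot() {
        tableLock.readLock().lock();
        try {
            return openRegions == 3;  // the invariant: all regions open
        } finally {
            tableLock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        alterTable();
        System.out.println(snapshot());
    }
}
```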





> Alter table conflicts with concurrent snapshot attempt on that table
> --------------------------------------------------------------------
>
>                 Key: HBASE-10136
>                 URL: https://issues.apache.org/jira/browse/HBASE-10136
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 0.96.0, 0.98.1, 0.99.0
>            Reporter: Aleksandr Shulman
>            Assignee: Matteo Bertozzi
>              Labels: online_schema_change
>
> Expected behavior:
> A user can issue a request for a snapshot of a table while that table is 
> undergoing an online schema change and expect that snapshot request to 
> complete correctly. The same is true if a user issues an online schema 
> change request while a snapshot attempt is ongoing.
> Observed behavior:
> Snapshot attempts time out when there is an ongoing online schema change 
> because the region is closed and opened during the snapshot. 
> As a side note, I would expect the attempt to fail quickly rather than 
> time out. 
> Further, what I have seen is that subsequent attempts to snapshot the table 
> fail because of some state/cleanup issues. This is also concerning.
> Immediate error:
> {code}type=FLUSH }' is still in progress!
> 2013-12-11 15:58:32,883 DEBUG [Thread-385] client.HBaseAdmin(2696): (#11) 
> Sleeping: 10000ms while waiting for snapshot completion.
> 2013-12-11 15:58:42,884 DEBUG [Thread-385] client.HBaseAdmin(2704): Getting 
> current status of snapshot from master...
> 2013-12-11 15:58:42,887 DEBUG [FifoRpcScheduler.handler1-thread-3] 
> master.HMaster(2891): Checking to see if snapshot from request:{ ss=snapshot0 
> table=changeSchemaDuringSnapshot1386806258640 type=FLUSH } is done
> 2013-12-11 15:58:42,887 DEBUG [FifoRpcScheduler.handler1-thread-3] 
> snapshot.SnapshotManager(374): Snapshoting '{ ss=snapshot0 
> table=changeSchemaDuringSnapshot1386806258640 type=FLUSH }' is still in 
> progress!
> Snapshot failure occurred
> org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 
> 'snapshot0' wasn't completed in expectedTime:60000 ms
>       at 
> org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2713)
>       at 
> org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2638)
>       at 
> org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2602)
>       at 
> org.apache.hadoop.hbase.client.TestAdmin$BackgroundSnapshotThread.run(TestAdmin.java:1974){code}
> Likely root cause of error:
> {code}Exception in SnapshotSubprocedurePool
> java.util.concurrent.ExecutionException: 
> org.apache.hadoop.hbase.NotServingRegionException: 
> changeSchemaDuringSnapshot1386806258640,77777777,1386806258720.ea776db51749e39c956d771a7d17a0f3.
>  is closing
>       at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>       at 
> org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:314)
>       at 
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
>       at 
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
>       at 
> org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
>       at 
> org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:1)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>       at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.hbase.NotServingRegionException: 
> changeSchemaDuringSnapshot1386806258640,77777777,1386806258720.ea776db51749e39c956d771a7d17a0f3.
>  is closing
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5327)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5289)
>       at 
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:79)
>       at 
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:1)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>       ... 5 more{code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)