Andrew Purtell created HBASE-21358:
--------------------------------------

             Summary: Snapshot procedure fails but SnapshotManager thinks it is still running
                 Key: HBASE-21358
                 URL: https://issues.apache.org/jira/browse/HBASE-21358
             Project: HBase
          Issue Type: Bug
          Components: snapshots
    Affects Versions: 1.3.2
            Reporter: Andrew Purtell


A snapshot procedure fails due to a chaos test action, but the snapshot manager 
still thinks it is running. The test client spins needlessly, polling for 
something that will never complete. We give up eventually, but we could fail 
this much faster. 

On the integration client we are checking and re-checking: 

2018-10-20 01:06:11,718 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: Getting 
current status of snapshot from master... 
2018-10-20 01:06:11,719 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: (#40) 
Sleeping: 8571ms while waiting for snapshot completion. 

This is what it looks like on the master side each time the client checks in: 

2018-10-20 01:04:54,565 DEBUG 
[RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100] 
master.MasterRpcServices: Checking to see if snapshot from request:{ 
ss=IntegrationTestBigLinkedList-it-1539997289258 
table=IntegrationTestBigLinkedList type=FLUSH } is done 
2018-10-20 01:04:54,565 DEBUG 
[RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100] 
snapshot.SnapshotManager: Snapshoting '{ 
ss=IntegrationTestBigLinkedList-it-1539997289258 
table=IntegrationTestBigLinkedList type=FLUSH }' is still in progress! 

There is no running procedure for the snapshot; the procedure has failed. The 
snapshot manager takes no useful action afterward and still believes the 
snapshot is in progress.
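
To illustrate the fail-fast behavior I have in mind, here is a minimal, 
self-contained sketch. It is not the actual SnapshotManager code: the class, 
the State enum, the exception, and every method name below are hypothetical 
and only model the idea that a failed procedure should be recorded, so that 
the next status check throws instead of answering "still in progress".

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified model; NOT the real HBase SnapshotManager. */
public class SnapshotStatusSketch {

    /** Illustrative states of a snapshot attempt (assumed, not from HBase). */
    enum State { RUNNING, FAILED, COMPLETED }

    /** Unchecked exception that lets a polling client stop waiting at once. */
    static class SnapshotFailedException extends RuntimeException {
        SnapshotFailedException(String message) {
            super(message);
        }
    }

    // Snapshot name -> last known state of its procedure.
    private final Map<String, State> attempts = new ConcurrentHashMap<>();

    void start(String snapshot) {
        attempts.put(snapshot, State.RUNNING);
    }

    /** If nobody records the failure, the state stays RUNNING forever. */
    void procedureFailed(String snapshot) {
        attempts.put(snapshot, State.FAILED);
    }

    void procedureCompleted(String snapshot) {
        attempts.put(snapshot, State.COMPLETED);
    }

    /**
     * Returns true when the snapshot completed and false while it is running.
     * Throws when the procedure failed or the snapshot is unknown, so the
     * caller does not keep sleeping and re-checking.
     */
    boolean isSnapshotDone(String snapshot) {
        State state = attempts.get(snapshot);
        if (state == null) {
            throw new SnapshotFailedException("No record of snapshot " + snapshot);
        }
        switch (state) {
            case COMPLETED:
                return true;
            case FAILED:
                throw new SnapshotFailedException("Snapshot " + snapshot + " failed");
            default:
                return false; // RUNNING
        }
    }

    public static void main(String[] args) {
        SnapshotStatusSketch tracker = new SnapshotStatusSketch();
        tracker.start("example-snapshot");
        System.out.println("done? " + tracker.isSnapshotDone("example-snapshot"));

        // Simulate the chaos action killing the procedure.
        tracker.procedureFailed("example-snapshot");
        try {
            tracker.isSnapshotDone("example-snapshot");
        } catch (SnapshotFailedException e) {
            System.out.println("client stops polling: " + e.getMessage());
        }
    }
}
```

With something along these lines, a client would see the failure on its next 
status check rather than sleeping through the rest of its retries.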

I see related complaints from the hfile archiver task afterward: empty 
directories, failures to parse protobuf in descriptor files, and so on. It 
seems there was junk left over in the filesystem from the failed snapshot. The 
master was soon restarted by a chaos action, and I no longer see these 
complaints, so the partially complete snapshot may have been cleaned up.

This is with 1.3.2, but patched to include the multithreaded hfile archiving 
improvements from later versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
