[ https://issues.apache.org/jira/browse/HBASE-21358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684364#comment-16684364 ]
Tak Lon (Stephen) Wu commented on HBASE-21358:
----------------------------------------------

@[~apurtell] may I try this one? In addition, do you know if there is any way to test this problem on a live cluster? Or would it be fine if I write a test to cover this case?

> Snapshot procedure fails but SnapshotManager thinks it is still running
> -----------------------------------------------------------------------
>
>                 Key: HBASE-21358
>                 URL: https://issues.apache.org/jira/browse/HBASE-21358
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 1.3.2
>            Reporter: Andrew Purtell
>            Priority: Major
>
> A snapshot procedure fails due to chaotic test action but the snapshot
> manager still thinks it is running. The test client spins needlessly checking
> for something that will never actually complete. We give up eventually but we
> could be failing this a lot faster.
>
> On the integration client we are checking and re-checking:
>
> 2018-10-20 01:06:11,718 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: Getting
> current status of snapshot from master...
> 2018-10-20 01:06:11,719 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: (#40)
> Sleeping: 8571ms while waiting for snapshot completion.
>
> This is what it looks like on the master side each time the client checks in:
>
> 2018-10-20 01:04:54,565 DEBUG
> [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100]
> master.MasterRpcServices: Checking to see if snapshot from request:{
> ss=IntegrationTestBigLinkedList-it-1539997289258
> table=IntegrationTestBigLinkedList type=FLUSH } is done
> 2018-10-20 01:04:54,565 DEBUG
> [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100]
> snapshot.SnapshotManager: Snapshoting '{
> ss=IntegrationTestBigLinkedList-it-1539997289258
> table=IntegrationTestBigLinkedList type=FLUSH }' is still in progress!
>
> There is no running procedure for the snapshot. The procedure has failed. The
> snapshot manager does not take any useful action afterward but believes the
> snapshot to still be in progress.
>
> I see related complaint from the hfile archiver task afterward, empty
> directories, failure to parse protobuf in descriptor files... Seems like
> there was junk in the filesystem left over from the failed snapshot. The
> master was soon restarted by chaos action, and now I don't see these
> complaints, so that partially complete snapshot may have been cleaned up.
>
> This is with 1.3.2, but patched to include the multithreaded hfile archiving
> improvements from later versions.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
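
For reference, here is a rough sketch of the fail-fast behavior the issue asks for, and of the kind of check a unit test could make. It is a toy model under stated assumptions, not HBase code: SnapshotTrackerSketch, State, start, finish, isSnapshotDone and waitForCompletion are all hypothetical names invented for illustration and do not correspond to the real SnapshotManager or HBaseAdmin APIs. The idea is only that a status check should raise as soon as the procedure is known to have failed (or is not tracked at all), rather than answering "still in progress" until the client gives up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model for illustration only; none of these names are HBase APIs.
class SnapshotTrackerSketch {
    enum State { RUNNING, DONE, FAILED }

    // Snapshot name -> last known state of the procedure taking it.
    private final Map<String, State> procedures = new ConcurrentHashMap<>();

    void start(String snapshot) {
        procedures.put(snapshot, State.RUNNING);
    }

    // Called when the procedure ends, successfully or not.
    void finish(String snapshot, boolean succeeded) {
        procedures.put(snapshot, succeeded ? State.DONE : State.FAILED);
    }

    // Returns true when done and false while still running. A failed or
    // untracked procedure raises immediately instead of reporting "in progress".
    boolean isSnapshotDone(String snapshot) {
        State state = procedures.get(snapshot);
        if (state == null) {
            throw new IllegalStateException("No procedure is tracked for snapshot " + snapshot);
        }
        if (state == State.FAILED) {
            throw new IllegalStateException("Procedure failed for snapshot " + snapshot);
        }
        return state == State.DONE;
    }

    // Client-side wait loop: returns the number of status checks made
    // before the snapshot was reported done.
    static int waitForCompletion(SnapshotTrackerSketch tracker, String snapshot,
                                 int maxChecks, long sleepMillis) {
        for (int check = 1; check <= maxChecks; check++) {
            if (tracker.isSnapshotDone(snapshot)) {
                return check;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for snapshot " + snapshot);
            }
        }
        throw new IllegalStateException("Gave up waiting for snapshot " + snapshot);
    }

    public static void main(String[] args) {
        SnapshotTrackerSketch tracker = new SnapshotTrackerSketch();
        tracker.start("demo-snapshot");
        tracker.finish("demo-snapshot", false); // simulate the chaos-induced failure
        try {
            waitForCompletion(tracker, "demo-snapshot", 40, 10L);
        } catch (IllegalStateException e) {
            System.out.println("Failed fast: " + e.getMessage());
        }
    }
}
```

A real test would presumably have to inject the procedure failure into the master and then assert that the status check surfaces an error on the first poll; how to do that against the actual 1.3 code is not something this sketch answers.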