Hello Accumulo wizards,

I have a large schema of test data in an Accumulo instance that is currently 
inaccessible, and which I would like to recover if possible. I'll explain the 
problem in the hope that folks who know the intricacies of the Accumulo root 
table, WALs, and the recovery process can tell me whether there are additional 
actions I can take, or whether I should treat this schema as hosed.

The problem is similar to what was reported here 
(https://community.hortonworks.com/questions/52718/failed-to-locate-tablet-for-table-0-row-err.html),
 i.e. no tablets are loaded except one from accumulo.root, and the logs are 
rapidly repeating these messages:

==> monitor_stti-master.bbn.com.debug.log <==
2017-04-21 07:10:55,047 [impl.ThriftScanner] DEBUG:  Failed to locate tablet 
for table : !0 row : ~err_

==> master_stti-master.bbn.com.debug.log <==
2017-04-21 07:10:55,430 [master.Master] DEBUG: Finished gathering information 
from 13 servers in 0.03 seconds
2017-04-21 07:10:55,430 [master.Master] DEBUG: not balancing because there are 
unhosted tablets: 2

The RecoveryManager insists that it is trying to recover five WALs:

2017-04-21 07:28:48,349 [recovery.RecoveryManager] DEBUG: Recovering 
hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a
 to 
hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/0d28801e-322e-44e6-97e3-a34a14b4bd1a
2017-04-21 07:28:48,358 [recovery.RecoveryManager] DEBUG: Recovering 
hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9
 to 
hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/696d4353-0041-4397-a1f5-b8600b5cb2e9
2017-04-21 07:28:48,362 [recovery.RecoveryManager] DEBUG: Recovering 
hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3
 to 
hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/e62f4195-c7d6-419a-a696-ff89b10cecc3
2017-04-21 07:28:48,366 [recovery.RecoveryManager] DEBUG: Recovering 
hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a
 to 
hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/01a0887e-4ac8-4772-8f5f-b99371e1df0a
2017-04-21 07:28:48,369 [recovery.RecoveryManager] DEBUG: Recovering 
hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105
 to 
hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/6f392ec5-821b-4fd5-83e4-baf1f47d8105

Based on the advice from the post linked above, I grepped the logs and 
confirmed that all five of those WALs were in fact deleted (here's the output 
from my grep; note the earlier timestamps):

gc_stti-master.bbn.com.debug.log.2:2017-04-12 14:49:36,275 
[gc.GarbageCollectWriteAheadLogs] DEBUG: deleted 
[hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105]
 from stti-data-102.bbn.com+10011
gc_stti-master.bbn.com.debug.log.2:2017-04-12 14:49:36,280 
[gc.GarbageCollectWriteAheadLogs] DEBUG: deleted 
[hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3]
 from stti-data-103.bbn.com+10011
gc_stti-master.bbn.com.debug.log.3:2017-04-03 20:25:26,699 
[gc.GarbageCollectWriteAheadLogs] DEBUG: deleted 
[hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a]
 from stti-data-103.bbn.com+10011
gc_stti-master.bbn.com.debug.log.3:2017-04-08 16:32:11,106 
[gc.GarbageCollectWriteAheadLogs] DEBUG: deleted 
[hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a]
 from stti-data-102.bbn.com+10011
gc_stti-master.bbn.com.debug.log.3:2017-04-08 16:37:14,875 
[gc.GarbageCollectWriteAheadLogs] DEBUG: deleted 
[hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9]
 from stti-data-103.bbn.com+10011
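
For completeness, the grep was essentially the following. This is a minimal, 
self-contained sketch: the sample log file is recreated from one of the lines 
above so the command can be tried anywhere, and the filename gc_sample.debug.log 
is made up for illustration.

```shell
# Recreate one of the GC log lines from above as sample input, so this
# sketch runs without access to my cluster's actual logs.
cat > gc_sample.debug.log <<'EOF'
2017-04-12 14:49:36,275 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105] from stti-data-102.bbn.com+10011
EOF

# The actual search: find GC deletion records, then filter for a WAL UUID.
grep -h "GarbageCollectWriteAheadLogs.*deleted" gc_*.debug.log* \
  | grep "6f392ec5-821b-4fd5-83e4-baf1f47d8105"
```

I repeated the second grep once per WAL UUID from the RecoveryManager output.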

All five WALs still appear as log references in the accumulo.root table:

!0;~ 
log:stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a
 []    
hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a|1
!0;~ 
log:stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9
 []    
hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9|1
!0;~ 
log:stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3
 []    
hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3|1
...
!0< 
log:stti-data-102.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a
 []    
hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a|1
!0< 
log:stti-data-102.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105
 []    
hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105|1
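
(For reference, that output came from scanning just the log column family in 
the Accumulo shell, like so; happy to scan anything else that would help:)

```
root@bbn-beta> scan -t accumulo.root -c log
```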

I also observe three outstanding fate transactions (at least two of which 
appear to be related to the accumulo.root table):

root@bbn-beta> fate print
txid: 6b33fa130909f05d  status: IN_PROGRESS         op: CompactRange     
locked: [R:+accumulo, R:!0] locking: []              top: CompactionDriver
txid: 564d758d584af61e  status: IN_PROGRESS         op: CompactRange     
locked: [R:+accumulo, R:!0] locking: []              top: CompactionDriver
txid: 4a620317a53a4a93  status: IN_PROGRESS         op: CreateTable      
locked: [W:5e, R:+default] locking: []              top: PopulateMetadata

I checked in ZooKeeper, and the /accumulo/$INSTANCE/root_tablet/walogs and 
/accumulo/$INSTANCE/recovery/[locks] nodes are all empty.
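
In case it matters how I checked: I used zkCli.sh, roughly as below ($INSTANCE 
stands for my instance id, as in the paths above, and the empty brackets are 
the output I actually got):

```
$ zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /accumulo/$INSTANCE/root_tablet/walogs
[]
[zk: localhost:2181(CONNECTED) 1] ls /accumulo/$INSTANCE/recovery
[]
```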

I don't know exactly what to do at this point. I could:

a) Try deleting the fate operations and see if that releases the Accumulo 
instance.
b) Try deleting the accumulo.root table entries pointing to the already-deleted 
WALs.
c) Call it quits on this instance, blow it away, and start re-generating my 
test data over the weekend.

Before resorting to option (c), I would most likely try options (a) and (b), 
probably in that order. But I would love to get some insight from the Accumulo 
experts before I do.
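
If I do try (a) and (b), I'm assuming the commands would look roughly like the 
following (1.x shell syntax; the txids come from the fate print output above, 
and the delete's row/qualifier are copied from the first root-table entry 
above). Please correct me if this is the wrong way to go about either one:

```
# (a) fail, then delete, each stuck fate transaction, e.g.:
root@bbn-beta> fate fail 6b33fa130909f05d
root@bbn-beta> fate delete 6b33fa130909f05d

# (b) remove the stale log: entries from accumulo.root, one per WAL, e.g.:
root@bbn-beta> table accumulo.root
root@bbn-beta> delete "!0;~" log "stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a"
```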

Thanks in advance,

Jonathan
