[ 
https://issues.apache.org/jira/browse/ACCUMULO-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820400#comment-16820400
 ] 

Michael Miller commented on ACCUMULO-3898:
------------------------------------------

[~elserj] are you still seeing this in 1.8 or 1.9?

> `admin stop` created deadlock in TabletServer being stopped
> -----------------------------------------------------------
>
>                 Key: ACCUMULO-3898
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3898
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.7.0
>            Reporter: Josh Elser
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: 0001-ACCUMULO-3898-Test-to-reproduce.patch
>
>
> Take a tabletserver, tserver1, one of a few.
> The other tabletservers have already been requested to stop and all have 
> stopped (via {{admin stop}}).
> {noformat}
> 2015-06-10 21:07:40,485 [master.Master] DEBUG: Seeding FATE op to shutdown 
> tserver1:9997[24ddf314d8c0010] with tid 9215266058859846639
> 2015-06-10 21:07:40,503 [master.EventCoordinator] INFO : Tablet Server 
> shutdown requested for tserver1:9997[24ddf314d8c0010]
> {noformat}
> The master FATE op gets seeded to shutdown tserver1. As we can guess, since 
> it's the last server, it's hosting {{r<<}}. It unloads it
> {noformat}
> 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) +r<<
> 2015-06-10 21:07:40,581 [tablet.Tablet] DEBUG: completeClose(saveState=true 
> completeClose=true) +r<<
> 2015-06-10 21:07:40,587 [tserver.TabletServer] DEBUG: MultiScanSess 
> master:58287 1 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
> 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x0000000004df6fa0
> 2015-06-10 21:07:40,608 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x0000000004df7040
> 2015-06-10 21:07:40,609 [tserver.NativeMap] DEBUG: Deallocating native map 
> 0x00000000022b4da0
> 2015-06-10 21:07:40,609 [tablet.Tablet] TABLET_HIST: +r<< closed
> 2015-06-10 21:07:40,609 [tserver.TabletServer] DEBUG: Unassigning 
> +r<<@(null,tserver1:9997[24ddf314d8c0010],null)
> 2015-06-10 21:07:40,610 [state.ZooStore] DEBUG: Removing /root_tablet/location
> 2015-06-10 21:07:40,613 [state.ZooStore] DEBUG: Removing 
> /root_tablet/future_location
> 2015-06-10 21:07:40,614 [state.ZooTabletStateStore] DEBUG: unassign root 
> tablet location
> 2015-06-10 21:07:40,619 [tserver.TabletServer] INFO : unloaded +r<<
> {noformat}
> After this, however, we can see it tries to run a minor compaction on a user 
> tablet (1<<)
> {noformat}
> 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) !0;~<
> 2015-06-10 21:07:40,621 [tablet.Tablet] DEBUG: completeClose(saveState=true 
> completeClose=true) !0;~<
> 2015-06-10 21:07:40,622 [tablet.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) 1<<
> 2015-06-10 21:07:40,623 [tserver.NativeMap] DEBUG: Allocated native map 
> 0x0000000002e82e00
> 2015-06-10 21:07:40,624 [impl.ThriftScanner] DEBUG: Failed to locate tablet 
> for table : +r row : !0;
> 2015-06-10 21:07:40,633 [tablet.MinorCompactor] DEBUG: Begin minor compaction 
> hdfs://namenode:8020/apps/accumulo/data/tables/1/default_tablet/F0000007.rf_tmp
>  1<<
> 2015-06-10 21:07:40,759 [tablet.Compactor] DEBUG: Compaction 1<< 590 read | 
> 590 written | 12,553 entries/sec |  0.047 secs
> 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs for memory compacted: 1<< 
> tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
> 2015-06-10 21:07:40,767 [tablet.Tablet] DEBUG: Logs to be destroyed: 1<< 
> tserver1:9997/hdfs://namenode:8020/apps/accumulo/data/wal/tserver1+9997/09dc5a75-b9a1-45e9-bd5b-118ef811e99f
> 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: 
> BEFORE compactionThreshold = 0.550 maxObserved = 150,256
> 2015-06-10 21:07:40,777 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER 
> compactionThreshold = 0.605
> {noformat}
> So, as we might expect, this is never going to finish. The close won't happen 
> because root isn't assigned, and the master isn't going to reassign root 
> unless it gets a new tserver (which it won't because we're trying to stop the 
> entire instance).
> To verify this is what's gumming up the works:
> {noformat}
> "tablet migration 1" daemon prio=10 tid=0x0000000002e84800 nid=0x671 waiting 
> on condition [0x00007fb80a03a000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>       at java.lang.Thread.sleep(Native Method)
>       at 
> org.apache.accumulo.core.util.UtilWaitThread.sleep(UtilWaitThread.java:27)
>       at 
> org.apache.accumulo.core.client.impl.RootTabletLocator.locateTablet(RootTabletLocator.java:129)
>       at 
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.lookupTabletLocation(TabletLocatorImpl.java:460)
>       at 
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.lookupTabletLocationAndCheckLock(TabletLocatorImpl.java:634)
>       at 
> org.apache.accumulo.core.client.impl.TabletLocatorImpl._locateTablet(TabletLocatorImpl.java:620)
>       at 
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.locateTablet(TabletLocatorImpl.java:439)
>       at org.apache.accumulo.core.client.impl.Writer.update(Writer.java:88)
>       at 
> org.apache.accumulo.server.util.MetadataTableUtil.update(MetadataTableUtil.java:153)
>       at 
> org.apache.accumulo.server.util.MetadataTableUtil.update(MetadataTableUtil.java:145)
>       at 
> org.apache.accumulo.server.util.MasterMetadataUtil.updateTabletDataFile(MasterMetadataUtil.java:257)
>       at 
> org.apache.accumulo.tserver.tablet.Tablet.updateTabletDataFile(Tablet.java:2651)
>       - locked <0x00000000ab6fa2d8> (a java.lang.Object)
>       at 
> org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:417)
>       at 
> org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:956)
>       at 
> org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:84)
>       at 
> org.apache.accumulo.tserver.tablet.Tablet.initiateClose(Tablet.java:1406)
>       at org.apache.accumulo.tserver.tablet.Tablet.close(Tablet.java:1337)
>       at 
> org.apache.accumulo.tserver.TabletServer$UnloadTabletHandler.run(TabletServer.java:1999)
>       at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>       at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>       at java.lang.Thread.run(Thread.java:745)
>    Locked ownable synchronizers:
>       - <0x00000000ab6f7bd8> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>       - <0x00000000ab6fb8f8> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
>       - <0x00000000aba02208> (a 
> java.util.concurrent.ThreadPoolExecutor$Worker)
> "metadata tablet migration 1" daemon prio=10 tid=0x0000000002e84000 nid=0x670 
> waiting on condition [0x00007fb809f3a000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>       at java.lang.Thread.sleep(Native Method)
>       at 
> org.apache.accumulo.core.client.impl.ThriftScanner.pause(ThriftScanner.java:200)
>       at 
> org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:248)
>       at 
> org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:80)
>       at 
> org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:151)
>       at 
> org.apache.accumulo.server.util.MetadataTableUtil.getFileAndLogEntries(MetadataTableUtil.java:529)
>       at 
> org.apache.accumulo.tserver.tablet.Tablet.closeConsistencyCheck(Tablet.java:1505)
>       at 
> org.apache.accumulo.tserver.tablet.Tablet.completeClose(Tablet.java:1455)
>       - locked <0x00000000ab6ff8a8> (a 
> org.apache.accumulo.tserver.tablet.Tablet)
>       at org.apache.accumulo.tserver.tablet.Tablet.close(Tablet.java:1338)
>       at 
> org.apache.accumulo.tserver.TabletServer$UnloadTabletHandler.run(TabletServer.java:1999)
>       at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>       at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>       at java.lang.Thread.run(Thread.java:745)
>    Locked ownable synchronizers:
>       - <0x00000000ab9e8db0> (a 
> java.util.concurrent.ThreadPoolExecutor$Worker)
> {noformat}
> We can see that a Tablet close, specfically a Tablet close due to a 
> migration, is trying to find the root tablet so it can write the result of 
> the compaction to the metadata table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to