[ https://issues.apache.org/jira/browse/HBASE-13971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726175#comment-14726175 ]
Stephen Yuan Jiang commented on HBASE-13971: -------------------------------------------- [~ndimiduk] and [~tedyu], should the patch also backport to 1.1.x? Or should we wait for fix of HBASE-14317? > Flushes stuck since 6 hours on a regionserver. > ---------------------------------------------- > > Key: HBASE-13971 > URL: https://issues.apache.org/jira/browse/HBASE-13971 > Project: HBase > Issue Type: Bug > Components: regionserver > Affects Versions: 1.3.0 > Environment: Caused while running IntegrationTestLoadAndVerify for 20 > M rows on cluster with 32 region servers each with max heap size of 24GBs. > Reporter: Abhilash > Assignee: Ted Yu > Priority: Critical > Fix For: 2.0.0, 1.2.0, 1.3.0 > > Attachments: 13971-v1.txt, 13971-v1.txt, 13971-v1.txt, jstack.1, > jstack.2, jstack.3, jstack.4, jstack.5, rsDebugDump.txt, screenshot-1.png > > > One region server stuck while flushing(possible deadlock). Its trying to > flush two regions since last 6 hours (see the screenshot). > Caused while running IntegrationTestLoadAndVerify for 20 M rows with 600 > mapper jobs and 100 back references. ~37 Million writes on each regionserver > till now but no writes happening on any regionserver from past 6 hours and > their memstore size is zero(I dont know if this is related). But this > particular regionserver has memstore size of 9GBs from past 6 hours. > Relevant snaps from debug dump: > Tasks: > =========================================================== > Task: Flushing > IntegrationTestLoadAndVerify,R\x9B\x1B\xBF\xAE\x08\xD1\xA2,1435179555993.8e2d075f94ce7699f416ec4ced9873cd. > Status: RUNNING:Preparing to flush by snapshotting stores in > 8e2d075f94ce7699f416ec4ced9873cd > Running for 22034s > Task: Flushing > IntegrationTestLoadAndVerify,\x93\xA385\x81Z\x11\xE6,1435179555993.9f8d0e01a40405b835bf6e5a22a86390. > Status: RUNNING:Preparing to flush by snapshotting stores in > 9f8d0e01a40405b835bf6e5a22a86390 > Running for 22033s > Executors: > =========================================================== > ... > Thread 139 (MemStoreFlusher.1): > State: WAITING > Blocked count: 139711 > Waited count: 239212 > Waiting on java.util.concurrent.CountDownLatch$Sync@b9c094a > Stack: > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) > org.apache.hadoop.hbase.wal.WALKey.getSequenceId(WALKey.java:305) > > org.apache.hadoop.hbase.regionserver.HRegion.getNextSequenceId(HRegion.java:2422) > > org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2168) > > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2047) > > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2011) > org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1902) > org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1828) > > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:510) > > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:471) > > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75) > > org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259) > java.lang.Thread.run(Thread.java:745) > Thread 137 (MemStoreFlusher.0): > State: WAITING > Blocked count: 138931 > Waited count: 237448 > Waiting on java.util.concurrent.CountDownLatch$Sync@53f41f76 > Stack: > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) > org.apache.hadoop.hbase.wal.WALKey.getSequenceId(WALKey.java:305) > > org.apache.hadoop.hbase.regionserver.HRegion.getNextSequenceId(HRegion.java:2422) > > org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2168) > > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2047) > > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2011) > org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1902) > org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1828) > > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:510) > > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:471) > > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75) > > org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259) > java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332)