[ https://issues.apache.org/jira/browse/HBASE-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-3139:
-------------------------
Attachment: 3139-v5.txt
This patch sets core == max threads and removes the special blocking queue we
had in place, the one that tried to pretend it was at full capacity when it was
not. The patch also adds a test that we don't throw an exception when more than
maxThreads items are in the queue. I've tested this patch on a cluster and it
seems to do the right thing now. I turned down some of the pool sizes in the
master and regionserver since they were set back when we thought things weren't
running fast enough, when in actuality we were not getting any parallelism.
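For reference, here is a minimal standalone sketch (not the patch itself; the
class name, pool sizes and dummy task are made up) of the
java.util.concurrent.ThreadPoolExecutor behavior in question: with core < max
and a work queue that reports spare capacity, the pool never grows past core,
which is why the old code used a queue that pretended to be full -- and that
trick in turn risks RejectedExecutionException once maxThreads workers are
busy. Setting core == max over a plain LinkedBlockingQueue gives the
parallelism and queues the overflow instead of rejecting it.
{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolBehaviorSketch {
  public static void main(String[] args) throws Exception {
    // Old shape: core (1) < max (5). The unbounded queue always accepts tasks,
    // so execute() never starts more than the single core thread.
    ThreadPoolExecutor coreLtMax = new ThreadPoolExecutor(
        1, 5, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
    for (int i = 0; i < 10; i++) coreLtMax.execute(sleepTask());
    System.out.println("core < max, workers started: " + coreLtMax.getPoolSize()); // prints 1
    coreLtMax.shutdownNow();

    // Patched shape: core == max == 5. The first five submissions each start a
    // worker; later submissions queue rather than being rejected.
    ThreadPoolExecutor coreEqMax = new ThreadPoolExecutor(
        5, 5, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
    for (int i = 0; i < 10; i++) coreEqMax.execute(sleepTask());
    System.out.println("core == max, workers started: " + coreEqMax.getPoolSize()); // prints 5
    coreEqMax.shutdownNow();
  }

  private static Runnable sleepTask() {
    return new Runnable() {
      public void run() {
        try { Thread.sleep(5000); } catch (InterruptedException e) { /* quit */ }
      }
    };
  }
}
{code}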
Chatting w/ Jon, we don't need a priority queue, so I need to fix that. I tried
to use Executors#newFixedThreadPool and #newCachedThreadPool, but these exhibit
the same brain-deadedness that we saw where core < maxThreads.
Will put up a new patch soon.
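In the meantime, a rough sketch of how the pool could be built directly instead
of via the Executors factories; the thread-name prefix, sizes, and the
core-thread timeout below are illustrative guesses, not necessarily what the
new patch will end up doing:
{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class HandlerPoolSketch {
  public static ThreadPoolExecutor newHandlerPool(final String namePrefix, int threads) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        threads, threads,                    // core == max so we really get N workers
        60, TimeUnit.SECONDS,                // idle workers may exit after a minute
        new LinkedBlockingQueue<Runnable>(), // unbounded FIFO: overflow queues, never rejects
        new ThreadFactory() {
          private final AtomicInteger count = new AtomicInteger();
          public Thread newThread(Runnable r) {
            Thread t = new Thread(r, namePrefix + "-" + count.incrementAndGet());
            t.setDaemon(true);
            return t;
          }
        });
    // Without this, core threads hang around forever even when the pool is idle.
    pool.allowCoreThreadTimeOut(true);
    return pool;
  }
}
{code}
E.g. newHandlerPool("MASTER_SERVER_OPERATIONS", 5) would give five named daemon
workers and a queue that just grows under load instead of throwing.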
> Server shutdown processor stuck because meta not online
> -------------------------------------------------------
>
> Key: HBASE-3139
> URL: https://issues.apache.org/jira/browse/HBASE-3139
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Fix For: 0.90.0
>
> Attachments: 3139-v5.txt, ExecutorServiceUnitTest.patch
>
>
> Playing with rolling restarts, I see that the servers hosting root and meta can
> go down close to each other. In the log below, note how we are processing the
> server hosting -ROOT-, and part of its processing involves reading .META.
> content to see what regions it was carrying. Well, note that .META. is offline
> at the time (our verification attempt failed because the server had just been
> shut down and verification got a ConnectException). So we pause the server
> shutdown processing till .META. comes back online -- only it never does.
> {code}
> 2010-10-21 07:32:23,931 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting ROOT region location in ZooKeeper
> 2010-10-21 07:32:23,953 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for sv2borg182,60020,1287645693959
> 2010-10-21 07:32:23,994 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting ROOT region location in ZooKeeper
> 2010-10-21 07:32:24,020 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x12bcd5b344b0115 Creating (or updating) unassigned node for 70236052 with OFFLINE state
> 2010-10-21 07:32:24,045 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan for -ROOT-,,0.70236052 so generated a random one; hri=-ROOT-,,0.70236052, src=, dest=sv2borg181,60020,1287646329081; 8 (online=8, exclude=null) available servers
> 2010-10-21 07:32:24,045 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region -ROOT-,,0.70236052 to sv2borg181,60020,1287646329081
> 2010-10-21 07:32:24,048 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of .META.,,1; java.net.ConnectException: Connection refused
> 2010-10-21 07:32:24,048 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Current cached META location is not valid, resetting
> 2010-10-21 07:32:24,079 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=sv2borg181,60020,1287646329081, region=70236052/-ROOT-
> 2010-10-21 07:32:24,162 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=sv2borg181,60020,1287646329081, region=70236052/-ROOT-
> 2010-10-21 07:32:24,212 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=sv2borg181,60020,1287646329081, region=70236052/-ROOT-
> 2010-10-21 07:32:24,212 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 70236052; deleting unassigned node
> 2010-10-21 07:32:24,212 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x12bcd5b344b0115 Deleting existing unassigned node for 70236052 that is in expected state RS_ZK_REGION_OPENED
> 2010-10-21 07:32:24,238 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052
> 2010-10-21 07:32:27,902 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=sv2borg183,60020,1287646347597, regionCount=0, userLoad=false
> 2010-10-21 07:32:30,523 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [sv2borg184,60020,1287645693960]
> 2010-10-21 07:32:30,523 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=sv2borg184,60020,1287645693960 to dead servers, submitted shutdown handler to be executed
> 2010-10-21 07:32:36,254 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=sv2borg184,60020,1287646355951, regionCount=0, userLoad=false
> 2010-10-21 07:32:39,567 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [sv2borg185,60020,1287645693959]
> 2010-10-21 07:32:39,567 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=sv2borg185,60020,1287645693959 to dead servers, submitted shutdown handler to be executed
> 2010-10-21 07:32:45,614 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=sv2borg185,60020,1287646365304, regionCount=0, userLoad=false
> 2010-10-21 07:32:48,652 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [sv2borg186,60020,1287645693962]
> 2010-10-21 07:32:48,652 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=sv2borg186,60020,1287645693962 to dead servers, submitted shutdown handler to be executed
> 2010-10-21 07:32:50,097 INFO org.apache.hadoop.hbase.master.ServerManager: regionservers=8, averageload=93.38, deadservers=[sv2borg185,60020,1287645693959, sv2borg183,60020,1287645693959, sv2borg182,60020,1287645693959, sv2borg184,60020,1287645693960, sv2borg186,60020,1287645693962]
> ....
> {code}
> We're supposed to have a pool of 5 executor threads to handle server
> shutdowns. I see one executor stuck waiting on .META., but I don't see any
> others running. Odd. Trying to figure out why there is only 1 executor.
> {code}
> "MASTER_SERVER_OPERATIONS-sv2borg180:60000-1" daemon prio=10
> tid=0x0000000041dc7000 nid=0x50a4 in Object.wait() [0x00007f285d537000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at
> org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:324)
> - locked <0x00007f286d150ce8> (a
> java.util.concurrent.atomic.AtomicBoolean)
> at
> org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:359)
> at
> org.apache.hadoop.hbase.catalog.MetaReader.getServerUserRegions(MetaReader.java:487)
> at
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:115)
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:150)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.