[ https://issues.apache.org/jira/browse/HBASE-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303902#comment-14303902 ]
stack commented on HBASE-12958: ------------------------------- Odd. We go from acknowledging that the host with meta is down (c2022) to of a sudden forgetting about it: {code} 2015-02-02 22:32:15,574 INFO [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [c2022.halxg.cloudera.com,16020,1422944918568] 2015-02-02 22:32:15,575 DEBUG [main-EventThread] master.AssignmentManager: based on AM, current region=hbase:meta,,1.1588230740 is on server=c2022.halxg.cloudera.com,16020,1422944918568 server being checked: c2022.halxg.cloudera.com,16020,1422944918568 2015-02-02 22:32:15,575 DEBUG [main-EventThread] master.ServerManager: Added=c2022.halxg.cloudera.com,16020,1422944918568 to dead servers, submitted shutdown handler to be executed meta=true 2015-02-02 22:32:15,576 INFO [MASTER_META_SERVER_OPERATIONS-c2020:16020-1] handler.MetaServerShutdownHandler: Splitting hbase:meta logs for c2022.halxg.cloudera.com,16020,1422944918568 2015-02-02 22:32:15,577 DEBUG [main-EventThread] zookeeper.RegionServerTracker: Added tracking of RS /hbase/rs/c2023.halxg.cloudera.com,16020,1422945128068 2015-02-02 22:32:15,578 DEBUG [main-EventThread] zookeeper.RegionServerTracker: Added tracking of RS /hbase/rs/c2025.halxg.cloudera.com,16020,1422935795768 2015-02-02 22:32:15,578 DEBUG [main-EventThread] zookeeper.RegionServerTracker: Added tracking of RS /hbase/rs/c2024.halxg.cloudera.com,16020,1422944894206 2015-02-02 22:32:15,585 DEBUG [MASTER_META_SERVER_OPERATIONS-c2020:16020-1] master.MasterFileSystem: Renamed region directory: hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2022.halxg.cloudera.com,16020,1422944918568-splitting 2015-02-02 22:32:15,585 INFO [MASTER_META_SERVER_OPERATIONS-c2020:16020-1] master.SplitLogManager: dead splitlog workers [c2022.halxg.cloudera.com,16020,1422944918568] 2015-02-02 22:32:15,587 DEBUG [MASTER_META_SERVER_OPERATIONS-c2020:16020-1] master.SplitLogManager: Scheduling batch of logs to split 2015-02-02 22:32:15,587 INFO [MASTER_META_SERVER_OPERATIONS-c2020:16020-1] master.SplitLogManager: started splitting 1 logs in [hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2022.halxg.cloudera.com,16020,1422944918568-splitting] for [c2022.halxg.cloudera.com,16020,1422944918568] 2015-02-02 22:32:15,591 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination: put up splitlog task at znode /hbase/splitWAL/WALs%2Fc2022.halxg.cloudera.com%2C16020%2C1422944918568-splitting%2Fc2022.halxg.cloudera.com%252C16020%252C1422944918568..meta.1422945128892.meta 2015-02-02 22:32:15,591 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination: task not yet acquired /hbase/splitWAL/WALs%2Fc2022.halxg.cloudera.com%2C16020%2C1422944918568-splitting%2Fc2022.halxg.cloudera.com%252C16020%252C1422944918568..meta.1422945128892.meta ver = 0 2015-02-02 22:32:15,607 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fc2022.halxg.cloudera.com%2C16020%2C1422944918568-splitting%2Fc2022.halxg.cloudera.com%252C16020%252C1422944918568..meta.1422945128892.meta acquired by c2025.halxg.cloudera.com,16020,1422935795768 2015-02-02 22:32:15,929 INFO [c2020.halxg.cloudera.com,16020,1422944946802.splitLogManagerTimeoutMonitor] coordination.SplitLogManagerCoordination: resubmitting task /hbase/splitWAL/WALs%2Fc2021.halxg.cloudera.com%2C16020%2C1422944889403-splitting%2Fc2021.halxg.cloudera.com%252C16020%252C1422944889403.default.1422945068674 2015-02-02 22:32:15,941 INFO [c2020.halxg.cloudera.com,16020,1422944946802.splitLogManagerTimeoutMonitor] master.SplitLogManager: resubmitted 1 out of 3 tasks 2015-02-02 22:32:15,941 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination: task not yet acquired /hbase/splitWAL/WALs%2Fc2021.halxg.cloudera.com%2C16020%2C1422944889403-splitting%2Fc2021.halxg.cloudera.com%252C16020%252C1422944889403.default.1422945068674 ver = 3 2015-02-02 22:32:15,949 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/RESCAN0000004442 entered state: DONE c2020.halxg.cloudera.com,16020,1422944946802 2015-02-02 22:32:15,957 DEBUG [main-EventThread] coordination.ZKSplitLogManagerCoordination$DeleteAsyncCallback: deleted /hbase/splitWAL/RESCAN0000004442 2015-02-02 22:32:15,957 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination: deleted task without in memory state /hbase/splitWAL/RESCAN0000004442 2015-02-02 22:32:16,007 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fc2021.halxg.cloudera.com%2C16020%2C1422944889403-splitting%2Fc2021.halxg.cloudera.com%252C16020%252C1422944889403.default.1422945068674 acquired by c2024.halxg.cloudera.com,16020,1422944894206 2015-02-02 22:32:16,208 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fc2021.halxg.cloudera.com%2C16020%2C1422944889403-splitting%2Fc2021.halxg.cloudera.com%252C16020%252C1422944889403.default.1422945090502 entered state: DONE c2023.halxg.cloudera.com,16020,1422945128068 2015-02-02 22:32:16,219 INFO [main-EventThread] wal.WALSplitter: Archived processed log hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2021.halxg.cloudera.com,16020,1422944889403-splitting/c2021.halxg.cloudera.com%2C16020%2C1422944889403.default.1422945090502 to hdfs://c2020.halxg.cloudera.com:8020/hbase/oldWALs/c2021.halxg.cloudera.com%2C16020%2C1422944889403.default.1422945090502 2015-02-02 22:32:16,220 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fc2021.halxg.cloudera.com%2C16020%2C1422944889403-splitting%2Fc2021.halxg.cloudera.com%252C16020%252C1422944889403.default.1422945090502 2015-02-02 22:32:16,224 DEBUG [main-EventThread] coordination.ZKSplitLogManagerCoordination$DeleteAsyncCallback: deleted /hbase/splitWAL/WALs%2Fc2021.halxg.cloudera.com%2C16020%2C1422944889403-splitting%2Fc2021.halxg.cloudera.com%252C16020%252C1422944889403.default.1422945090502 2015-02-02 22:32:17,225 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fc2021.halxg.cloudera.com%2C16020%2C1422944889403-splitting%2Fc2021.halxg.cloudera.com%252C16020%252C1422944889403.default.1422945068674 entered state: DONE c2024.halxg.cloudera.com,16020,1422944894206 2015-02-02 22:32:17,235 INFO [main-EventThread] wal.WALSplitter: Archived processed log hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2021.halxg.cloudera.com,16020,1422944889403-splitting/c2021.halxg.cloudera.com%2C16020%2C1422944889403.default.1422945068674 to hdfs://c2020.halxg.cloudera.com:8020/hbase/oldWALs/c2021.halxg.cloudera.com%2C16020%2C1422944889403.default.1422945068674 2015-02-02 22:32:17,236 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fc2021.halxg.cloudera.com%2C16020%2C1422944889403-splitting%2Fc2021.halxg.cloudera.com%252C16020%252C1422944889403.default.1422945068674 2015-02-02 22:32:17,241 DEBUG [main-EventThread] coordination.ZKSplitLogManagerCoordination$DeleteAsyncCallback: deleted /hbase/splitWAL/WALs%2Fc2021.halxg.cloudera.com%2C16020%2C1422944889403-splitting%2Fc2021.halxg.cloudera.com%252C16020%252C1422944889403.default.1422945068674 2015-02-02 22:32:17,244 INFO [MASTER_SERVER_OPERATIONS-c2020:16020-2] master.SplitLogManager: finished splitting (more than or equal to) 1955009811 bytes in 16 log files in [hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2021.halxg.cloudera.com,16020,1422944889403-splitting] in 7355ms 2015-02-02 22:32:17,244 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16020-2] master.RegionStates: Adding to processed servers c2021.halxg.cloudera.com,16020,1422944889403 2015-02-02 22:32:17,244 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16020-2] master.RegionStates: Offline splitting/merging region {7e0c50a2296895f66bee85df0eaba533 state=SPLITTING, ts=1422945115260, server=c2021.halxg.cloudera.com,16020,1422944889403} 2015-02-02 22:32:17,553 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16020-2] ipc.AbstractRpcClient: Use SIMPLE authentication for service ClientService, sasl=false 2015-02-02 22:32:17,553 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16020-2] ipc.AbstractRpcClient: Connecting to c2022.halxg.cloudera.com/10.20.84.28:16020 2015-02-02 22:32:17,554 INFO [MASTER_SERVER_OPERATIONS-c2020:16020-2] client.RpcRetryingCaller: Call exception, tries=1, retries=350, started=305 ms ago, cancelled=false, msg=row '54d2e8d62ae8d9cae3abc33f1b58e32c' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=c2022.halxg.cloudera.com,16020,1422944918568, seqNum=0 2015-02-02 22:32:18,055 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16020-2] ipc.AbstractRpcClient: Use SIMPLE authentication for service ClientService, sasl=false 2015-02-02 22:32:18,055 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16020-2] ipc.AbstractRpcClient: Not trying to connect to c2022.halxg.cloudera.com/10.20.84.28:16020 this server is in the failed servers list 2015-02-02 22:32:18,056 INFO [MASTER_SERVER_OPERATIONS-c2020:16020-2] client.RpcRetryingCaller: Call exception, tries=2, retries=350, started=807 ms ago, cancelled=false, msg=row '54d2e8d62ae8d9cae3abc33f1b58e32c' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=c2022.halxg.cloudera.com,16020,1422944918568, seqNum=0 2015-02-02 22:32:18,930 INFO [c2020.halxg.cloudera.com,16020,1422944946802.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2Fc2022.halxg.cloudera.com%2C16020%2C1422944918568-splitting%2Fc2022.halxg.cloudera.com%252C16020%252C1422944918568..meta.1422945128892.meta=last_update = 1422945135625 last_version = 2 cur_worker_name = c2025.halxg.cloudera.com,16020,1422935795768 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0} 2015-02-02 22:32:19,059 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16020-2] ipc.AbstractRpcClient: Use SIMPLE authentication for service ClientService, sasl=false 2015-02-02 22:32:19,059 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16020-2] ipc.AbstractRpcClient: Not trying to connect to c2022.halxg.cloudera.com/10.20.84.28:16020 this server is in the failed servers list 2015-02-02 22:32:19,059 INFO [MASTER_SERVER_OPERATIONS-c2020:16020-2] client.RpcRetryingCaller: Call exception, tries=3, retries=350, started=1810 ms ago, cancelled=false, msg=row '54d2e8d62ae8d9cae3abc33f1b58e32c' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=c2022.halxg.cloudera.com,16020,1422944918568, seqNum=0 2015-02-02 22:32:20,258 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fc2022.halxg.cloudera.com%2C16020%2C1422944918568-splitting%2Fc2022.halxg.cloudera.com%252C16020%252C1422944918568..meta.1422945128892.meta entered state: DONE c2025.halxg.cloudera.com,16020,1422935795768 2015-02-02 22:32:20,269 INFO [main-EventThread] wal.WALSplitter: Archived processed log hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2022.halxg.cloudera.com,16020,1422944918568-splitting/c2022.halxg.cloudera.com%2C16020%2C1422944918568..meta.1422945128892.meta to hdfs://c2020.halxg.cloudera.com:8020/hbase/oldWALs/c2022.halxg.cloudera.com%2C16020%2C1422944918568..meta.1422945128892.meta 2015-02-02 22:32:20,269 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fc2022.halxg.cloudera.com%2C16020%2C1422944918568-splitting%2Fc2022.halxg.cloudera.com%252C16020%252C1422944918568..meta.1422945128892.meta 2015-02-02 22:32:20,273 WARN [MASTER_META_SERVER_OPERATIONS-c2020:16020-1] master.SplitLogManager: returning success without actually splitting and deleting all the log files in path hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2022.halxg.cloudera.com,16020,1422944918568-splitting 2015-02-02 22:32:20,273 INFO [MASTER_META_SERVER_OPERATIONS-c2020:16020-1] master.SplitLogManager: finished splitting (more than or equal to) 83 bytes in 1 log files in [hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2022.halxg.cloudera.com,16020,1422944918568-splitting] in 4686ms 2015-02-02 22:32:20,274 DEBUG [main-EventThread] coordination.ZKSplitLogManagerCoordination$DeleteAsyncCallback: deleted /hbase/splitWAL/WALs%2Fc2022.halxg.cloudera.com%2C16020%2C1422944918568-splitting%2Fc2022.halxg.cloudera.com%252C16020%252C1422944918568..meta.1422945128892.meta 2015-02-02 22:32:21,051 DEBUG [main-EventThread] zookeeper.RegionServerTracker: Added tracking of RS /hbase/rs/c2021.halxg.cloudera.com,16020,1422945139174 2015-02-02 22:32:21,052 DEBUG [main-EventThread] zookeeper.RegionServerTracker: Added tracking of RS /hbase/rs/c2023.halxg.cloudera.com,16020,1422945128068 2015-02-02 22:32:21,052 DEBUG [main-EventThread] zookeeper.RegionServerTracker: Added tracking of RS /hbase/rs/c2025.halxg.cloudera.com,16020,1422935795768 2015-02-02 22:32:21,052 DEBUG [main-EventThread] zookeeper.RegionServerTracker: Added tracking of RS /hbase/rs/c2024.halxg.cloudera.com,16020,1422944894206 2015-02-02 22:32:21,071 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16020-2] ipc.AbstractRpcClient: Use SIMPLE authentication for service ClientService, sasl=false 2015-02-02 22:32:21,071 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16020-2] ipc.AbstractRpcClient: Connecting to c2022.halxg.cloudera.com/10.20.84.28:16020 2015-02-02 22:32:21,072 INFO [MASTER_SERVER_OPERATIONS-c2020:16020-2] client.RpcRetryingCaller: Call exception, tries=4, retries=350, started=3823 ms ago, cancelled=false, msg=row '54d2e8d62ae8d9cae3abc33f1b58e32c' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=c2022.halxg.cloudera.com,16020,1422944918568, seqNum=0 2015-02-02 22:32:21,111 DEBUG [RpcServer.listener,port=16020] ipc.RpcServer: RpcServer.listener,port=16020: connection from 10.20.84.27:35695; # active connections: 6 {code} [~jxiang] Any input boss? Why would we split the meta log and then not assign hbase:meta? Thanks sir. > SSH doing hbase:meta get but hbase:meta not assigned > ---------------------------------------------------- > > Key: HBASE-12958 > URL: https://issues.apache.org/jira/browse/HBASE-12958 > Project: HBase > Issue Type: Bug > Affects Versions: 1.0.0 > Reporter: stack > Assignee: stack > > All master threads are blocked waiting on this call to return: > {code} > "MASTER_SERVER_OPERATIONS-c2020:16020-2" #189 prio=5 os_prio=0 > tid=0x00007f4b0408b000 nid=0x7821 in Object.wait() [0x00007f4ada24d000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:168) > - locked <0x000000041c374f50> (a > java.util.concurrent.atomic.AtomicBoolean) > at org.apache.hadoop.hbase.client.HTable.get(HTable.java:881) > at > org.apache.hadoop.hbase.MetaTableAccessor.get(MetaTableAccessor.java:208) > at > org.apache.hadoop.hbase.MetaTableAccessor.getRegionLocation(MetaTableAccessor.java:250) > at > org.apache.hadoop.hbase.MetaTableAccessor.getRegion(MetaTableAccessor.java:225) > at > org.apache.hadoop.hbase.master.RegionStates.serverOffline(RegionStates.java:634) > - locked <0x000000041c1f0d80> (a > org.apache.hadoop.hbase.master.RegionStates) > at > org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:3298) > at > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:226) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Master is stuck trying to find hbase:meta on the server that just crashed and > that we just recovered: > Mon Feb 02 23:00:02 PST 2015, null, java.net.SocketTimeoutException: > callTimeout=60000, callDuration=68181: row '' on table 'hbase:meta' at > region=hbase:meta,,1.1588230740, > hostname=c2022.halxg.cloudera.com,16020,1422944918568, seqNum=0 > Will add more detail in a sec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)