Umesh Agashe created HBASE-20330:
------------------------------------

             Summary: ProcedureExecutor.start() gets stuck in recover lease on 
store.
                 Key: HBASE-20330
                 URL: https://issues.apache.org/jira/browse/HBASE-20330
             Project: HBase
          Issue Type: Bug
          Components: proc-v2
    Affects Versions: 2.0.0-beta-2
            Reporter: Umesh Agashe
            Assignee: Umesh Agashe
             Fix For: 2.0.0


We have instance in our internal testing where master log is getting filled 
with following messages:
{code}
2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover 
lease on dfs file hdfs://ns1/hbase/MasterProcWALs/pv2-00000000000000000018.log
2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
Recovered lease, attempt=0 on 
file=hdfs://ns1/hbase/MasterProcWALs/pv2-00000000000000000018.log after 1ms
2018-04-02 17:11:17,574 WARN 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to read 
tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-00000000000000000018.log - 
Invalid Trailer version. got 111 expected 1
2018-04-02 17:11:17,576 ERROR 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with 
id=19 already exists
org.apache.hadoop.fs.FileAlreadyExistsException: 
/hbase/MasterProcWALs/pv2-00000000000000000019.log for client 10.17.202.11 
already exists
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
{code}

Debugging it further with [~appy] and [~avirmani], we found that when 
WALProcedureStore#rollWriter() fails and returns false for some reason, it 
keeps looping continuously.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to