hf200012 opened a new issue #5292:
URL: https://github.com/apache/incubator-doris/issues/5292


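   Two days ago a problem occurred in production. I run three FEs; one FE went down and could not be restarted.
   
   After backing up that FE's metadata, I cleared the metadata on the node and used
   
   -- first drop
   ALTER SYSTEM DROP FOLLOWER "FE:9010"
   -- then add
   ALTER SYSTEM ADD FOLLOWER "FE:9010"
   to remove the node, then used --helper to join it to the cluster as a new FE. Startup then failed with an error and also brought down the Master FE. The exception details are as follows: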
   
   repImpl=com.sleepycat.je.rep.impl.RepImpl@68fa4d19 
props={REFRESH_VLSN=17921230, PORT=9010, HOSTNAME=172.22.197.238, 
P_NODETYPE1=ELECTABLE, NODE_NAME=172.22.197.238_9010_1611290318143, 
P_NODETYPE0=SECONDARY, P_NODENAME1=172.22.197.240_9010_1608972313975, 
P_PORT1=9010, P_NODENAME0=172.22.197.238_9010_1611290318143, P_PORT0=9010, 
P_HOSTNAME1=172.22.197.240, GROUP_NAME=PALO_JOURNAL_GROUP, 
P_HOSTNAME0=172.22.197.238, ENV_DIR=/hdd_data01/doris-meta/bdb, 
P_NUMPROVIDERS=2}
    at 
com.sleepycat.je.rep.InsufficientLogException.wrapSelf(InsufficientLogException.java:315)
 ~[je-7.3.7.jar:7.3.7]
    at 
com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1766) 
~[je-7.3.7.jar:7.3.7]
   
   
   2021-01-22 15:09:58,105 WARN (main|1) [Catalog.getClusterIdAndRole():890] 
current node is not added to the group. please add it first. sleep 5 seconds 
and retry, current helper nodes: [172.22.224.101:9010]
   2021-01-22 15:10:03,106 WARN (main|1) 
[Catalog.getFeNodeTypeAndNameFromHelpers():1039] failed to get fe node type 
from helper node: 172.22.224.101:9010.
   java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_261]
    at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:476) 
~[?:1.8.0_261]
    at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:218)
 ~[?:1.8.0_261]
    at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:200) 
~[?:1.8.0_261]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:394) ~[?:1.8.0_261]
    at java.net.Socket.connect(Socket.java:606) ~[?:1.8.0_261]
    at java.net.Socket.connect(Socket.java:555) ~[?:1.8.0_261]
    at sun.net.NetworkClient.doConnect(NetworkClient.java:180) ~[?:1.8.0_261]
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) 
~[?:1.8.0_261]
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) 
~[?:1.8.0_261]
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242) ~[?:1.8.0_261]
    at sun.net.www.http.HttpClient.New(HttpClient.java:339) ~[?:1.8.0_261]
    at sun.net.www.http.HttpClient.New(HttpClient.java:357) ~[?:1.8.0_261]
    at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1226)
 ~[?:1.8.0_261]
    at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
 ~[?:1.8.0_261]
    at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
 ~[?:1.8.0_261]
    at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990) 
~[?:1.8.0_261]
    at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1570)
 ~[?:1.8.0_261]
    at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnectio
   
   
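Later, with help from Ms. Miao and Chen Mingyu in the community, we made various attempts to locate the problem and concluded that metadata synchronization was failing at startup. This may have been caused by my load jobs modifying the metadata at the same time. Later, in the early morning, after the production workload had finished, I stopped all load jobs, deleted the metadata of the problem FE node, and started it again with --helper. It still reported an error. As a last resort, I copied the FE metadata from the master node to the problem FE: I deleted the problem FE's metadata directory, recreated it, and placed the copied metadata into it.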
   
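   (Temporary workaround) Steps:
   
   1. Stop all load jobs
   2. Delete the metadata directory and recreate it
   3. Copy the metadata from the master node to the problem FE node (remove the metadata_failure_recovery=true 
setting from fe.conf, or set it to false; this is very important)
   4. Run ALTER SYSTEM DROP FOLLOWER to remove the node
   5. Start the service on the problem node with --helper
   6. In the MySQL client, run ALTER SYSTEM ADD FOLLOWER to add the FE node back
   7. The node starts normally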
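
The two statements in steps 4 and 6 differ only in the verb, so they can be generated from the node address to avoid typos. The following is a minimal sketch, not part of the original report: the function name `follower_statements` is hypothetical, and the default port 9010 is an assumption taken from the addresses in the logs above.

```python
def follower_statements(host: str, edit_log_port: int = 9010) -> dict:
    """Build the DROP/ADD FOLLOWER statements for one FE node.

    Hypothetical helper: it only formats SQL text and does not connect
    to the cluster. The default port 9010 is assumed from the logs.
    """
    if not host or '"' in host:
        raise ValueError("host must be a non-empty string without quotes")
    if not 0 < edit_log_port < 65536:
        raise ValueError("port out of range")
    endpoint = f"{host}:{edit_log_port}"
    return {
        "drop": f'ALTER SYSTEM DROP FOLLOWER "{endpoint}";',
        "add": f'ALTER SYSTEM ADD FOLLOWER "{endpoint}";',
    }
```

The returned strings would then be run by hand in the MySQL client at the corresponding step.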
   
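   Notes:
   
   1. Problem node: remove the metadata_failure_recovery=true setting from fe.conf, or set it to false
   2. Master node: start it with metadata_failure_recovery=true to perform the recovery. Once it has started 
normally, remove this setting or set it to false, stop the Master FE, and restart it. After the restart 
completes, confirm that queries and loads on the Master work normally
   After the steps above are completed, start the FE on the problem node with --helper again. This time it starts normally and the problem is solved.
   
   Restart all load jobs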
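
Because the notes above require switching metadata_failure_recovery on and off at specific points, a small script can make that edit less error-prone. This is a minimal sketch under stated assumptions: the function name `set_recovery_flag` is hypothetical, it works on the text of fe.conf rather than on the file itself, and it assumes the usual `key = value` line format.

```python
KEY = "metadata_failure_recovery"


def set_recovery_flag(conf_text: str, enabled: bool) -> str:
    """Return fe.conf text with metadata_failure_recovery set explicitly.

    Hypothetical helper: commented-out lines are left untouched, duplicate
    active entries are collapsed into one, and the setting is appended
    if it was not present.
    """
    new_line = f"{KEY} = {'true' if enabled else 'false'}"
    out = []
    found = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        is_active_key = (
            not stripped.startswith("#")
            and stripped.split("=", 1)[0].strip() == KEY
        )
        if is_active_key:
            if not found:
                out.append(new_line)
                found = True
            continue  # drop any later duplicate of the same key
        out.append(line)
    if not found:
        out.append(new_line)
    return "\n".join(out) + "\n"
```

For the problem node the call would be `set_recovery_flag(text, False)`; for the Master it would be `True` for the recovery start and `False` again before the final restart.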


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


