[ https://issues.apache.org/jira/browse/TRAFODION-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gonzalo E Correa reassigned TRAFODION-2664:
-------------------------------------------

    Assignee: Gonzalo E Correa

> Instance will be down when the zookeeper on name node has been down
> -------------------------------------------------------------------
>
>                 Key: TRAFODION-2664
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2664
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>   Affects Versions: 2.2-incubating
>         Environment: Test Environment:
> CDH5.4.8: 10.10.23.19:7180, total 6 nodes.
> HDFS-HA and DCS-HA: enabled
> OS: Centos6.8, physical machine.
> SW Build: R2.2.3 (EsgynDB_Enterprise Release 2.2.3 (Build release [sbroeder],
> branch 1ce8d39-xdc_nari, date 11Jun17)
>            Reporter: Jarek
>            Assignee: Gonzalo E Correa
>            Priority: Critical
>              Labels: build
>             Fix For: 2.2-incubating
>
>
> Description: The instance goes down when ZooKeeper on the name node has been
> stopped.
> Test Steps:
> Step 1. Start OE and 4 long queries with trafci on the first node,
> esggy-clu-n010.
> Step 2. Wait several minutes, then stop ZooKeeper on the name node
> esggy-clu-n010 from the Cloudera Manager page.
> Step 3. With trafci, run a basic query and the 4 long queries again.
> In Step 3, the whole instance is reported as down after a while.
> I ran this test scenario several times, and the instance went down every
> time.
> Timestamp:
> Test Start Time: 20170616132939
> Test End Time: 20170616134350
> Stop zookeeper on name node of node esggy-clu-n010: 20170616133344
> Check logs:
> 1) Each node displays the following error:
> 2017-06-16 13:33:46,276, ERROR, MON, Node Number: 0,, PIN: 5017 , Process
> Name: $MONITOR,,, TID: 5429, Message ID: 101371801,
> [CZClient::IsZNodeExpired], zoo_exists() for
> /trafodion/instance/cluster/esggy-clu-n010.esgyn.cn failed with error
> ZCONNECTIONLOSS
> 2) Zookeeper displays:
> ls /trafodion/instance/cluster
> []
> So it seems the zclient has been lost on each node.
> Location of logs:
> esggy-clu-n010:
> /data4/jarek/ha.interactive/trafodion_and_cluster_logs/cluster_logs.20170616134816.tar.gz
> and trafodion_logs.20170616134816.tar.gz
> By the way, the logs exceed the attachment size limit, so I cannot upload
> them as an attachment to this JIRA.
> How many ZooKeeper quorum servers are in the cluster? 3 in total.
> <property>
>   <name>dcs.zookeeper.quorum</name>
>   <value>esggy-clu-n010.esgyn.cn,esggy-clu-n011.esgyn.cn,esggy-clu-n012.esgyn.cn</value>
> </property>
> How to access the cluster?
> 1. Log in to 10.10.10.8 from a US machine: trafodion/traf123
> 2. Log in to 10.10.23.19 from 10.10.10.8: trafodion/traf123



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)