[ https://issues.apache.org/jira/browse/HBASE-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265148#comment-17265148 ]
Prathyusha commented on HBASE-24972:
------------------------------------

[~stack] Below is the stack trace of a failure incident we have seen:

Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/table/SYSTEM.CATALOG
StackTrace:
org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1337)
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:354)
org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:625)
...
StackTraceId: 429763122

But yes, I see the retries in place wherever we are doing write operations. [~sandeep.guggilam] These retries should suffice, I guess. Any thoughts?

> Wait for connection attempt to succeed before performing operations on ZK
> -------------------------------------------------------------------------
>
>                 Key: HBASE-24972
>                 URL: https://issues.apache.org/jira/browse/HBASE-24972
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sandeep Guggilam
>            Assignee: Prathyusha
>            Priority: Minor
>
> Creating the connection with ZK is asynchronous, and the successful
> connection event is reported via the passed-in watcher. When we attempt any
> operation, we create a connection and then perform a read/write
> ([https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java#L323])
> without actually waiting for the notification event
> ([https://github.com/apache/hbase/blob/979edfe72046b2075adcc869c65ae820e6f3ec2d/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKWatcher.java#L582]).
>
> It is possible to get ConnectionLoss errors when we perform operations on ZK
> without waiting for the connection attempt to succeed.
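
For illustration, the race described in the issue can be sketched with JDK classes only. FakeAsyncClient below is a hypothetical stand-in, not the ZooKeeper or RecoverableZooKeeper API: its constructor returns immediately, the "connection" completes on another thread, and the watcher callback is invoked with a "SyncConnected" string. As a simplification, a read issued before that point fails right away. The sketch only shows the idea of blocking on a latch until the watcher fires; whether the actual fix for this issue takes this shape is an assumption.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class ConnectWaitSketch {

    /** Hypothetical stand-in for an asynchronously connecting client (not the ZooKeeper API). */
    static final class FakeAsyncClient implements AutoCloseable {
        private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        private volatile boolean connected = false;

        FakeAsyncClient(long connectDelayMillis, Consumer<String> watcher) {
            // Returns immediately; the connection completes later on another thread,
            // and only then is the watcher notified.
            scheduler.schedule(() -> {
                connected = true;
                watcher.accept("SyncConnected");
            }, connectDelayMillis, TimeUnit.MILLISECONDS);
        }

        String getData(String path) {
            // Simplification: fail at once if the connection is not yet established.
            if (!connected) {
                throw new IllegalStateException("ConnectionLoss for " + path);
            }
            return "data@" + path;
        }

        @Override
        public void close() {
            scheduler.shutdownNow();
        }
    }

    /** Reads right after constructing the client, ignoring the watcher event. */
    static String readWithoutWaiting(long connectDelayMillis, String path) {
        try (FakeAsyncClient client = new FakeAsyncClient(connectDelayMillis, event -> { })) {
            return client.getData(path);
        }
    }

    /** Blocks on a latch until the watcher reports the connection, then reads. */
    static String readAfterWaiting(long connectDelayMillis, long timeoutMillis, String path) {
        CountDownLatch connectedLatch = new CountDownLatch(1);
        Consumer<String> watcher = event -> {
            if ("SyncConnected".equals(event)) {
                connectedLatch.countDown();
            }
        };
        try (FakeAsyncClient client = new FakeAsyncClient(connectDelayMillis, watcher)) {
            if (!connectedLatch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("Timed out waiting for connection");
            }
            return client.getData(path);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for connection", e);
        }
    }

    public static void main(String[] args) {
        String path = "/hbase/table/SYSTEM.CATALOG";
        try {
            readWithoutWaiting(500, path);
        } catch (IllegalStateException e) {
            System.out.println("Without waiting: " + e.getMessage());
        }
        System.out.println("After waiting: " + readAfterWaiting(100, 5000, path));
    }
}
```

The timeout on the latch keeps a caller from blocking forever if the connection never comes up; the values used in main are arbitrary.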