[ https://issues.apache.org/jira/browse/HBASE-23601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013288#comment-17013288 ]
Hudson commented on HBASE-23601:
--------------------------------

Results for branch branch-2
	[build #2412 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2412/]: (x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2412//General_Nightly_Build_Report/]

(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2412//JDK8_Nightly_Build_Report_(Hadoop2)/]

(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2412//JDK8_Nightly_Build_Report_(Hadoop3)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

(/) {color:green}+1 client integration test{color}


> OutputSink.WriterThread exception gets stuck and repeated indefinitely
> ----------------------------------------------------------------------
>
>                 Key: HBASE-23601
>                 URL: https://issues.apache.org/jira/browse/HBASE-23601
>             Project: HBase
>          Issue Type: Bug
>          Components: read replicas
>    Affects Versions: 2.2.2
>            Reporter: Szabolcs Bukros
>            Assignee: Szabolcs Bukros
>            Priority: Major
>             Fix For: 3.0.0, 2.3.0, 2.1.9, 2.2.4
>
>
> When a WriterThread runs into an exception (e.g. NotServingRegionException),
> the exception is stored in the controller. It is never removed and cannot be
> overwritten either.
>
> {code:java}
> public void run() {
>   try {
>     doRun();
>   } catch (Throwable t) {
>     LOG.error("Exiting thread", t);
>     controller.writerThreadError(t);
>   }
> }{code}
> Because of this, every time PipelineController.checkForErrors() is called,
> the same old exception is rethrown.
>
> For example, in RegionReplicaReplicationEndpoint.replicate there is a while
> loop that does the actual replicating.
> Every time it loops, it calls
> checkForErrors(), catches the rethrown exception, logs it, and does nothing
> else with it. In my experience this produces ~2GB of log files in ~5 minutes.
>
> My proposal would be to clear the stored exception once it reaches
> RegionReplicaReplicationEndpoint.replicate and make sure we restart the
> WriterThread that died throwing it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
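The failure mode in the quoted description, and the "clear the stored exception" half of the proposal, can be illustrated with a small standalone sketch. It is not the HBase PipelineController code: the class and method names (ErrorController, checkForErrorsSticky, checkForErrorsAndClear, failingChecks) are hypothetical, and it assumes, as the description says, that the controller keeps only the first reported error. A sticky check rethrows that error on every call, while a check that atomically takes the error out surfaces it once.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical, simplified stand-in for a writer pipeline's error tracking.
 * Not the HBase PipelineController; all names here are illustrative.
 */
public class ErrorController {
    // First error reported by a writer thread, or null if none is pending.
    private final AtomicReference<Throwable> thrown = new AtomicReference<>();

    /** Called by a dying writer thread. Only the first error is kept; later ones are dropped. */
    public void writerThreadError(Throwable t) {
        thrown.compareAndSet(null, t);
    }

    /** Behaviour described in the issue: the error is never cleared, so every check rethrows it. */
    public void checkForErrorsSticky() throws IOException {
        Throwable t = thrown.get();
        if (t != null) {
            throw new IOException("writer thread failed", t);
        }
    }

    /** In the spirit of the proposal: take the error out atomically so it is rethrown only once. */
    public void checkForErrorsAndClear() throws IOException {
        Throwable t = thrown.getAndSet(null);
        if (t != null) {
            throw new IOException("writer thread failed", t);
        }
    }

    /** Runs n checks, loosely mimicking the replicate() while loop, and counts how many throw. */
    static int failingChecks(ErrorController c, boolean clear, int n) {
        int failures = 0;
        for (int i = 0; i < n; i++) {
            try {
                if (clear) {
                    c.checkForErrorsAndClear();
                } else {
                    c.checkForErrorsSticky();
                }
            } catch (IOException e) {
                failures++; // the loop in the issue only logs at this point
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        ErrorController sticky = new ErrorController();
        sticky.writerThreadError(new IllegalStateException("region not serving"));
        System.out.println(failingChecks(sticky, false, 5)); // prints 5

        ErrorController clearing = new ErrorController();
        clearing.writerThreadError(new IllegalStateException("region not serving"));
        System.out.println(failingChecks(clearing, true, 5)); // prints 1
    }
}
```

With the sticky check, every iteration of a loop like the one in replicate sees the same old exception, which is what fills the logs; with the clearing check it is reported once and a later failure can be recorded again. The other half of the proposal, restarting the WriterThread that died, is not modelled here.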