[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534592#comment-15534592 ]
Hudson commented on HBASE-16721:
--------------------------------

FAILURE: Integrated in Jenkins build HBase-1.3-JDK7 #25 (See [https://builds.apache.org/job/HBase-1.3-JDK7/25/])
HBASE-16721 Concurrency issue in WAL unflushed seqId tracking (enis: rev f77f1530d4cebd1679bc1c27782bc283638dbd5f)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WAL.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestFSHLog.java


> Concurrency issue in WAL unflushed seqId tracking
> -------------------------------------------------
>
>                 Key: HBASE-16721
>                 URL: https://issues.apache.org/jira/browse/HBASE-16721
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>            Priority: Critical
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 1.2.4, 1.1.8
>
>         Attachments: hbase-16721_v1.branch-1.patch, hbase-16721_v2.branch-1.patch, hbase-16721_v2.master.patch
>
>
> I'm inspecting an interesting case where, in a production cluster, some regionservers end up accumulating hundreds of WAL files, even with force flushes going on due to max logs. This happened multiple times on that cluster, but not on other clusters. The cluster has the periodic memstore flusher disabled; however, that still does not explain why the force flush of regions due to the max-logs limit is not working. I think the periodic memstore flusher just masks the underlying problem, which is why we do not see this on other clusters.
> The problem starts like this:
> {code}
> 2016-09-21 17:49:18,272 INFO [regionserver//10.2.0.55:16020.logRoller] wal.FSHLog: Too many wals: logs=33, maxlogs=32; forcing flush of 1 regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-21 17:49:18,273 WARN [regionserver//10.2.0.55:16020.logRoller] regionserver.LogRoller: Failed to schedule flush of d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> Then it continues until the RS is restarted:
> {code}
> 2016-09-23 17:43:49,356 INFO [regionserver//10.2.0.55:16020.logRoller] wal.FSHLog: Too many wals: logs=721, maxlogs=32; forcing flush of 1 regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-23 17:43:49,357 WARN [regionserver//10.2.0.55:16020.logRoller] regionserver.LogRoller: Failed to schedule flush of d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> The problem is that region {{d4cf39dc40ea79f5da4d0cf66d03cb1f}} was already split some time ago, and was able to flush its data and split without any problems. However, the FSHLog still thinks that there is unflushed data for this region.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
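
For readers unfamiliar with the seqId accounting the description refers to, here is a minimal, hypothetical Java sketch of how a per-region lowest-unflushed-seqId map can drive the "Too many wals" flush election, and how a stale entry for an already-split region leads to the endless "Failed to schedule flush ... region=null" loop quoted above. This is not the actual FSHLog/HRegion code from the patch; all class and method names below are illustrative assumptions.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of per-region unflushed-seqId tracking (not the real FSHLog API).
public class UnflushedSeqIdTrackerSketch {

  /** Region encoded name -> lowest unflushed sequence id still pinned in the WALs. */
  private final Map<String, Long> lowestUnflushedSeqIds = new ConcurrentHashMap<>();

  /** On an append for a region, remember the lowest unflushed seqId (first append wins). */
  void onAppend(String encodedRegionName, long seqId) {
    lowestUnflushedSeqIds.putIfAbsent(encodedRegionName, seqId);
  }

  /** On flush completion, the region's edits are durable; its entry should be cleared. */
  void onFlushCompleted(String encodedRegionName) {
    lowestUnflushedSeqIds.remove(encodedRegionName);
  }

  /**
   * Log-roller side: when there are too many WALs, elect the region whose unflushed
   * edits pin the oldest WAL and ask for its flush. If the tracked entry is stale
   * (the region was already flushed and split away), the server's region lookup
   * returns null, the flush can never be scheduled, and the WAL count keeps growing.
   */
  String findRegionToFlush() {
    return lowestUnflushedSeqIds.entrySet().stream()
        .min(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElse(null);
  }

  public static void main(String[] args) {
    UnflushedSeqIdTrackerSketch tracker = new UnflushedSeqIdTrackerSketch();
    tracker.onAppend("d4cf39dc40ea79f5da4d0cf66d03cb1f", 100L);
    // Assumed race: the flush-completed cleanup is lost (e.g. it races with a concurrent
    // append), so the entry is never removed even though the region flushed and split.
    // The roller then keeps electing a region that is no longer open on this server.
    System.out.println("Forcing flush of: " + tracker.findRegionToFlush());
  }
}
{code}

The sketch only shows why a single stale map entry is enough to pin WALs indefinitely; the actual fix in HRegion.java, WAL.java and TestFSHLog.java addresses the concurrency in how that tracking is updated.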