[jira] [Assigned] (HBASE-24454) BucketCache disabled instantly before error duration toleration is reached due to timing issue
[ https://issues.apache.org/jira/browse/HBASE-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack reassigned HBASE-24454: - Assignee: Jacob LeBlanc (was: Michael Stack) > BucketCache disabled instantly before error duration toleration is reached > due to timing issue > -- > > Key: HBASE-24454 > URL: https://issues.apache.org/jira/browse/HBASE-24454 > Project: HBase > Issue Type: Bug > Components: BucketCache >Affects Versions: 1.4.10 > Environment: We saw this in HBase 1.4.10 (EMR 5.28.1), but I believe > the newest code still has this problem. >Reporter: Jacob LeBlanc >Assignee: Jacob LeBlanc >Priority: Major > Fix For: 3.0.0-alpha-1, 2.3.0, 1.7.0, 2.2.6 > > > We saw an instance where BucketCache was disabled after only two IO error > were encountered at nearly the same time. The following shows all errors from > a region server log for the 2020-05-26 17:00 hour. Notice that there are no > other errors until the 17:14:50 at which time the BucketCache is disabled > because it claims duration time has exceeded 6 ms: > $ grep ERROR hbase-hbase-regionserver-ip-172-20-113-147.log.2020-05-26-17 > 2020-05-26 17:14:50,396 ERROR [hfile-prefetch-1589117924884] > bucket.BucketCache: Failed reading block > 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache > 2020-05-26 17:14:50,397 ERROR > [regionserver/ip-172-20-113-147.us-west-2.compute.internal/172.20.113.147:16020-BucketCacheWriter-0] > bucket.BucketCache: Failed syncing IO engine > 2020-05-26 17:14:50,400 ERROR [hfile-prefetch-1589117924884] > bucket.BucketCache: Failed reading block > 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache > 2020-05-26 17:14:50,401 ERROR [hfile-prefetch-1589117924884] > bucket.BucketCache: IO errors duration time has exceeded 6ms, disabling > cache, please check your IOEngine > The region server is very busy so it should be constantly getting successful > reads and writes in the bucket cache (it is not as though there was some long > ago error and then no successful IO to clear the error flag). > We are running a busy EMR cluster backed by S3. A bucketcache getting > disabled is a huge performance issue because all reads must go to S3. > Looking at the code, I believe I've found a timing issue. Here is the code > for checking and setting the ioErrorStartTime (taken from BucketCache.java): > > {code:java} > /** >* Check whether we tolerate IO error this time. If the duration of IOEngine >* throwing errors exceeds ioErrorsDurationTimeTolerated, we will disable > the >* cache >*/ > private void checkIOErrorIsTolerated() { > long now = EnvironmentEdgeManager.currentTime(); > if (this.ioErrorStartTime > 0) { > if (cacheEnabled && (now - ioErrorStartTime) > > this.ioErrorsTolerationDuration) { > LOG.error("IO errors duration time has exceeded " + > ioErrorsTolerationDuration + > "ms, disabling cache, please check your IOEngine"); > disableCache(); > } > } else { > this.ioErrorStartTime = now; > } > } > {code} > > And here is the code for clearing the ioErrorStartTime when a successful read > or write is done: > {code:java} > if (this.ioErrorStartTime > 0) { > ioErrorStartTime = -1; > } > {code} > Notice that that if ioErrorStartTime is set to -1 after the first if > statement in checkIOErrorIsTolerated but before (now - ioErrorStartTime), > then (now - ioErrorStartTime) will evaluate to (now + 1) resulting in the > code thinking it has exceeded ioErrorsTolerationDuration. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HBASE-24454) BucketCache disabled instantly before error duration toleration is reached due to timing issue
[ https://issues.apache.org/jira/browse/HBASE-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacob LeBlanc reassigned HBASE-24454: - Assignee: Jacob LeBlanc > BucketCache disabled instantly before error duration toleration is reached > due to timing issue > -- > > Key: HBASE-24454 > URL: https://issues.apache.org/jira/browse/HBASE-24454 > Project: HBase > Issue Type: Bug > Components: BucketCache >Affects Versions: 1.4.10 > Environment: We saw this in HBase 1.4.10 (EMR 5.28.1), but I believe > the newest code still has this problem. >Reporter: Jacob LeBlanc >Assignee: Jacob LeBlanc >Priority: Major > > We saw an instance where BucketCache was disabled after only two IO error > were encountered at nearly the same time. The following shows all errors from > a region server log for the 2020-05-26 17:00 hour. Notice that there are no > other errors until the 17:14:50 at which time the BucketCache is disabled > because it claims duration time has exceeded 6 ms: > $ grep ERROR hbase-hbase-regionserver-ip-172-20-113-147.log.2020-05-26-17 > 2020-05-26 17:14:50,396 ERROR [hfile-prefetch-1589117924884] > bucket.BucketCache: Failed reading block > 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache > 2020-05-26 17:14:50,397 ERROR > [regionserver/ip-172-20-113-147.us-west-2.compute.internal/172.20.113.147:16020-BucketCacheWriter-0] > bucket.BucketCache: Failed syncing IO engine > 2020-05-26 17:14:50,400 ERROR [hfile-prefetch-1589117924884] > bucket.BucketCache: Failed reading block > 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache > 2020-05-26 17:14:50,401 ERROR [hfile-prefetch-1589117924884] > bucket.BucketCache: IO errors duration time has exceeded 6ms, disabling > cache, please check your IOEngine > The region server is very busy so it should be constantly getting successful > reads and writes in the bucket cache (it is not as though there was some long > ago error and then no successful IO to clear the error flag). > We are running a busy EMR cluster backed by S3. A bucketcache getting > disabled is a huge performance issue because all reads must go to S3. > Looking at the code, I believe I've found a timing issue. Here is the code > for checking and setting the ioErrorStartTime (taken from BucketCache.java): > > {code:java} > /** >* Check whether we tolerate IO error this time. If the duration of IOEngine >* throwing errors exceeds ioErrorsDurationTimeTolerated, we will disable > the >* cache >*/ > private void checkIOErrorIsTolerated() { > long now = EnvironmentEdgeManager.currentTime(); > if (this.ioErrorStartTime > 0) { > if (cacheEnabled && (now - ioErrorStartTime) > > this.ioErrorsTolerationDuration) { > LOG.error("IO errors duration time has exceeded " + > ioErrorsTolerationDuration + > "ms, disabling cache, please check your IOEngine"); > disableCache(); > } > } else { > this.ioErrorStartTime = now; > } > } > {code} > > And here is the code for clearing the ioErrorStartTime when a successful read > or write is done: > {code:java} > if (this.ioErrorStartTime > 0) { > ioErrorStartTime = -1; > } > {code} > Notice that that if ioErrorStartTime is set to -1 after the first if > statement in checkIOErrorIsTolerated but before (now - ioErrorStartTime), > then (now - ioErrorStartTime) will evaluate to (now + 1) resulting in the > code thinking it has exceeded ioErrorsTolerationDuration. -- This message was sent by Atlassian Jira (v8.3.4#803005)