Did the vendor say whether the patch is for HBase or some other component?
Thanks

On Wed, Feb 28, 2018 at 6:33 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Thanks for the feedback. You are right, the bucket cache is getting
> disabled due to too many I/O errors from the underlying files making up
> the bucket cache. We still do not know the exact underlying cause, but we
> are working with our vendor to test a patch they provided that seems to
> have resolved the issue for now. They say that if it works out well, they
> will eventually try to promote the patch to the open source versions.
>
> Cheers.
>
> ----
> Saad
>
> On Sun, Feb 25, 2018 at 11:10 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Here is the related code for disabling the bucket cache:
> >
> >     if (this.ioErrorStartTime > 0) {
> >       if (cacheEnabled && (now - ioErrorStartTime) > this.ioErrorsTolerationDuration) {
> >         LOG.error("IO errors duration time has exceeded " + ioErrorsTolerationDuration +
> >             "ms, disabling cache, please check your IOEngine");
> >         disableCache();
> >
> > Can you search the region server log to see if the above occurred?
> >
> > Was this server the only one with a disabled cache?
> >
> > Cheers
> >
> > On Sun, Feb 25, 2018 at 6:20 AM, Saad Mufti <saad.mu...@oath.com.invalid>
> > wrote:
> >
> > > Hi,
> > >
> > > I am running an HBase 1.3.1 cluster on AWS EMR. The bucket cache is
> > > configured to use two attached EBS disks of 50 GB each, and to be on
> > > the safe side I provisioned the bucket cache at a bit less than the
> > > total, 98 GB per instance. My tables have column families set to
> > > prefetch on open.
> > >
> > > On some instances during cluster startup, the bucket cache starts
> > > throwing errors, and eventually the bucket cache gets completely
> > > disabled on that instance. The instance still stays up as a valid
> > > region server, and the only clue in the region server UI is that the
> > > bucket cache tab reports a count of 0 and a size of 0 bytes.
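For context, a file-backed bucket cache of the kind described above is configured in hbase-site.xml roughly as follows. This is a sketch, not the poster's actual config: the path matches the one in the log excerpts, but the size value and the error-toleration property (which, to the best of my knowledge, is what governs the `ioErrorsTolerationDuration` window in the code Ted quoted, defaulting to one minute) are illustrative.

```xml
<!-- Sketch of a file-backed BucketCache config; values are illustrative. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <!-- /mnt1/hbase/bucketcache matches the path in the log excerpts -->
  <value>file:/mnt1/hbase/bucketcache</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <!-- cache size in MB; ~98 GB as described in the thread -->
  <value>100352</value>
</property>
<property>
  <!-- how long BucketCache tolerates IOEngine errors before calling
       disableCache(); believed to default to 60000 ms (1 minute) -->
  <name>hbase.bucketcache.ioengine.errors.tolerated.duration</name>
  <value>60000</value>
</property>
```

Raising the toleration duration would only paper over transient EBS stalls; it would not address the underlying I/O errors discussed in the thread.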
> > > I have already opened a ticket with AWS to see if there are problems
> > > with the EBS volumes, but I wanted to tap the open source community's
> > > hive mind to see what kind of problem would cause the bucket cache to
> > > get disabled. If the application depends on the bucket cache for
> > > performance, wouldn't it be better to just remove that region server
> > > from the pool if its bucket cache cannot be recovered/re-enabled?
> > >
> > > The errors look like the following. I would appreciate any insight, thanks:
> > >
> > > 2018-02-25 01:12:47,780 ERROR [hfile-prefetch-1519513834057]
> > > bucket.BucketCache: Failed reading block
> > > 332b0634287f4c42851bc1a55ffe4042_1348128 from bucket cache
> > > java.nio.channels.ClosedByInterruptException
> > >     at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
> > >     at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:746)
> > >     at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine$FileReadAccessor.access(FileIOEngine.java:219)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.accessFile(FileIOEngine.java:170)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.read(FileIOEngine.java:105)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.getBlock(BucketCache.java:492)
> > >     at org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.getBlock(CombinedBlockCache.java:84)
> > >     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.getCachedBlock(HFileReaderV2.java:279)
> > >     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:420)
> > >     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$1.run(HFileReaderV2.java:209)
> > >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > >     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> > >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> > >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >     at java.lang.Thread.run(Thread.java:748)
> > >
> > > and
> > >
> > > 2018-02-25 01:12:52,432 ERROR [regionserver/ip-xx-xx-xx-xx.xx-xx-xx.us-east-1.ec2.xx.net/xx.xx.xx.xx:16020-BucketCacheWriter-7]
> > > bucket.BucketCache: Failed writing to bucket cache
> > > java.nio.channels.ClosedChannelException
> > >     at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
> > >     at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:758)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine$FileWriteAccessor.access(FileIOEngine.java:227)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.accessFile(FileIOEngine.java:170)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.write(FileIOEngine.java:116)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache$RAMQueueEntry.writeToCache(BucketCache.java:1357)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache$WriterThread.doDrain(BucketCache.java:883)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache$WriterThread.run(BucketCache.java:838)
> > >     at java.lang.Thread.run(Thread.java:748)
> > >
> > > and later
> > >
> > > 2018-02-25 01:13:47,783 INFO [regionserver/ip-10-194-246-70.aolp-ds-dev.us-east-1.ec2.aolcloud.net/10.194.246.70:16020-BucketCacheWriter-4]
> > > bucket.BucketCache: regionserver/ip-10-194-246-70.aolp-ds-dev.us-east-1.ec2.aolcloud.net/10.194.246.70:16020-BucketCacheWriter-4
> > > exiting, cacheEnabled=false
> > > 2018-02-25 01:13:47,864 WARN [regionserver/ip-10-194-246-70.aolp-ds-dev.us-east-1.ec2.aolcloud.net/10.194.246.70:16020-BucketCacheWriter-6]
> > > bucket.FileIOEngine: Failed syncing data to /mnt1/hbase/bucketcache
> > > 2018-02-25 01:13:47,864 ERROR [regionserver/ip-10-194-246-70.aolp-ds-dev.us-east-1.ec2.aolcloud.net/10.194.246.70:16020-BucketCacheWriter-6]
> > > bucket.BucketCache: Failed syncing IO engine
> > > java.nio.channels.ClosedChannelException
> > >     at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
> > >     at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:379)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.sync(FileIOEngine.java:128)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache$WriterThread.doDrain(BucketCache.java:911)
> > >     at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache$WriterThread.run(BucketCache.java:838)
> > >     at java.lang.Thread.run(Thread.java:748)
> > >
> > > ----
> > > Saad