Hi,
I have a cluster of 5 nodes with one large table that currently has around
12,000 regions. Everything was working fine for a relatively long time, until
now.
Yesterday I significantly reduced the TTL on the table and initiated a major
compaction. This should have shrunk the table to about 20% of its original
size.
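
For context, the change amounted to roughly the following (a minimal sketch
against the Java admin API as it looked around HBase 0.90; the 7-day TTL
value is purely illustrative, not the one I actually set):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ReduceTtlAndCompact {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // New descriptor for the 'events' family with a lower TTL
        // (7 days here is an illustrative value only).
        HColumnDescriptor events = new HColumnDescriptor("events");
        events.setTimeToLive(7 * 24 * 60 * 60);

        // Schema changes require the table to be offline in this version.
        admin.disableTable("gs_raw_events");
        admin.modifyColumn("gs_raw_events", "events", events);
        admin.enableTable("gs_raw_events");

        // Major compaction rewrites the store files and drops expired cells.
        admin.majorCompact("gs_raw_events");
    }
}
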
Today I'm getting errors about inaccessible files on HDFS, for example:
java.io.IOException: Got error in response to OP_READ_BLOCK self=/10.1.104.2:58047, remote=/10.1.104.2:50010 for file /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360 for block 3674866614142268536_674205
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.rawReadInt(BlockDecompressorStream.java:128)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:68)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1433)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:139)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:96)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:77)
        at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1341)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.<init>(HRegion.java:2269)
        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateInternalScanner(HRegion.java:1126)
        at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1118)
        at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1102)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1781)
        at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)

I checked, and the file indeed doesn't exist on HDFS (the check itself is
sketched after the log excerpt below). Here are the NameNode logs for this
block; apparently it was deleted:
2011-06-19 21:39:36,651 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp/2096863423111131624. blk_3674866614142268536_674205
2011-06-19 21:40:11,954 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.2:50010 is added to blk_3674866614142268536_674205 size 67108864
2011-06-19 21:40:11,954 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.3:50010 is added to blk_3674866614142268536_674205 size 67108864
2011-06-19 21:40:11,955 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.5:50010 is added to blk_3674866614142268536_674205 size 67108864
2011-06-29 20:20:01,662 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.1.104.2:50010 to delete  blk_3674866614142268536_674205
2011-06-29 20:20:13,671 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.1.104.5:50010 to delete  blk_-4056387895369608597_675174
blk_-5017882805850873821_672281 blk_702373987100607684_672288
blk_-5357157478043290010_668506 blk_7118175133735412789_674903
blk_-3569812563715986384_675231 blk_8296855057240604851_669285
blk_-6483679172530609101_674268 blk_8738539715363739108_673682
blk_1744841904626813502_675238 blk_-6035315106100051103_674266
blk_-1789501623010070237_674908 blk_1944054629336265129_673689
blk_3674866614142268536_674205 blk_7930425446738143892_647410
blk_-3007186753042268449_669296 blk_-5482302621772778061_647416
blk_-3765735404924932181_672004 blk_7476090998956811081_675169
blk_7862291659285127712_646890 blk_-2666244746343584727_672013
blk_6039172613960915602_674206 blk_-8470884397893086564_646899
blk_4558230221166712802_668510
2011-06-29 20:20:46,698 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.1.104.3:50010 to delete  blk_-7851606440036350812_671552
blk_9214649160203453845_647566 blk_702373987100607684_672288
blk_5958099369749234073_668143 blk_-5172218034084903173_673109
blk_-2934555181472719276_646476 blk_-1409986679370073931_672552
blk_-2786034325506235869_669086 blk_3674866614142268536_674205
blk_510158930393283118_673225 blk_916244738216205237_677068
blk_-4317027806407316617_670379 blk_8555705688850972639_673485
blk_-3765735404924932181_672004 blk_-5482302621772778061_647416
blk_-2461801145731752623_674605 blk_-8737702908048998927_672549
blk_-8470884397893086564_646899 blk_4558230221166712802_668510
blk_-4056387895369608597_675174 blk_-8675430610673886073_647695
blk_-6642870230256028318_668211 blk_-3890408516362176771_677483
blk_-3569812563715986384_675231 blk_-5007142629771321873_674548
blk_-3345355191863431669_667066 blk_8296855057240604851_669285
blk_-6595462308187757470_672420 blk_-2583945228783203947_674607
blk_-346988625120916345_677063 blk_4449525876338684218_674496
blk_2617172363857549730_668201 blk_8738539715363739108_673682
blk_-208904675456598428_679286 blk_-497549341281882641_646477
blk_-6035315106100051103_674266 blk_-2356539038067297411_672388
blk_-3881703084497103249_668137 blk_2214397881104950315_646643
blk_-5907671443455357710_673223 blk_-2431880309956605679_669204
blk_6039172613960915602_674206 blk_5053643911633142711_669194
blk_-2636977729205236686_674664
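
For completeness, the existence check I did amounts to something like this (a
minimal sketch using the HDFS FileSystem API; running "hadoop fs -ls" on the
same path shows the same thing):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckStoreFile {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml / hdfs-site.xml are on the classpath,
        // so the default FileSystem points at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The store file the region server fails to read.
        Path storeFile = new Path(
            "/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360");

        // Prints "exists: false" here, i.e. the file really is gone.
        System.out.println(storeFile + " exists: " + fs.exists(storeFile));
    }
}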

I assume the file loss is somehow related to the TTL change and the major
compaction that followed, because the same scan that is failing now was
working fine yesterday, and that is the only change that happened on the
cluster. Any suggestions on what to do now?

Thanks.

-eran
