[ https://issues.apache.org/jira/browse/HBASE-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Purtell updated HBASE-17501:
-----------------------------------
    Fix Version/s:     (was: 1.4.0)

> NullPointerException after Datanodes Decommissioned and Terminated
> ------------------------------------------------------------------
>
>                 Key: HBASE-17501
>                 URL: https://issues.apache.org/jira/browse/HBASE-17501
>             Project: HBase
>          Issue Type: Bug
>          Components: Filesystem Integration, Operability
>    Affects Versions: 1.2.0
>         Environment: CentOS Derivative with a derivative of the 3.18.43
> kernel. HBase on CDH5.9.0 with some patches. HDFS CDH 5.9.0 with no patches.
>            Reporter: Patrick Dignan
>            Assignee: James Moore
>            Priority: Minor
>             Fix For: 2.0.0, 1.3.1, 1.2.6
>
>         Attachments: HBASE_17501.patch, HBASE_17501.patch,
> HBASE_17501.patch.v2, HBASE_17501.patch.v3, HBASE_17501.patch.v4,
> HBASE_17501.v3
>
>
> We recently encountered an interesting NullPointerException in HDFS that
> bubbles up to HBase and is resolved by restarting the regionserver. The
> issue appeared while we were replacing a set of nodes in one of our
> clusters with a new set. We did the following:
> 1. Turn off the HBase balancer
> 2. Gracefully move the regions off the nodes we're shutting off using a tool
> we wrote to do so
> 3. Decommission the datanodes using the HDFS exclude hosts file and hdfs
> dfsadmin -refreshNodes
> 4. Wait for the datanodes to decommission fully
> 5. Terminate the VMs the instances are running inside.
> A few notes. We did not shut down the datanode processes, and the nodes were
> therefore not marked as dead by the namenode. We simply terminated the
> datanode VM (in this case an AWS instance). The nodes were marked as
> decommissioned. We are running our clusters with DNS, and when we terminate
> VMs, the associated CName is removed and no longer resolves. The errors do
> not seem to resolve without a restart.
> After we did this, the remaining regionservers started throwing
> NullPointerExceptions with the following stack trace:
> 2017-01-19 23:09:05,638 DEBUG org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.RW.fifo.Q.read.handler=80,queue=14,port=60020: callId: 1727723891 service: ClientService methodName: Scan size: 216 connection: 172.16.36.128:31538
> java.io.IOException
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2214)
>       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
>       at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:204)
>       at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
> Caused by: java.lang.NullPointerException
>       at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1564)
>       at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
>       at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1434)
>       at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1682)
>       at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1542)
>       at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:445)
>       at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
>       at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
>       at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
>       at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
>       at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:199)
>       at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:343)
>       at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:198)
>       at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2106)
>       at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2096)
>       at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5544)
>       at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2569)
>       at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2555)
>       at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2536)
>       at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2405)
>       at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33738)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
>       ... 3 more

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
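
The trace above shows a NullPointerException raised inside a stream seek and surfacing to the client as a bare java.io.IOException. As a rough illustration only, a caller could guard such a seek by reopening the stream once and retrying, then reporting a descriptive IOException if that also fails. Every name below (GuardedSeek, SeekableStream, StreamFactory, seekWithReopen) is hypothetical; this is not HBase or HDFS code and not the attached patch, and the report does not establish that reopening a stream would clear the underlying state.

```java
import java.io.IOException;

// Hypothetical sketch; none of these types exist in HBase or HDFS.
class GuardedSeek {
    /** Stand-in for a seekable input stream. */
    interface SeekableStream {
        void seek(long pos) throws IOException;
    }

    /** Supplies a fresh stream, for example by reopening the file. */
    interface StreamFactory {
        SeekableStream open() throws IOException;
    }

    /**
     * Seeks on the current stream. If that throws NullPointerException,
     * opens a fresh stream once and retries. A second NullPointerException
     * is wrapped in an IOException that names the failing offset.
     * Returns the stream that served the seek.
     */
    static SeekableStream seekWithReopen(SeekableStream current,
                                         StreamFactory factory,
                                         long pos) throws IOException {
        try {
            current.seek(pos);
            return current;
        } catch (NullPointerException first) {
            SeekableStream fresh = factory.open();
            try {
                fresh.seek(pos);
            } catch (NullPointerException second) {
                throw new IOException("seek to " + pos + " failed after reopen", second);
            }
            return fresh;
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulated stream whose internal state has gone bad.
        SeekableStream broken = pos -> { throw new NullPointerException("stale state"); };
        long[] lastPos = {-1L};
        // Simulated freshly opened stream that records the requested offset.
        SeekableStream healthy = pos -> lastPos[0] = pos;

        SeekableStream used = seekWithReopen(broken, () -> healthy, 42L);
        System.out.println("served by fresh stream: " + (used == healthy));
        System.out.println("offset seen: " + lastPos[0]);
    }
}
```

Wrapping the second failure keeps the original exception as the cause, so the log still points at the failing frame while the message says which offset was being read.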