[ 
https://issues.apache.org/jira/browse/HBASE-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-17501:
-----------------------------------
    Fix Version/s:     (was: 1.4.0)

> NullPointerException after Datanodes Decommissioned and Terminated
> ------------------------------------------------------------------
>
>                 Key: HBASE-17501
>                 URL: https://issues.apache.org/jira/browse/HBASE-17501
>             Project: HBase
>          Issue Type: Bug
>          Components: Filesystem Integration, Operability
>    Affects Versions: 1.2.0
>         Environment: CentOS Derivative with a derivative of the 3.18.43 
> kernel.  HBase on CDH5.9.0 with some patches.  HDFS CDH 5.9.0 with no patches.
>            Reporter: Patrick Dignan
>            Assignee: James Moore
>            Priority: Minor
>             Fix For: 2.0.0, 1.3.1, 1.2.6
>
>         Attachments: HBASE_17501.patch, HBASE_17501.patch, 
> HBASE_17501.patch.v2, HBASE_17501.patch.v3, HBASE_17501.patch.v4, 
> HBASE_17501.v3
>
>
> We recently encountered an interesting NullPointerException in HDFS that 
> bubbles up to HBase, and is resolved by restarting the regionserver.  The 
> issue was exhibited while we were replacing a set of nodes in one of our 
> clusters with a new set.  We did the following:
> 1. Turn off the HBase balancer
> 2. Gracefully move the regions off the nodes we’re shutting off using a tool 
> we wrote to do so
> 3. Decommission the datanodes using the HDFS exclude hosts file and hdfs 
> dfsadmin -refreshNodes
> 4. Wait for the datanodes to decommission fully
> 5. Terminate the VMs the instances are running inside.
> A few notes.  We did not shut down the datanode processes, and the nodes were 
> therefore not marked as dead by the namenode.  We simply terminated the 
> datanode VM (in this case an AWS instance).  The nodes were marked as 
> decommissioned.  We are running our clusters with DNS, and when we terminate 
> VMs, the associated CNAME is removed and no longer resolves.  The errors do 
> not seem to resolve without a restart.
> After we did this, the remaining regionservers started throwing 
> NullPointerExceptions with the following stack trace:
> 2017-01-19 23:09:05,638 DEBUG org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.RW.fifo.Q.read.handler=80,queue=14,port=60020: callId: 1727723891 service: ClientService methodName: Scan size: 216 connection: 172.16.36.128:31538
> java.io.IOException
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2214)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:204)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1564)
>     at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
>     at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1434)
>     at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1682)
>     at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1542)
>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:445)
>     at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
>     at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
>     at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:199)
>     at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:343)
>     at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:198)
>     at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2106)
>     at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2096)
>     at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5544)
>     at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2569)
>     at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2555)
>     at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2536)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2405)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33738)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
>     ... 3 more
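
The trace above shows a NullPointerException from a stream seek reaching the RPC layer as a bare java.io.IOException with no message. As a rough illustration of one defensive pattern for that situation, the sketch below converts an NPE raised during a seek into a descriptive IOException and retries once on a freshly opened stream. It is not taken from the attached patches and may not match how HBase handles this. SeekableSource, SourceFactory, guardedSeek and seekWithReopen are hypothetical names invented for the example, not HDFS or HBase APIs.

```java
import java.io.IOException;

public class NpeGuardSketch {
    /** Hypothetical stand-in for a seekable input stream; not an HDFS or HBase type. */
    interface SeekableSource {
        void seek(long pos) throws IOException;
    }

    /** Hypothetical factory that reopens the underlying file and returns a fresh stream. */
    interface SourceFactory {
        SeekableSource open() throws IOException;
    }

    /** Seeks, turning an unexpected NPE into an IOException that names the offset. */
    static void guardedSeek(SeekableSource in, long pos) throws IOException {
        try {
            in.seek(pos);
        } catch (NullPointerException npe) {
            // Keep the NPE as the cause so the original stack trace is not lost.
            throw new IOException("NPE while seeking to " + pos + "; stream may be stale", npe);
        }
    }

    /** Tries the current stream first; on failure, reopens once and retries the seek. */
    static SeekableSource seekWithReopen(SourceFactory factory, SeekableSource current, long pos)
            throws IOException {
        try {
            guardedSeek(current, pos);
            return current;
        } catch (IOException first) {
            SeekableSource fresh = factory.open();
            guardedSeek(fresh, pos); // a second failure propagates to the caller
            return fresh;
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulated streams: one that always throws an NPE, one that always succeeds.
        SeekableSource stale = pos -> { throw new NullPointerException("simulated stale stream"); };
        SeekableSource healthy = pos -> { };
        SeekableSource used = seekWithReopen(() -> healthy, stale, 1434L);
        System.out.println(used == healthy ? "recovered with a fresh stream" : "kept the original stream");
    }
}
```

Whether reopening the stream would actually clear the condition described in this report is an assumption; the reporter only states that restarting the regionserver resolves it.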



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
