nijel created HDFS-9046:
---------------------------
Summary: Any Error during BPOfferService run can lead to Missing DN.
Key: HDFS-9046
URL: https://issues.apache.org/jira/browse/HDFS-9046
Project: Hadoop HDFS
Issue Type: Bug
Reporter: nijel
Assignee: nijel
The cluster is in HA mode and each DN has only one block pool.
The issue is that, after a switch, one DN is missing from the current active NN.
Upon analysis I found that there is an exception in BPOfferService.run():
{noformat}
2015-08-21 09:02:11,190 | WARN | DataNode:
[[[DISK]file:/srv/BigData/hadoop/data5/dn/
[DISK]file:/srv/BigData/hadoop/data4/dn/]] heartbeating to
160-149-0-114/160.149.0.114:25000 | Unexpected exception in block pool Block
pool BP-284203724-160.149.0.114-1438774011693 (Datanode Uuid
15ce1dd7-227f-4fd2-9682-091aa6bc2b89) service to
160-149-0-114/160.149.0.114:25000 | BPServiceActor.java:830
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:172)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:221)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:1887)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:669)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:616)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:856)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:671)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:822)
at java.lang.Thread.run(Thread.java:745)
{noformat}
After this, that particular BPOfferService stays down for the rest of the run,
and the corresponding NN no longer has the details of this DN.
Similar issues are discussed in the following JIRAs:
https://issues.apache.org/jira/browse/HDFS-2882
https://issues.apache.org/jira/browse/HDFS-7714
Can we retry in this case as well, with a larger interval, instead of shutting down
this BPOfferService?
Since these exceptions can occur randomly in a DN, I think it is not good to keep
the DN running when some NN does not have its info.
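To make the retry proposal concrete, here is a minimal, self-contained sketch of a service loop that catches any Throwable and retries with a growing interval instead of exiting. It is not the actual BPServiceActor code: the class RetryingActorSketch, the Service interface, and the methods runWithRetry and nextBackoffMs are hypothetical names used only for illustration, and the attempt limit exists only so the example terminates. Whether retrying after an OutOfMemoryError is safe is exactly the open question raised above.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of "retry with a larger interval"; not real HDFS code. */
class RetryingActorSketch {

    /** Stands in for one run of the per-NameNode service loop (assumed name). */
    interface Service {
        void offerService() throws Exception;
    }

    /** Doubles the wait after each failure, capped at maxMs. */
    static long nextBackoffMs(long currentMs, long maxMs) {
        return Math.min(maxMs, currentMs * 2);
    }

    /**
     * Calls the service until it returns normally or maxAttempts is used up.
     * Returns true if some attempt succeeded, false if we gave up or were interrupted.
     */
    static boolean runWithRetry(Service service, long initialMs, long maxMs, int maxAttempts) {
        long waitMs = initialMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                service.offerService();
                return true;
            } catch (Throwable t) {
                // Throwable also covers Errors such as OutOfMemoryError,
                // which a plain "catch (Exception e)" would let escape.
                System.err.println("attempt " + attempt + " failed: " + t);
                if (attempt == maxAttempts) {
                    break; // no point sleeping after the last attempt
                }
                try {
                    Thread.sleep(waitMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
                waitMs = nextBackoffMs(waitMs, maxMs);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated service: fails twice with the error from the log, then succeeds.
        boolean ok = runWithRetry(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new OutOfMemoryError("unable to create new native thread");
            }
        }, 10, 40, 5);
        System.out.println("recovered=" + ok + " after " + calls.get() + " attempts");
    }
}
```

In a real DataNode the loop would presumably run until shutdown is requested rather than stop after a fixed number of attempts, and the wait cap would need to be chosen so that the NN does not declare the DN dead in the meantime.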