RE: java.io.IOException on Namenode logs

2017-07-03 Thread Brahma Reddy Battula
Hi Nishant Verma

It would be great if you could mention which version of Hadoop you are using.

Apart from your findings (which I appreciate) and what Daemeon mentioned, you 
can also check the following:


1)  Non-DFS used space is high (you can check the NameNode UI, the dfsadmin 
report, or JMX)

2)  Scheduled blocks are high (you can check JMX)

If possible, also enable the debug logs, which can give useful info.
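
A minimal sketch of how to run those checks, assuming a Hadoop 2.x layout with
the NameNode HTTP UI on its default port 50070 (adjust host and port for your
cluster; the exact JMX field names vary slightly between versions):

    # Non-DFS used and remaining space per datanode
    hdfs dfsadmin -report

    # Block and replication state of the NameNode over JMX
    curl 'http://<namenode-host>:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'

    # Enable DEBUG on the block placement policy, as the WARN message suggests
    hadoop daemonlog -setlevel <namenode-host>:50070 \
        org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy DEBUG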


--Brahma Reddy Battula

From: daemeon reiydelle [mailto:daeme...@gmail.com]
Sent: 04 July 2017 01:04
To: Nishant Verma
Cc: user
Subject: Re: java.io.IOException on Namenode logs

A possibility is that the node showing errors was not able to get a TCP 
connection, or there was heavy network congestion, or (possibly) heavy garbage 
collection timeouts. I would suspect the network.
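
Two quick checks along those lines (only a sketch; it assumes the Hadoop 2.x
default datanode data-transfer port 50010, and that you know the pid of the
NameNode or DataNode JVM):

    # Watch GC activity; long or frequent full GCs point to GC pauses
    jstat -gcutil <jvm-pid> 1000

    # Check that the client and NameNode can reach each datanode's data-transfer port
    nc -vz <datanode-host> 50010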
...
There is no sin except stupidity - Oscar Wilde
...
Daemeon (Dæmœn) Reiydelle
USA 1.415.501.0198

On Jul 3, 2017 12:27 AM, "Nishant Verma" 
<nishant.verma0...@gmail.com> wrote:
Hello

I have Kafka Connect writing records to my HDFS cluster, which has 3 
datanodes. Last night I observed data loss in the records committed to HDFS. 
There was no issue on the Kafka Connect side; however, the Namenode shows the 
error logs below:

java.io.IOException: File /topics/+tmp/testTopic/year=2017/month=07/day=03/hour=03/8237cfb7-2b3d-4d5c-ab04-924c0f647cd6_tmp could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1571)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:725)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy


Before every occurrence of the above error, we see the line below:
2017-07-02 23:33:43,255 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
10.1.2.3:4982 Call#274492 Retry#0

10.1.2.3 is one of the Kafka Connect nodes.


I checked the following (a command sketch for confirming these values follows the list):

- There is no disk issue on the datanodes; each datanode has 110 GB of space 
left.
- The dfsadmin report shows 3 live datanodes.
- dfs.datanode.du.reserved is at its default value, i.e. 0.
- dfs.replication is set to 3.
- dfs.datanode.handler.count is at its default value, i.e. 10.
- dfs.datanode.data.dir.perm is at its default value, i.e. 700, but a single 
user is used everywhere, so there should be no permission issue. Also, writes 
worked correctly for 22 hours and the errors only started after the 22nd hour.
- I could not find any errors around this timestamp in the datanode logs.
- The disk where dfs.data.dir points has 64% space available.
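
For reference, a sketch of how the configuration values above can be confirmed
from the command line (hdfs getconf prints the effective value of a key; the
dfsadmin report field names vary slightly by version):

    hdfs getconf -confKey dfs.replication
    hdfs getconf -confKey dfs.datanode.du.reserved
    hdfs getconf -confKey dfs.datanode.handler.count
    hdfs dfsadmin -report | grep -E 'Live datanodes|DFS Remaining|Non DFS Used'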

What could be the cause of this error and how can it be fixed? Why does it say 
the file could only be replicated to 0 nodes when it also says there are 3 
datanodes available?

Thanks
Nishant


