[jira] [Commented] (HDFS-7180) NFSv3 gateway frequently gets stuck due to GC

2014-10-30 Thread Eric Zhiqiang Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190032#comment-14190032
 ] 

Eric Zhiqiang Ma commented on HDFS-7180:


The NFSv3 gateway works well for us with the patch. No hangs were found while 
uploading ~400GB of data. Thanks, Brandon and all!
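
For anyone hitting similar hangs before picking up the fix, one way to confirm 
that the gateway JVM is pausing in GC is to watch its collector activity 
directly. A minimal sketch (the process-name pattern, sampling interval, log 
path, heap size, and the HADOOP_NFS3_OPTS hook are illustrative assumptions, 
not values taken from this issue):

# Sample the NFS3 gateway JVM's GC counters every 5 seconds
# (assumes the gateway shows up in jps as "Nfs3")
NFS3_PID=$(jps | awk '/Nfs3/ {print $1}')
jstat -gcutil "$NFS3_PID" 5000

# Or restart the gateway with GC logging and a larger heap via hadoop-env.sh
# (HADOOP_NFS3_OPTS is assumed here; adjust to your deployment)
export HADOOP_NFS3_OPTS="-Xmx4g -Xloggc:/var/log/hadoop/nfs3-gc.log \
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"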

 NFSv3 gateway frequently gets stuck due to GC
 -

 Key: HDFS-7180
 URL: https://issues.apache.org/jira/browse/HDFS-7180
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.5.0
 Environment: Linux, Fedora 19 x86-64
Reporter: Eric Zhiqiang Ma
Assignee: Brandon Li
Priority: Critical
 Fix For: 2.6.0

 Attachments: HDFS-7180.001.patch, HDFS-7180.002.patch, 
 HDFS-7180.003.patch



[jira] [Commented] (HDFS-7180) NFSv3 gateway frequently gets stuck

2014-10-19 Thread Eric Zhiqiang Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176314#comment-14176314
 ] 

Eric Zhiqiang Ma commented on HDFS-7180:


Brandon Li: Thanks. That sounds great to me. I will wait for the patch and 
try it.


[jira] [Commented] (HDFS-7180) NFSv3 gateway frequently gets stuck

2014-10-16 Thread Eric Zhiqiang Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174706#comment-14174706
 ] 

Eric Zhiqiang Ma commented on HDFS-7180:


~brandonli: Not at all, and many thanks for the analysis and confirmation!

I checked the log on 10.0.3.176 and found the following socket timeout 
exception between 10.0.3.172 and 10.0.3.176.

--
2014-10-02 06:00:07,326 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 src: 
/10.0.3.172:37334 dest: /10.0.3.176:50010
2014-10-02 06:00:31,970 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Slow flushOrSync took 24097ms (threshold=300ms), isSync:true, 
flushTotalNanos=9424ns
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception for BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.0.3.176:50010 remote=/10.0.3.172:37334]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:453)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
at java.lang.Thread.run(Thread.java:745)
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder: BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643, 
type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder: BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643, 
type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 
received exception java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for 
read. ch : java.nio.channels.SocketChannel[connected local=/10.0.3.176:50010 
remote=/10.0.3.172:37334]
2014-10-02 06:01:32,093 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
dstore-176:50010:DataXceiver error processing WRITE_BLOCK operation  src: 
/10.0.3.172:37334 dst: /10.0.3.176:50010
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.0.3.176:50010 remote=/10.0.3.172:37334]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at 

[jira] [Created] (HDFS-7180) NFSv3 gateway frequently gets stuck

2014-10-01 Thread Eric Zhiqiang Ma (JIRA)
Eric Zhiqiang Ma created HDFS-7180:
--

 Summary: NFSv3 gateway frequently gets stuck
 Key: HDFS-7180
 URL: https://issues.apache.org/jira/browse/HDFS-7180
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.5.0
 Environment: Linux, Fedora 19 x86-64
Reporter: Eric Zhiqiang Ma
Priority: Critical


We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway on 
one node in the cluster so that users can upload data with rsync.
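
A minimal sketch of this kind of setup for reference (the mount point, upload 
path, and start commands follow the 2.x NFS gateway documentation from memory 
and are illustrative assumptions, not the exact ones used here):

# On the gateway node: start the portmap and nfs3 services
# (the system rpcbind can also serve in place of portmap)
hadoop-daemon.sh start portmap
hadoop-daemon.sh start nfs3

# Mount the export locally with the commonly recommended options
sudo mount -t nfs -o vers=3,proto=tcp,nolock localhost:/ /mnt/hdfs

# Users then upload through the mount with rsync
rsync -av /data/incoming/ /mnt/hdfs/user/someuser/incoming/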

However, we find the NFSv3 daemon seems to frequently get stuck while HDFS 
itself appears to work well (`hdfs dfs -ls` etc. work just fine). The last hang 
we found occurred after around 1 day of running and several hundred GBs of data 
uploaded.

The NFSv3 daemon is started on one node, and the NFS export is mounted on the same node.

From the node where the NFS is mounted:

dmesg shows messages like this:

[1859245.368108] nfs: server localhost not responding, still trying
[1859245.368111] nfs: server localhost not responding, still trying
[1859245.368115] nfs: server localhost not responding, still trying
[1859245.368119] nfs: server localhost not responding, still trying
[1859245.368123] nfs: server localhost not responding, still trying
[1859245.368127] nfs: server localhost not responding, still trying
[1859245.368131] nfs: server localhost not responding, still trying
[1859245.368135] nfs: server localhost not responding, still trying
[1859245.368138] nfs: server localhost not responding, still trying
[1859245.368142] nfs: server localhost not responding, still trying
[1859245.368146] nfs: server localhost not responding, still trying
[1859245.368150] nfs: server localhost not responding, still trying
[1859245.368153] nfs: server localhost not responding, still trying

The mounted directory cannot be listed with `ls`, and `df -hT` gets stuck too.

The latest lines from the nfs3 log in the hadoop logs directory:

2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
user map size: 35
2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
group map size: 54
2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:53:23,809 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:53:24,508 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:55:57,334 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:57:07,428 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:58:32,609 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update 
cache now
2014-10-02 05:58:32,610 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not doing 
static UID/GID mapping because '/etc/nfs.map' does not exist.
2014-10-02 05:58:32,620 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
user map size: 35
2014-10-02 05:58:32,628 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
group map size: 54
2014-10-02 06:01:32,098 WARN org.apache.hadoop.hdfs.DFSClient: Slow 
ReadProcessor read fields took 60062ms (threshold=3ms); ack: seqno: -2 
status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: 
[10.0.3.172:50010, 10.0.3.176:50010]
2014-10-02 06:01:32,099 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream 
ResponseProcessor exception  for block 
BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643
java.io.IOException: Bad response ERROR for block 
BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 from datanode 
10.0.3.176:50010
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:828)
2014-10-02 06:07:00,368 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery 
for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 in 
pipeline 10.0.3.172:50010, 10.0.3.176:50010: bad datanode 10.0.3.176:50010

Any help will be appreciated. Thanks in advance.

[jira] [Updated] (HDFS-7180) NFSv3 gateway frequently gets stuck

2014-10-01 Thread Eric Zhiqiang Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Zhiqiang Ma updated HDFS-7180:
---
Description: 
We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway on 
one node in the cluster so that users can upload data with rsync.

However, we find the NFSv3 daemon seems to frequently get stuck while HDFS 
itself appears to work well (`hdfs dfs -ls` etc. work just fine). The last hang 
we found occurred after around 1 day of running and several hundred GBs of data 
uploaded.

The NFSv3 daemon is started on one node, and the NFS export is mounted on the same node.

From the node where the NFS is mounted:

dmesg shows messages like this:

[1859245.368108] nfs: server localhost not responding, still trying
[1859245.368111] nfs: server localhost not responding, still trying
[1859245.368115] nfs: server localhost not responding, still trying
[1859245.368119] nfs: server localhost not responding, still trying
[1859245.368123] nfs: server localhost not responding, still trying
[1859245.368127] nfs: server localhost not responding, still trying
[1859245.368131] nfs: server localhost not responding, still trying
[1859245.368135] nfs: server localhost not responding, still trying
[1859245.368138] nfs: server localhost not responding, still trying
[1859245.368142] nfs: server localhost not responding, still trying
[1859245.368146] nfs: server localhost not responding, still trying
[1859245.368150] nfs: server localhost not responding, still trying
[1859245.368153] nfs: server localhost not responding, still trying

The mounted directory cannot be listed with `ls`, and `df -hT` gets stuck too.

The latest lines from the nfs3 log in the hadoop logs directory:

2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
user map size: 35
2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
group map size: 54
2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:53:23,809 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:53:24,508 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:55:57,334 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:57:07,428 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have 
to change stable write to unstable write:FILE_SYNC
2014-10-02 05:58:32,609 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update 
cache now
2014-10-02 05:58:32,610 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not doing 
static UID/GID mapping because '/etc/nfs.map' does not exist.
2014-10-02 05:58:32,620 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
user map size: 35
2014-10-02 05:58:32,628 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
group map size: 54
2014-10-02 06:01:32,098 WARN org.apache.hadoop.hdfs.DFSClient: Slow 
ReadProcessor read fields took 60062ms (threshold=3ms); ack: seqno: -2 
status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: 
[10.0.3.172:50010, 10.0.3.176:50010]
2014-10-02 06:01:32,099 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream 
ResponseProcessor exception  for block 
BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643
java.io.IOException: Bad response ERROR for block 
BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 from datanode 
10.0.3.176:50010
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:828)
2014-10-02 06:07:00,368 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery 
for block BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 in 
pipeline 10.0.3.172:50010, 10.0.3.176:50010: bad datanode 10.0.3.176:50010

The logs seem to suggest that 10.0.3.176 is bad. However, according to 
`hdfs dfsadmin -report`, all nodes in the cluster appear to be working.
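
A sketch of the kind of per-node check described above (the log path below is 
an assumed default location, not taken from this cluster):

# Cluster-wide datanode status as seen by the namenode
hdfs dfsadmin -report

# On the suspected datanode (10.0.3.176), look for errors around the failed block
grep 'blk_1074363564' /path/to/hadoop/logs/hadoop-*-datanode-*.log | tail -n 50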

Any help will be appreciated. Thanks in advance.
