Hi folks,

I'm experiencing the exact symptoms of HDFS-770 
(https://issues.apache.org/jira/browse/HDFS-770) using Spark and a basic HDFS 
deployment. Everything is running locally on a single machine. My HDFS 
deployment consists of a single 8 TB disk with replication disabled; otherwise 
everything is vanilla Hadoop 2.7.3. My Spark job uses a Hive ORC writer to 
write a dataset to disk. The dataset itself is < 100 GB uncompressed, ~17 GB 
compressed.
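
For context, the write itself is essentially the standard Spark ORC path. A 
simplified sketch of what the job does (output path and mode here are 
illustrative, not my actual code):

    // "df" is the dataset described above; format("orc") goes through
    // the Hive ORC writer
    df.write
      .format("orc")
      .mode("overwrite")
      .save("hdfs://localhost:9000/data/output")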

It does not appear to be a Spark issue. The DataNode's log shows it receiving 
the first ~500 packets for a block, then nothing for a minute, at which point 
the default channel read timeout of 60000 ms triggers the exception:

2016-12-19 18:36:50,632 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock BP-1695049761-192.168.2.211-1479228275669:blk_1073957413_216632 
received exception java.net.SocketTimeoutException: 60000 millis timeout while 
waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 
remote=/127.0.0.1:55866]
2016-12-19 18:36:50,632 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
lamport.grierforensics.com:50010:DataXceiver error processing WRITE_BLOCK 
operation  src: /127.0.0.1:55866 dst: /127.0.0.1:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/127.0.0.1:50010 remote=/127.0.0.1:55866]
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        ...

On the Spark side, all is well until the DataNode's socket timeout surfaces as 
a DFSOutputStream ResponseProcessor exception in the DFSClient, after which the 
write aborts because all datanodes are reported bad:

2016-12-19 18:36:59.014 WARN DFSClient: DFSOutputStream ResponseProcessor 
exception  for block 
BP-1695049761-192.168.2.211-1479228275669:blk_1073957413_216632
java.io.EOFException: Premature EOF: no length prefix available
        at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)

...
Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. 
Aborting...
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1206)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1004)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:548)

I haven't tried adjusting the timeout yet, for the same reason given by the 
reporter of HDFS-770: I'm running everything locally with no other tasks on the 
system, so why would I need a socket read timeout greater than 60 seconds? I 
haven't observed any CPU, memory, or disk bottlenecks.
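
If I do end up raising it, my understanding is that it would mean bumping the 
usual pair of HDFS timeouts in hdfs-site.xml on both the client and the 
DataNode, e.g. (values picked arbitrarily for illustration):

    dfs.client.socket-timeout           300000    (channel read timeout, default 60000 ms)
    dfs.datanode.socket.write.timeout   600000    (write timeout, default 480000 ms)

But that feels like masking whatever is stalling the write pipeline rather than 
fixing it.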

Lowering the number of cores used by Spark alleviates the problem but doesn't 
eliminate it, which led me to believe the issue may be disk contention (i.e. 
too many concurrent client writers?). Again, though, I haven't observed any 
disk I/O bottleneck at all.
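
The only other mitigation I can think of is capping the number of concurrent 
writers directly, e.g. coalescing before the write. A rough sketch (partition 
count and output path are arbitrary, not my actual job):

    // fewer output partitions => fewer simultaneous block writers
    // hitting the single local DataNode
    df.coalesce(8)
      .write
      .format("orc")
      .save("hdfs://localhost:9000/data/output")

That just trades throughput for stability, though, without explaining the stall.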

Does anyone else still experience HDFS-770 
(https://issues.apache.org/jira/browse/HDFS-770), and is there a general 
approach or solution?

Thanks

---
Joe Naegele
Grier Forensics


