Hello!

This might not be the right list for this issue; I already tried user@
with the same problem (but a bit less information), to no avail.

So I'm hoping someone here can point me in the right direction.

We're running Spark 2.2 on CDH 5.13 (Hadoop 2.6 with patches), and a
lot of our jobs fail, even very simple ones. For instance: [0]
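
The failing jobs are roughly of this shape (a simplified, illustrative
sketch only; the input glob, the map step and the output path here are
placeholders, the actual script is [0]):

import org.apache.spark.{SparkConf, SparkContext}

object LinkConverterSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("link-converter-sketch"))

    // Read the gzipped part files and apply a trivial per-line
    // transformation (the real conversion lives in link_converter.scala [0]).
    sc.textFile("/data/ia/derivatives/de/links/TA/*.gz")
      .map(_.trim)                      // placeholder for the actual conversion
      .saveAsTextFile("/tmp/links-out") // placeholder output path

    sc.stop()
  }
}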

We see two kinds of "errors". In the first, a task is actually marked
as failed in the web UI [1]; basically:

org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
BP-1464899749-10.10.0.1-1378382027840:blk_1084837359_1099539407729
file=/data/ia/derivatives/de/links/TA/part-68879.gz

See [1] for the full stack trace.

When I check the block via "hdfs fsck -blockId blk_1084837359",
everything looks fine; I can also `-cat' the data into `wc', and it is
a valid GZIP file.

The other kind of "error" we are getting is [2]:

DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 
2648.1420920453265 msec.
BlockReaderFactory: I/O error constructing remote block reader.
java.net.SocketException: Network is unreachable
DFSClient: Failed to connect to /10.12.1.26:50010 for block, add to
deadNodes and continue. java.net.SocketException: Network is unreachable

These are logged in the stderr of _some_ of the executors.

I know that both of these look (at least to me) more like a problem
with HDFS and/or CDH. However, we also tried reading the data via
mapred jobs that essentially just opened the GZIP files manually, read
them, and printed some status info, and those didn't produce any
errors (a sketch of that read test follows after the log excerpt
below). The only thing we noticed was that the read() call sometimes
apparently stalled for several minutes, but we couldn't identify a
cause so far. We also didn't see any errors in the CDH logs, except
perhaps the following informational messages:

Likely the client has stopped reading, disconnecting it 
(node24.ib:50010:DataXceiver error processing READ_BLOCK operation  src: 
/10.12.1.20:46518 dst: /10.12.1.24:50010); java.net.SocketTimeoutException: 
120004 millis timeout while waiting for channel to be ready for write. ch : 
java.nio.channels.SocketChannel[connected local=/10.12.1.24:50010 
remote=/10.12.1.20:46518]
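
For reference, the per-file read in that mapred test was essentially of
this form (a simplified sketch of the idea, not the exact code; it just
streams each GZIP part file to the end and prints how many bytes were
read):

import java.util.zip.GZIPInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsGzipReadTest {
  def main(args: Array[String]): Unit = {
    val fs  = FileSystem.get(new Configuration())
    val dir = new Path("/data/ia/derivatives/de/links/TA")
    val buf = new Array[Byte](64 * 1024)

    for (st <- fs.listStatus(dir) if st.getPath.getName.endsWith(".gz")) {
      val in = new GZIPInputStream(fs.open(st.getPath))
      var total = 0L
      var n = in.read(buf)
      while (n >= 0) {        // this read() is what occasionally stalls for minutes
        total += n
        n = in.read(buf)
      }
      in.close()
      println(s"${st.getPath.getName}: read $total bytes")
    }
  }
}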

All the systems (masters and nodes) can reach each other on the
(InfiniBand) network, and they communicate only over that one network
(i.e. the datanodes bind to only one IP). The /etc/hosts files are
identical on all systems and were distributed via Ansible, and all
systems also use a central DNS server with the same data (including
PTR resolution).

The cluster has 37 nodes and 2 masters.

Suggestions are very welcome. :-)

[0] - http://www.l3s.de/~zab/link_converter.scala
[1] - http://www.l3s.de/~zab/spark-errors-2.txt
[2] - http://www.l3s.de/~zab/spark-errors.txt

Best,
        -jhz
