This was indeed caused by the network backend dropping outgoing
packets. I'm still not sure why the loss wasn't "caught" by TCP.

We ended up setting send_queue_size=256 and recv_queue_size=512 for
ib_ipoib, and krcvqs=4 for hfi1. We also updated our Omni-Path switch
firmware to the current version. We still see _some_ dropped packets,
but so far jobs haven't died because of it.
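
In case anyone wants to make those settings persistent, a minimal
sketch via module options (the file name under /etc/modprobe.d/ is
arbitrary, our choice):

    # /etc/modprobe.d/opa-tuning.conf  -- file name chosen by us
    # Larger send/receive rings for IPoIB:
    options ib_ipoib send_queue_size=256 recv_queue_size=512
    # Number of kernel receive queues for the hfi1 driver:
    options hfi1 krcvqs=4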

As far as I can tell, only Spark manifested the problem. The usual
Hadoop MapReduce jobs were running fine.
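
For anyone hitting the same symptoms: it might also be worth making
Spark itself more tolerant of transient fetch failures. A minimal
sketch using standard Spark 2.x settings (the values are illustrative
guesses, not tuned recommendations, and this is not what fixed it for
us):

    import org.apache.spark.sql.SparkSession

    // Sketch: raise the network timeout and shuffle retry limits so a
    // few dropped packets don't immediately fail a fetch.
    val session = SparkSession.builder()
      .appName("link-extraction")                   // hypothetical name
      .config("spark.network.timeout", "300s")      // default: 120s
      .config("spark.shuffle.io.maxRetries", "10")  // default: 3
      .config("spark.shuffle.io.retryWait", "15s")  // default: 5s
      .getOrCreate()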

Best,
        -jhz

On Fri, 18 Aug 2017 19:59:26 +0200
Jan-Hendrik Zab <z...@l3s.de> wrote:

> Hello!
> 
> I'm seeing some weird problems with Spark running on top of YARN
> (Spark 2.2 on Cloudera CDH 5.12).
> 
> There are a lot of "java.net.SocketException: Network is unreachable"
> errors in the executors (part of a log file:
> http://support.l3s.de/~zab/spark-errors.txt), and the jobs also fail
> at rather random times, after producing anywhere from several MB up
> to GBs' worth of the above errors.
> 
> In the driver output I get the following:
> http://support.l3s.de/~zab/spark-errors.txt
> 
> These errors usually go hand in hand with some dropped packets, but I
> would assume that TCP can actually handle that?
> 
> The network backend is based on Intel Omni-Path hardware running in
> connected mode with an MTU of 1500 (just as a safe default at the
> moment).
> 
> The nodes can also ping each other without a problem, and their DNS
> configuration is identical: the same hosts file is deployed to all
> hosts via Ansible, and the same data is configured in the Unbound DNS
> forwarder.
> 
> I have several code snippets that manifest the problem; here is the
> current example:
> 
>      val data = session.read
>        .schema(schema)
>        .option("sep", "\t")
>        .option("header", false)
>        .csv(config.input)
>        .as[LinkRecord]
> 
>      val filtered = data
>        .filter(_.elem == "A@/href")
> 
>      val transformed = filtered
>        .map(e => e.copy(date = e.date.slice(0, 10) + "T00:00:00.000-00:00"))
>        .dropDuplicates(Array("src", "date", "dst"))
> 
>      transformed.write
>        .option("sep", "\t")
>        .option("header", "false")
>        .option("compression", "gzip")
>        .mode(SaveMode.Append)
>        .csv(config.output)
> 
> The input data is roughly 2.1 TB (~500 billion lines, I think) and
> lives on HDFS.
> 
> I'm honestly running out of ideas on how to debug this. I'm half
> thinking that the above errors are just masking the real problem.
> 
> I would greatly appreciate any help!
> 
> ps.
> Please CC me, since I'm not subscribed to the mailing list.
> 
> Kind regards,
> 
>        Jan
> 
> 



-- 
Leibniz Universität Hannover
Institut für Verteilte Systeme
Appelstrasse 4 - 30167 Hannover
Phone:  +49 (0)511 762 - 17706
