This was indeed caused by the network backend dropping several outgoing packets. I'm still not sure why that wasn't "caught" by TCP.
We ended up setting send_queue_size=256 and recv_queue_size=512 for ib_ipoib, and krcvqs=4 for hfi1. We also updated our OmniPath switch firmware to the current version. We still see _some_ dropped packets, but so far no jobs have died because of it. As far as I can tell, it was also only Spark manifesting the problem; the usual Hadoop mapred jobs were running fine.

Best,
-jhz

On Fri, 18 Aug 2017 19:59:26 +0200
Jan-Hendrik Zab <z...@l3s.de> wrote:

> Hello!
>
> I've got some weird problems with Spark running on top of Yarn
> (Spark 2.2 on Cloudera CDH 5.12).
>
> There are a lot of "java.net.SocketException: Network is unreachable"
> errors in the executors; part of a log file:
> http://support.l3s.de/~zab/spark-errors.txt
> The jobs also fail at rather random times with anywhere from several
> MBs up to GBs worth of the above errors.
>
> In the driver output I get the following:
> http://support.l3s.de/~zab/spark-errors.txt
>
> These errors usually go hand in hand with some dropped packets, but I
> would assume that TCP can actually handle that?
>
> The network backend is based on Intel OmniPath hardware running in
> connected mode with an MTU of 1500 (just as a safe default at the
> moment).
>
> The nodes can also ping each other without a problem, and their DNS
> configuration is the same: the same hosts file is deployed to all
> hosts via Ansible, and the same data is configured in the unbound DNS
> forwarder.
>
> I've several code snippets that manifest the problem; a current
> example:
>
>     val data = session.read
>       .schema(schema)
>       .option("sep", "\t")
>       .option("header", false)
>       .csv(config.input)
>       .as[LinkRecord]
>
>     val filtered = data
>       .filter(_.elem === "A@/href")
>
>     val transformed = filtered
>       .map(e => e.copy(date = e.date.slice(0, 10) + "T00:00:00.000-00:00"))
>       .dropDuplicates(Array("src", "date", "dst"))
>
>     transformed.write
>       .option("sep", "\t")
>       .option("header", "false")
>       .option("compression", "gzip")
>       .mode(SaveMode.Append)
>       .csv(config.output)
>
> The input data is roughly 2.1TB (~500 billion lines, I think) and
> lives on HDFS.
>
> I'm honestly running out of ideas on how to debug this problem. I'm
> half thinking that the above errors are just masking the real
> problem.
>
> I would greatly appreciate any help!
>
> ps.
> Please CC me, since I'm not subscribed to the mailing list.
>
> Kind regards,
>
> Jan

-- 
Leibniz Universität Hannover
Institut für Verteilte Systeme
Appelstrasse 4 - 30167 Hannover
Phone: +49 (0)511 762 - 17706

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
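For anyone trying to reproduce the snippet quoted above: the LinkRecord case class and the schema value were not included in the original mail. The following is only a minimal sketch of what they might look like, with the field names (src, elem, dst, date) inferred from the filter, map and dropDuplicates calls; the real definitions may well differ.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Hypothetical record type; field names are inferred from the quoted
    // snippet, not taken from the actual project.
    case class LinkRecord(src: String, elem: String, dst: String, date: String)

    // Matching CSV schema, reading all columns as plain strings.
    val schema = StructType(Seq(
      StructField("src", StringType),
      StructField("elem", StringType),
      StructField("dst", StringType),
      StructField("date", StringType)
    ))

    // The app name is arbitrary; session.implicits._ provides the encoder
    // needed by .as[LinkRecord].
    val session = SparkSession.builder().appName("link-dedup").getOrCreate()
    import session.implicits._

Note that the `_.elem === "A@/href"` comparison in the typed filter is not plain Scala string equality and presumably relies on an extra implicit in the original project; with only the imports above one would write `_.elem == "A@/href"`.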