Hello! I have some weird problems with Spark running on top of YARN (Spark 2.2 on Cloudera CDH 5.12).

The executors log a lot of "java.net.SocketException: Network is unreachable" errors; part of a log file is here: http://support.l3s.de/~zab/spark-errors.txt
The jobs also fail at fairly random times, producing anywhere from a few MB up to several GB of the above errors. In the driver output I get the following: http://support.l3s.de/~zab/spark-errors.txt

These errors usually go hand in hand with some dropped packets, but I would assume that TCP can actually handle that. The network backend is based on Intel OmniPath hardware running in connected mode with an MTU of 1500 (just as a safe default at the moment). The nodes can also ping each other without a problem, and their DNS configuration is identical: the same hosts file is deployed to all hosts via Ansible, and the same data is configured in the unbound DNS forwarder.

Several of my code snippets manifest this problem; a current example:

  val data = session.read
    .schema(schema)
    .option("sep", "\t")
    .option("header", false)
    .csv(config.input)
    .as[LinkRecord]

  val filtered = data
    .filter(_.elem === "A@/href")

  val transformed = filtered
    .map(e => e.copy(date = e.date.slice(0, 10) + "T00:00:00.000-00:00"))
    .dropDuplicates(Array("src", "date", "dst"))

  transformed.write
    .option("sep", "\t")
    .option("header", "false")
    .option("compression", "gzip")
    .mode(SaveMode.Append)
    .csv(config.output)

The input data is roughly 2.1 TB on HDFS (~500 billion lines, I think).

I'm honestly running out of ideas on how to debug this problem, and I'm half thinking that the above errors are just masking the real one. I would greatly appreciate any help!

ps. Please CC me, since I'm not subscribed to the mailing list.

Kind regards,
Jan
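pps. For anyone trying to reproduce this locally: LinkRecord and schema are not shown in the snippet above, so here is a minimal, hypothetical sketch of the record type (field names taken from the filter/map/dropDuplicates calls; all-String field types are my assumption), with the per-record date rewrite pulled out as a plain function:

```scala
// Hypothetical reconstruction of the record type used in the snippet above.
// Field names come from the filter/map/dropDuplicates calls; String types are assumed.
case class LinkRecord(src: String, date: String, dst: String, elem: String)

object Demo {
  // The date rewrite from the map() call: keep the first 10 characters
  // (yyyy-MM-dd) and append a fixed midnight timestamp with zero offset.
  def normalizeDate(date: String): String =
    date.slice(0, 10) + "T00:00:00.000-00:00"

  def main(args: Array[String]): Unit = {
    // Illustrative values only; the real input is tab-separated CSV on HDFS.
    val rec = LinkRecord("src-url", "2017-08-15 12:34:56", "dst-url", "A@/href")
    val normalized = rec.copy(date = normalizeDate(rec.date))
    println(normalized.date)  // 2017-08-15T00:00:00.000-00:00
  }
}
```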