Hello!

I'm having some weird problems with Spark running on top of YARN
(Spark 2.2 on Cloudera CDH 5.12).

There are a lot of "java.net.SocketException: Network is unreachable"
errors in the executors; part of a log file is here:
http://support.l3s.de/~zab/spark-errors.txt. The jobs also fail at
rather random times, with anywhere from several MBs up to GBs' worth
of the above errors.

In the driver output I get the following:
http://support.l3s.de/~zab/spark-errors.txt

These errors usually go hand in hand with some dropped packets, but I
would have assumed that TCP can handle a few retransmissions?
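
One thing I have been considering is making the network/shuffle layer
more tolerant before digging deeper. A minimal sketch of what I had in
mind, using the documented Spark 2.2 configuration keys (the values
are guesses, and the app name is a placeholder):

    import org.apache.spark.sql.SparkSession

    // Sketch only: raise the network timeout and shuffle retry settings
    // to rule out transient drops (defaults: 120s / 3 / 5s).
    val session = SparkSession.builder()
      .appName("link-extraction") // placeholder
      .config("spark.network.timeout", "600s")
      .config("spark.shuffle.io.maxRetries", "10")
      .config("spark.shuffle.io.retryWait", "30s")
      .getOrCreate()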

The network backend is based on Intel OmniPath hardware running in
connected mode with an MTU of 1500 (just as a safe default for the
moment).

The nodes can also ping each other without a problem, and their DNS
configuration is identical: the same hosts file is deployed to all
hosts via Ansible, and the same data is configured in the unbound DNS
forwarder.

I have several code snippets that manifest this problem; here is the
current example:

    import org.apache.spark.sql.SaveMode
    import session.implicits._ // required for .as[LinkRecord]

    // Read the tab-separated input with an explicit schema as a typed Dataset.
    val data = session.read
      .schema(schema)
      .option("sep", "\t")
      .option("header", "false")
      .csv(config.input)
      .as[LinkRecord]

    // Keep only the <a href> records; plain Scala equality ("=="), since
    // this is a typed filter over LinkRecord, not a Column expression.
    val filtered = data
      .filter(_.elem == "A@/href")

    // Truncate the timestamp to the day and de-duplicate per (src, date, dst).
    val transformed = filtered
      .map(e => e.copy(date = e.date.slice(0, 10) + "T00:00:00.000-00:00"))
      .dropDuplicates(Array("src", "date", "dst"))

    transformed.write
      .option("sep", "\t")
      .option("header", "false")
      .option("compression", "gzip")
      .mode(SaveMode.Append)
      .csv(config.output)
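
For reference, here is a simplified sketch of LinkRecord, inferred
only from the fields the snippet uses; the actual definition may
differ:

    // Hypothetical shape of LinkRecord, based on the fields used
    // above (src, elem, dst, date).
    case class LinkRecord(src: String, elem: String, dst: String, date: String)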

The input data is roughly 2.1 TB (~500 billion lines, I think) and
sits on HDFS.

I'm honestly running out of ideas on how to debug this problem, and
I'm half convinced that the above errors are just masking the real
one.

I would greatly appreciate any help!

PS: Please CC me, since I'm not subscribed to the mailing list.

Kind regards,

      Jan

