Suren Hiraman created SPARK-2202:
------------------------------------

             Summary: saveAsTextFile hangs on final 2 tasks
                 Key: SPARK-2202
                 URL: https://issues.apache.org/jira/browse/SPARK-2202
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.0.0
         Environment: CentOS 5.7
16 nodes, 24 cores per node, 14g RAM per executor
            Reporter: Suren Hiraman
            Priority: Blocker


I have a flow that takes in about 10 GB of data and writes out about 10 GB of 
data.

The final step is saveAsTextFile() to HDFS. The job hangs on the last 2 
remaining tasks, which always run on the same node.

It seems that the 2 tasks are waiting for data from a remote task/RDD partition.

After about 2 hours, the stuck tasks get a closed-connection exception 
("Connection reset by peer"), and the remote side logs the dropped connection 
as well. Log lines are below.

My custom settings are:

        conf.set("spark.executor.memory", "14g")     // TODO make this configurable
        
        // shuffle configs
        conf.set("spark.default.parallelism", "320")
        conf.set("spark.shuffle.file.buffer.kb", "200")
        conf.set("spark.reducer.maxMbInFlight", "96")
        
        conf.set("spark.rdd.compress","true")
        
        conf.set("spark.worker.timeout","180")
        
        // akka settings
        conf.set("spark.akka.threads", "300")
        conf.set("spark.akka.timeout", "180")
        conf.set("spark.akka.frameSize", "100")
        conf.set("spark.akka.batchSize", "30")
        conf.set("spark.akka.askTimeout", "30")
        
        // block manager
        conf.set("spark.storage.blockManagerTimeoutIntervalMs", "180000")
        conf.set("spark.blockManagerHeartBeatMs", "80000")
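
One way the TODO on spark.executor.memory might be handled is sketched below. 
This is only an illustration, not what the flow currently does: the property 
name "flow.executor.memory" is hypothetical and would be passed to the driver 
JVM (for example as -Dflow.executor.memory=14g), falling back to the 14g used 
above.

```scala
import org.apache.spark.SparkConf

// Assumption: "flow.executor.memory" is a made-up JVM system property name
// chosen for this sketch; it is not a Spark setting.
val executorMemory = sys.props.getOrElse("flow.executor.memory", "14g")

val conf = new SparkConf()
// Falls back to the hard-coded 14g when the property is not set.
conf.set("spark.executor.memory", executorMemory)
```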


"STUCK" WORKER

14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.103,57626)
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
        at sun.nio.ch.IOUtil.read(IOUtil.java:224)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
        at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)


REMOTE WORKER

14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found



--
This message was sent by Atlassian JIRA
(v6.2#6252)
