Hi,
We are encountering a problem very similar to this one:
https://www.mail-archive.com/search?l=user@spark.apache.org&q=subject:%22Spark+task+hangs+infinitely+when+accessing+S3+from+AWS%22&o=newest&f=1

When reading a large amount of data from S3, one or more tasks hang. It doesn't happen every time, but it is fairly consistent: in at least about 1 out of 3 runs.

Spark version: 1.5
Mesos slaves: 40 Amazon r3.xlarge machines (4 cores, 30 GB each)
Total data read from S3: ~380 GB

*Spark config that's not default:*
spark.mesos.coarse = true
--conf spark.sql.shuffle.partitions=300
--conf spark.executor.memory=25G
--conf spark.sql.tungsten.enabled=false

*The thread dump of the hanging task:*
Executor task launch worker-3[1] where
  [1] java.net.SocketInputStream.socketRead0 (native method)
  [2] java.net.SocketInputStream.socketRead (SocketInputStream.java:116)
  [3] java.net.SocketInputStream.read (SocketInputStream.java:170)
  [4] java.net.SocketInputStream.read (SocketInputStream.java:141)
  [5] sun.security.ssl.InputRecord.readFully (InputRecord.java:465)
  [6] sun.security.ssl.InputRecord.read (InputRecord.java:503)
  [7] sun.security.ssl.SSLSocketImpl.readRecord (SSLSocketImpl.java:961)
  [8] sun.security.ssl.SSLSocketImpl.performInitialHandshake (SSLSocketImpl.java:1,363)
  [9] sun.security.ssl.SSLSocketImpl.startHandshake (SSLSocketImpl.java:1,391)
  [10] sun.security.ssl.SSLSocketImpl.startHandshake (SSLSocketImpl.java:1,375)
  [11] org.apache.http.conn.ssl.SSLSocketFactory.connectSocket (SSLSocketFactory.java:533)
  [12] org.apache.http.conn.ssl.SSLSocketFactory.connectSocket (SSLSocketFactory.java:401)
  [13] org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection (DefaultClientConnectionOperator.java:177)
  [14] org.apache.http.impl.conn.ManagedClientConnectionImpl.open (ManagedClientConnectionImpl.java:304)
  [15] org.apache.http.impl.client.DefaultRequestDirector.tryConnect (DefaultRequestDirector.java:610)
  [16] org.apache.http.impl.client.DefaultRequestDirector.execute (DefaultRequestDirector.java:445)
  [17] org.apache.http.impl.client.AbstractHttpClient.doExecute (AbstractHttpClient.java:863)
  [18] org.apache.http.impl.client.CloseableHttpClient.execute (CloseableHttpClient.java:82)
  [19] org.apache.http.impl.client.CloseableHttpClient.execute (CloseableHttpClient.java:57)
  [20] com.amazonaws.http.AmazonHttpClient.executeHelper (AmazonHttpClient.java:384)
  [21] com.amazonaws.http.AmazonHttpClient.execute (AmazonHttpClient.java:232)
  [22] com.amazonaws.services.s3.AmazonS3Client.invoke (AmazonS3Client.java:3,528)
  [23] com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata (AmazonS3Client.java:976)
  [24] com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata (AmazonS3Client.java:956)
  [25] org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus (S3AFileSystem.java:892)
  [26] org.apache.hadoop.fs.s3a.S3AFileSystem.open (S3AFileSystem.java:373)
  [27] org.apache.hadoop.fs.FileSystem.open (FileSystem.java:711)
  [28] org.apache.hadoop.mapred.LineRecordReader.<init> (LineRecordReader.java:93)
  [29] org.apache.hadoop.mapred.TextInputFormat.getRecordReader (TextInputFormat.java:54)
  [30] org.apache.spark.rdd.HadoopRDD$$anon$1.<init> (HadoopRDD.scala:239)
  [31] org.apache.spark.rdd.HadoopRDD.compute (HadoopRDD.scala:216)
  [32] org.apache.spark.rdd.HadoopRDD.compute (HadoopRDD.scala:101)
  [33] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)
  [34] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)
  [35] org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:38)
  [36] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)
  [37] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)
  [38] org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:38)
  [39] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)
  [40] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)
  [41] org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:38)
  [42] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)
  [43] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)
  [44] org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:38)
  [45] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)
  [46] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)
  [47] org.apache.spark.scheduler.ShuffleMapTask.runTask (ShuffleMapTask.scala:73)
  [48] org.apache.spark.scheduler.ShuffleMapTask.runTask (ShuffleMapTask.scala:41)
  [49] org.apache.spark.scheduler.Task.run (Task.scala:88)
  [50] org.apache.spark.executor.Executor$TaskRunner.run (Executor.scala:214)
  [51] java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1,142)
  [52] java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
  [53] java.lang.Thread.run (Thread.java:745)

*On the mesos-slave where the task is hung:*

*lsof -p 27391 | grep amazon*
java 27391 matrix-data-pipes-api 155u IPv6 611285 0t0 TCP devu-saPerf3r-mesosslave-10-1-69-91:35665->s3-1-w.amazonaws.com:https (ESTABLISHED)

*In gdb, info threads (the thread is blocked in recv on fd 155, the same descriptor lsof reports above):*
12 Thread 0x7f09d5741700 (LWP 27493) "java" 0x00007f0a016827eb in __libc_recv (fd=155, buf=0x7f09d572d6b0, n=5, flags=-1) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33

*In jdb:*

*Print the request to S3:*
Executor task launch worker-3[20] dump request
 request = {
    resourcePath: "devu/saPerf3r-bb6bf633-e03f-43d7-b049-1e0166762e4b/agent_smith_passthrough_performance12/v2/_full/part-r-00276-65a9fb14-28ac-4ac7-9901-c73f4754938d"
    parameters: instance of java.util.HashMap(id=8031)
    headers: instance of java.util.HashMap(id=8032)
    endpoint: instance of java.net.URI(id=8033)
    serviceName: "Amazon S3"
    originalRequest: instance of com.amazonaws.services.s3.model.GetObjectMetadataRequest(id=8035)
    httpMethod: instance of com.amazonaws.http.HttpMethodName(id=8036)
    content: null
    timeOffset: 0
    metrics: instance of com.amazonaws.util.AWSRequestMetrics(id=8023)
 }

*Print the local variables in SSLSocketFactory.connectSocket:*
Executor task launch worker-3[11] locals
Method arguments:
connectTimeout = 50000
socket = instance of sun.security.ssl.SSLSocketImpl(id=8016)
host = instance of org.apache.http.HttpHost(id=8017)
remoteAddress = instance of org.apache.http.conn.HttpInetSocketAddress(id=8018)
localAddress = null
context = null
Local variables:
sock = instance of sun.security.ssl.SSLSocketImpl(id=8016)
sslsock = instance of sun.security.ssl.SSLSocketImpl(id=8016)

I'm not able to print the local variables for SocketInputStream.socketRead in jdb because the JDK is not compiled with -g.

Could you please advise what the problem might be? Please let me know if I should provide other information or run other experiments to help diagnose the problem.

Thanks a lot in advance!
Sa
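
P.S. The non-default settings in this message are written in two different notations (one as "key = value", the rest as --conf flags). In case it helps anyone reproduce the setup, here is a minimal sketch that renders all four uniformly, assuming each one is passed to spark-submit as a --conf flag. The dictionary name and the helper to_conf_flags are only illustrative and are not part of Spark.

```python
# Non-default Spark settings taken from this report, as key/value pairs.
NON_DEFAULT_CONF = {
    "spark.mesos.coarse": "true",
    "spark.sql.shuffle.partitions": "300",
    "spark.executor.memory": "25G",
    "spark.sql.tungsten.enabled": "false",
}


def to_conf_flags(conf):
    """Render a settings dict as a flat list of spark-submit arguments.

    Hypothetical helper for illustration; it only formats strings.
    """
    flags = []
    for key, value in conf.items():
        flags.extend(["--conf", f"{key}={value}"])
    return flags


if __name__ == "__main__":
    # Prints the flags on one line, ready to paste after "spark-submit".
    print(" ".join(to_conf_flags(NON_DEFAULT_CONF)))
```

The master URL, application jar, and any other arguments are deliberately left out, since they are not given in this report.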