Hi,

We are encountering a problem very similar to this one:

https://www.mail-archive.com/search?l=user@spark.apache.org&q=subject:%22Spark+task+hangs+infinitely+when+accessing+S3+from+AWS%22&o=newest&f=1


When reading a large amount of data from S3, one or several tasks hang. It
doesn't happen every time, but fairly consistently, in at least 1 out of 3
runs.


Spark 1.5

Mesos slaves: 40 Amazon r3.xlarge (4 cores, 30 GB) machines.

total data read from S3: ~380 GB


*Non-default Spark config (the full submit command is sketched below):*

spark.mesos.coarse = true
--conf spark.sql.shuffle.partitions=300
--conf spark.executor.memory=25G
--conf spark.sql.tungsten.enabled=false
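
For completeness, the job is submitted roughly like the following (the master
URL, class name, and jar are placeholders; the settings are the ones listed
above):

spark-submit --master mesos://<master>:<port> \
  --class <MainClass> \
  --conf spark.mesos.coarse=true \
  --conf spark.sql.shuffle.partitions=300 \
  --conf spark.executor.memory=25G \
  --conf spark.sql.tungsten.enabled=false \
  <job-assembly>.jar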


*The thread dump of the hanging task:*

Executor task launch worker-3[1] where
  [1] java.net.SocketInputStream.socketRead0 (native method)
  [2] java.net.SocketInputStream.socketRead (SocketInputStream.java:116)
  [3] java.net.SocketInputStream.read (SocketInputStream.java:170)
  [4] java.net.SocketInputStream.read (SocketInputStream.java:141)
  [5] sun.security.ssl.InputRecord.readFully (InputRecord.java:465)
  [6] sun.security.ssl.InputRecord.read (InputRecord.java:503)
  [7] sun.security.ssl.SSLSocketImpl.readRecord (SSLSocketImpl.java:961)
  [8] sun.security.ssl.SSLSocketImpl.performInitialHandshake (SSLSocketImpl.java:1,363)
  [9] sun.security.ssl.SSLSocketImpl.startHandshake (SSLSocketImpl.java:1,391)
  [10] sun.security.ssl.SSLSocketImpl.startHandshake (SSLSocketImpl.java:1,375)
  [11] org.apache.http.conn.ssl.SSLSocketFactory.connectSocket (SSLSocketFactory.java:533)
  [12] org.apache.http.conn.ssl.SSLSocketFactory.connectSocket (SSLSocketFactory.java:401)
  [13] org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection (DefaultClientConnectionOperator.java:177)
  [14] org.apache.http.impl.conn.ManagedClientConnectionImpl.open (ManagedClientConnectionImpl.java:304)
  [15] org.apache.http.impl.client.DefaultRequestDirector.tryConnect (DefaultRequestDirector.java:610)
  [16] org.apache.http.impl.client.DefaultRequestDirector.execute (DefaultRequestDirector.java:445)
  [17] org.apache.http.impl.client.AbstractHttpClient.doExecute (AbstractHttpClient.java:863)
  [18] org.apache.http.impl.client.CloseableHttpClient.execute (CloseableHttpClient.java:82)
  [19] org.apache.http.impl.client.CloseableHttpClient.execute (CloseableHttpClient.java:57)
  [20] com.amazonaws.http.AmazonHttpClient.executeHelper (AmazonHttpClient.java:384)
  [21] com.amazonaws.http.AmazonHttpClient.execute (AmazonHttpClient.java:232)
  [22] com.amazonaws.services.s3.AmazonS3Client.invoke (AmazonS3Client.java:3,528)
  [23] com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata (AmazonS3Client.java:976)
  [24] com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata (AmazonS3Client.java:956)
  [25] org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus (S3AFileSystem.java:892)
  [26] org.apache.hadoop.fs.s3a.S3AFileSystem.open (S3AFileSystem.java:373)
  [27] org.apache.hadoop.fs.FileSystem.open (FileSystem.java:711)
  [28] org.apache.hadoop.mapred.LineRecordReader.<init> (LineRecordReader.java:93)
  [29] org.apache.hadoop.mapred.TextInputFormat.getRecordReader (TextInputFormat.java:54)
  [30] org.apache.spark.rdd.HadoopRDD$$anon$1.<init> (HadoopRDD.scala:239)
  [31] org.apache.spark.rdd.HadoopRDD.compute (HadoopRDD.scala:216)
  [32] org.apache.spark.rdd.HadoopRDD.compute (HadoopRDD.scala:101)
  [33] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)
  [34] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)
  [35] org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:38)
  [36] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)
  [37] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)
  [38] org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:38)
  [39] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)
  [40] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)
  [41] org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:38)
  [42] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)
  [43] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)
  [44] org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:38)
  [45] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)
  [46] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)
  [47] org.apache.spark.scheduler.ShuffleMapTask.runTask (ShuffleMapTask.scala:73)
  [48] org.apache.spark.scheduler.ShuffleMapTask.runTask (ShuffleMapTask.scala:41)
  [49] org.apache.spark.scheduler.Task.run (Task.scala:88)
  [50] org.apache.spark.executor.Executor$TaskRunner.run (Executor.scala:214)
  [51] java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1,142)
  [52] java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
  [53] java.lang.Thread.run (Thread.java:745)
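
We can also grab a plain jstack dump of the hung executor (e.g. jstack -l
27391, where 27391 is the executor PID from the lsof output below) if that is
more useful than the jdb frames above.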


*On the Mesos slave where the task is hung:*

*lsof -p 27391 | grep amazon*

java    27391 matrix-data-pipes-api  155u  IPv6  611285  0t0  TCP devu-saPerf3r-mesosslave-10-1-69-91:35665->s3-1-w.amazonaws.com:https (ESTABLISHED)


*in gdb, info threads:*

12   Thread 0x7f09d5741700 (LWP 27493) "java" 0x00007f0a016827eb in __libc_recv (fd=155, buf=0x7f09d572d6b0, n=5, flags=-1) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33


*in jdb:*

*print the request to S3:*

Executor task launch worker-3[20] dump request
 request = {
    resourcePath: "devu/saPerf3r-bb6bf633-e03f-43d7-b049-1e0166762e4b/agent_smith_passthrough_performance12/v2/_full/part-r-00276-65a9fb14-28ac-4ac7-9901-c73f4754938d"
    parameters: instance of java.util.HashMap(id=8031)
    headers: instance of java.util.HashMap(id=8032)
    endpoint: instance of java.net.URI(id=8033)
    serviceName: "Amazon S3"
    originalRequest: instance of com.amazonaws.services.s3.model.GetObjectMetadataRequest(id=8035)
    httpMethod: instance of com.amazonaws.http.HttpMethodName(id=8036)
    content: null
    timeOffset: 0
    metrics: instance of com.amazonaws.util.AWSRequestMetrics(id=8023)
}


*Print the local variables in SSLSocketFactory.connectSocket:*

Executor task launch worker-3[11] locals
Method arguments:
connectTimeout = 50000
socket = instance of sun.security.ssl.SSLSocketImpl(id=8016)
host = instance of org.apache.http.HttpHost(id=8017)
remoteAddress = instance of org.apache.http.conn.HttpInetSocketAddress(id=8018)
localAddress = null
context = null
Local variables:
sock = instance of sun.security.ssl.SSLSocketImpl(id=8016)
sslsock = instance of sun.security.ssl.SSLSocketImpl(id=8016)
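
For reference, the connectTimeout of 50000 ms above appears to come from the
s3a connector's defaults rather than anything we set explicitly. If it would
help to narrow things down, we could experiment with the s3a client
timeout/retry settings along these lines (just a sketch: we believe these
fs.s3a.* keys are passed through via Spark's spark.hadoop.* prefix and map
onto the AWS client's connect/socket timeouts and retry count, but the values
below are arbitrary examples, the establish.timeout key may not exist in older
hadoop-aws versions, and we haven't verified that changing them avoids the
hang):

--conf spark.hadoop.fs.s3a.connection.timeout=200000
--conf spark.hadoop.fs.s3a.connection.establish.timeout=30000
--conf spark.hadoop.fs.s3a.attempts.maximum=20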


I'm not able to print the local variables for SocketInputStream.socketRead in
jdb because the JDK is not compiled with -g.


Could you please advise what the problem might be? Please let me know if I
should provide any other information or run other experiments to help diagnose
it.


Thanks a lot in advance!

Sa
