It looks like you have a data skew issue: the shuffle read size for executor 4 is almost twice that of the other executors, and its GC time (11s) is roughly 15 to 20 times higher than the others'.
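If a few hot keys dominate the shuffle, one common mitigation is to "salt" the key so the hot partition is split across several tasks, then strip the salt and merge the partial results. A minimal sketch of that two-phase aggregation in plain Python (not Spark; `salted_sum`, the salt count, and the toy data are made up purely to illustrate the idea):

```python
import random
from collections import defaultdict

def salted_sum(records, num_salts=4):
    """Two-phase sum over (key, value) pairs.

    Phase 1 spreads each key across num_salts salted sub-keys
    (playing the role of the extra shuffle partitions in Spark);
    phase 2 strips the salt and merges the partial sums.
    """
    # Phase 1: partial sums keyed by (key, salt).
    partial = defaultdict(int)
    for key, value in records:
        salt = random.randrange(num_salts)
        partial[(key, salt)] += value

    # Phase 2: drop the salt and combine the partials per key.
    final = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal
    return dict(final)

# A skewed dataset: key "hot" carries almost all of the rows.
data = [("hot", 1)] * 1000 + [("cold", 1)] * 10
print(salted_sum(data))
```

In Spark itself the same effect is usually achieved by adding a random salt column before the wide aggregation and aggregating twice, or by repartitioning on a better-distributed key.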
Kathleen

Sent from my iPhone

> On Oct 7, 2018, at 5:24 AM, 阎志涛 <tony....@tendcloud.com> wrote:
>
> Hi, All,
>
> I am running Spark 2.1 on Hadoop 2.7.2 with YARN. While executing Spark tasks, some executors keep running forever without succeeding. From the following screenshot:
>
> <image002.jpg>
>
> we can see that executor 4 has been running for 26 minutes, and its shuffle read size/records have stayed unchanged for those 26 minutes too. The thread dump for that thread is as follows:
>
> <image004.jpg>
>
> <image009.jpg>
>
> The Linux version is: Linux version 4.14.62-70.117.amzn2.x86_64 (mockbuild@ip-10-0-1-79), and the JDK version is Oracle JDK 1.8.0_181. Running jstack on the machine, I can see the following thread dump:
>
> "Executor task launch worker for task 3806" #54 daemon prio=5 os_prio=0 tid=0x0000000001230800 nid=0x1fc runnable [0x00007fba0e600000]
>    java.lang.Thread.State: RUNNABLE
>         at java.lang.StringCoding.encode(StringCoding.java:364)
>         at java.lang.String.getBytes(String.java:941)
>         at org.apache.spark.unsafe.types.UTF8String.fromString(UTF8String.java:109)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>         at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>         at org.apache.spark.scheduler.Task.run(Task.scala:99)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> I wonder why this happened? Is it related to my environment, or is it a bug in Spark?
>
> Thanks and Regards,
> Tony
>
> 阎志涛
> VP of R&D
>
> M +86-139 1181 5695
> WeChat: zhitao_yan
>
> Beijing TendCloud Technology Co., Ltd.
> Room 608, Building 2, No. 39 Dongzhimenwai Street, Beijing 100027
>
> TalkingData.com