Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Andrew Ash Mon, 14 Jul 2014 21:06:06 -0700

I'm not sure either of those PRs will fix the concurrent adds to
Configuration issue I observed. I've got a stack trace and writeup I'll
share in an hour or two (traveling today).
On Jul 14, 2014 9:50 PM, "scwf" <wangf...@huawei.com> wrote:


> Hi, Cody
>   I ran into this issue a few days ago and posted a PR for it
> (https://github.com/apache/spark/pull/1385).
> Strangely, synchronizing on conf deadlocks, but synchronizing on
> initLocalJobConfFuncOpt works fine.
>
>
>  Here's the entire jstack output.
>>
>>
>> On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>
>>     Hey Cody,
>>
>>     This Jstack seems truncated, would you mind giving the entire stack
>>     trace? For the second thread, for instance, we can't see where the
>>     lock is being acquired.
>>
>>     - Patrick
>>
>>     On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
>>     <cody.koenin...@mediacrossing.com> wrote:
>>      > Hi all, just wanted to give a heads up that we're seeing a reproducible
>>      > deadlock with Spark 1.0.1 on Hadoop 2.3.0-mr1-cdh5.0.2.
>>      >
>>      > If JIRA is a better place for this, apologies in advance - figured talking
>>      > about it on the mailing list was friendlier than randomly (re)opening JIRA
>>      > tickets.
>>      >
>>      > I know Gary had mentioned some issues with 1.0.1 on the mailing list; once
>>      > we got a thread dump I wanted to follow up.
>>      >
>>      > The thread dump shows the deadlock occurs in the synchronized block of code
>>      > that was changed in HadoopRDD.scala for the Spark-1097 issue.
>>      >
>>      > Relevant portions of the thread dump are summarized below; we can provide
>>      > the whole dump if it's useful.
>>      >
>>      > Found one Java-level deadlock:
>>      > =============================
>>      > "Executor task launch worker-1":
>>      >   waiting to lock monitor 0x00007f250400c520 (object 0x00000000fae7dc30, a org.apache.hadoop.conf.Configuration),
>>      >   which is held by "Executor task launch worker-0"
>>      > "Executor task launch worker-0":
>>      >   waiting to lock monitor 0x00007f2520495620 (object 0x00000000faeb4fc8, a java.lang.Class),
>>      >   which is held by "Executor task launch worker-1"
>>      >
>>      >
>>      > "Executor task launch worker-1":
>>      >         at org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
>>      >         - waiting to lock <0x00000000fae7dc30> (a org.apache.hadoop.conf.Configuration)
>>      >         at org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
>>      >         - locked <0x00000000faca6ff8> (a java.lang.Class for org.apache.hadoop.conf.Configuration)
>>      >         at org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
>>      >         at org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110)
>>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>      >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>      >         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>>      >         at java.lang.Class.newInstance0(Class.java:374)
>>      >         at java.lang.Class.newInstance(Class.java:327)
>>      >         at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
>>      >         at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
>>      >         at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
>>      >         - locked <0x00000000faeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
>>      >         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>>      >         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>>      >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>>      >         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>>      >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>>      >         at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>>      >         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>>      >         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>>      >         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>      >         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>      >         at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>>      >
>>      >
>>      >
>>      > ...elided...
>>      >
>>      >
>>      > "Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800 nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
>>      >    java.lang.Thread.State: BLOCKED (on object monitor)
>>      >         at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
>>      >         - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
>>      >         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>>      >         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>>      >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>>      >         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>>      >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>>      >         at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>>      >         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>>      >         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>>      >         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>      >         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>      >         at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>>
>>
>>
>
> --
>
> Best Regards
> Fei Wang
>
>
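
A note on the quoted thread dump: the two workers are stuck in a lock-order inversion. Per the dump, worker-1 holds the monitor of the FileSystem class and waits for a Configuration instance, while worker-0 holds that Configuration instance and waits for the FileSystem class monitor. The following is a minimal Java sketch of that pattern only, not the actual Hadoop or Spark code. The class LockOrderDemo, its methods, and the two plain lock objects standing in for the Configuration instance and the FileSystem class are all hypothetical. It uses the JDK's ThreadMXBean.findMonitorDeadlockedThreads() to spot the cycle, on the assumption that this is comparable to what jstack reports as a "Java-level deadlock".

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class; none of these names come from Hadoop or Spark.
class LockOrderDemo {

    // Builds a daemon thread that locks `first`, then `second`.
    static Thread worker(String name, Object first, Object second,
                         CountDownLatch rendezvous) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                rendezvous.countDown();
                try {
                    // With a count of 2 this waits until both threads hold
                    // their first lock; with a count of 0 it returns at once.
                    rendezvous.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // Both monitors held; nothing else to do in this sketch.
                }
            }
        }, name);
        t.setDaemon(true); // a stuck thread must not keep the JVM alive
        return t;
    }

    // Returns true if the two demo threads end up in a monitor deadlock.
    static boolean deadlocks(boolean inverted) {
        Object confLock = new Object(); // stand-in for the Configuration instance
        Object fsLock = new Object();   // stand-in for the FileSystem class monitor
        CountDownLatch rendezvous = new CountDownLatch(inverted ? 2 : 0);

        Thread w0 = worker("demo-worker-0", confLock, fsLock, rendezvous);
        Thread w1 = inverted
                ? worker("demo-worker-1", fsLock, confLock, rendezvous)  // opposite order
                : worker("demo-worker-1", confLock, fsLock, rendezvous); // same order
        w0.start();
        w1.start();

        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long deadline = System.nanoTime() + 5_000_000_000L; // 5 seconds
        while (System.nanoTime() < deadline) {
            if (!w0.isAlive() && !w1.isAlive()) {
                return false; // both threads finished, so no deadlock
            }
            long[] ids = bean.findMonitorDeadlockedThreads(); // null if none found
            if (ids != null) {
                for (long id : ids) {
                    // Only count a deadlock that involves this call's own thread.
                    if (id == w0.getId()) {
                        return true;
                    }
                }
            }
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        throw new IllegalStateException("could not decide within 5 seconds");
    }

    public static void main(String[] args) {
        System.out.println("inverted order deadlocks:   " + deadlocks(true));
        System.out.println("consistent order deadlocks: " + deadlocks(false));
    }
}
```

Running main is expected to report the inverted ordering as deadlocked and the consistent ordering as not. The sketch only shows why taking the two monitors in one fixed order removes the cycle; which lock the HadoopRDD code should actually synchronize on is a separate question that this thread leaves open.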
