Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Patrick Wendell Mon, 14 Jul 2014 15:58:20 -0700

Hey Nishkam,

Aaron's fix should prevent two concurrent accesses to getJobConf (and
the Hadoop code therein). But if there is code elsewhere that tries to
mutate the configuration, then I could see how we might still have the
ConcurrentModificationException.


I looked at your patch for HADOOP-10456 and the only example you give
is of the data being accessed inside of getJobConf. Is it accessed
somewhere else too from Spark that you are aware of?

https://issues.apache.org/jira/browse/HADOOP-10456

- Patrick

On Mon, Jul 14, 2014 at 3:28 PM, Nishkam Ravi <nr...@cloudera.com> wrote:
> Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object would
> help. I suspect we will start seeing the ConcurrentModificationException
> again. The right fix has gone into Hadoop through 10456. Unfortunately, I
> don't have any bright ideas on how to synchronize this at the Spark level
> without the risk of deadlocks.
>
>
> On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> The full jstack would still be useful, but our current working theory is
>> that this is due to the fact that Configuration#loadDefaults goes through
>> every Configuration object that was ever created (via
>> Configuration.REGISTRY) and locks it, thus introducing a dependency from
>> new Configuration to old, otherwise unrelated, Configuration objects that
>> our locking did not anticipate.
>>
>> I have created https://github.com/apache/spark/pull/1409 to hopefully fix
>> this bug.
>>
>>
>> On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell <pwend...@gmail.com>
>> wrote:
>>
>> > Hey Cody,
>> >
>> > This Jstack seems truncated, would you mind giving the entire stack
>> > trace? For the second thread, for instance, we can't see where the
>> > lock is being acquired.
>> >
>> > - Patrick
>> >
>> > On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
>> > <cody.koenin...@mediacrossing.com> wrote:
>> > > Hi all, just wanted to give a heads up that we're seeing a reproducible
>> > > deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
>> > >
>> > > If jira is a better place for this, apologies in advance - figured
>> > talking
>> > > about it on the mailing list was friendlier than randomly (re)opening
>> > jira
>> > > tickets.
>> > >
>> > > I know Gary had mentioned some issues with 1.0.1 on the mailing list,
>> > once
>> > > we got a thread dump I wanted to follow up.
>> > >
>> > > The thread dump shows the deadlock occurs in the synchronized block of
>> > code
>> > > that was changed in HadoopRDD.scala, for the Spark-1097 issue
>> > >
>> > > Relevant portions of the thread dump are summarized below, we can
>> provide
>> > > the whole dump if it's useful.
>> > >
>> > > Found one Java-level deadlock:
>> > > =============================
>> > > "Executor task launch worker-1":
>> > >   waiting to lock monitor 0x00007f250400c520 (object
>> 0x00000000fae7dc30,
>> > a
>> > > org.apache.hadoop.co
>> > > nf.Configuration),
>> > >   which is held by "Executor task launch worker-0"
>> > > "Executor task launch worker-0":
>> > >   waiting to lock monitor 0x00007f2520495620 (object
>> 0x00000000faeb4fc8,
>> > a
>> > > java.lang.Class),
>> > >   which is held by "Executor task launch worker-1"
>> > >
>> > >
>> > > "Executor task launch worker-1":
>> > >         at
>> > >
>> >
>> org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
>> > >         - waiting to lock <0x00000000fae7dc30> (a
>> > > org.apache.hadoop.conf.Configuration)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
>> > >         - locked <0x00000000faca6ff8> (a java.lang.Class for
>> > > org.apache.hadoop.conf.Configurati
>> > > on)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110
>> > > )
>> > >         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> > > Method)
>> > >         at
>> > >
>> >
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
>> > > java:57)
>> > >         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> > > Method)
>> > >         at
>> > >
>> >
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
>> > > java:57)
>> > >         at
>> > >
>> >
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAcces
>> > > sorImpl.java:45)
>> > >         at
>> > java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>> > >         at java.lang.Class.newInstance0(Class.java:374)
>> > >         at java.lang.Class.newInstance(Class.java:327)
>> > >         at
>> > java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
>> > >         at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
>> > >         at
>> > > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
>> > >         - locked <0x00000000faeb4fc8> (a java.lang.Class for
>> > > org.apache.hadoop.fs.FileSystem)
>> > >         at
>> > >
>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>> > >         at
>> > > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>> > >         at
>> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>> > >         at
>> > > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>> > >         at
>> > org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>> > >         at
>> > > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>> > >         at
>> > > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>> > >         at
>> > > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>> > >         at
>> > >
>> >
>> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>> > >
>> > >
>> > >
>> > > ...elided...
>> > >
>> > >
>> > > "Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800
>> > > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
>> > >    java.lang.Thread.State: BLOCKED (on object monitor)
>> > >         at
>> > > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
>> > >         - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for
>> > > org.apache.hadoop.fs.FileSystem)
>> > >         at
>> > >
>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>> > >         at
>> > > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>> > >         at
>> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>> > >         at
>> > > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>> > >         at
>> > org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>> > >         at
>> > > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>> > >         at
>> > > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>> > >         at
>> > > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>> > >         at
>> > >
>> >
>> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>> >
>>

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Reply via email to