Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-15 Thread Cody Koeninger
We tested that patch from aarondav's branch and are no longer seeing the
deadlock. It seems to have solved the problem, at least for us.


On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell pwend...@gmail.com wrote:

 Andrew and Gary,

 Would you guys be able to test
 https://github.com/apache/spark/pull/1409/files and see if it solves
 your problem?

 - Patrick



Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Cody Koeninger
Hi all, just wanted to give a heads up that we're seeing a reproducible
deadlock with Spark 1.0.1 on Hadoop 2.3.0-mr1-cdh5.0.2.

If JIRA is a better place for this, apologies in advance; I figured talking
about it on the mailing list was friendlier than randomly (re)opening JIRA
tickets.

I know Gary had mentioned some issues with 1.0.1 on the mailing list; once
we got a thread dump I wanted to follow up.

The thread dump shows the deadlock occurs in the synchronized block of code
that was changed in HadoopRDD.scala for the SPARK-1097 issue.

Relevant portions of the thread dump are summarized below; we can provide
the whole dump if it's useful.

Found one Java-level deadlock:
=============================
"Executor task launch worker-1":
  waiting to lock monitor 0x7f250400c520 (object 0xfae7dc30, a org.apache.hadoop.conf.Configuration),
  which is held by "Executor task launch worker-0"
"Executor task launch worker-0":
  waiting to lock monitor 0x7f2520495620 (object 0xfaeb4fc8, a java.lang.Class),
  which is held by "Executor task launch worker-1"


"Executor task launch worker-1":
        at org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
        - waiting to lock <0xfae7dc30> (a org.apache.hadoop.conf.Configuration)
        at org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
        - locked <0xfaca6ff8> (a java.lang.Class for org.apache.hadoop.conf.Configuration)
        at org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
        at org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at java.lang.Class.newInstance0(Class.java:374)
        at java.lang.Class.newInstance(Class.java:327)
        at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
        at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
        at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
        - locked <0xfaeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
        at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
        at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
        at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
        at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
        at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
        at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)



...elided...


"Executor task launch worker-0" daemon prio=10 tid=0x01e71800 nid=0x2d97 waiting for monitor entry [0x7f24d2bf1000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
        - waiting to lock <0xfaeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
        at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
        at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
        at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
        at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)

...elided...

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Cody,

This jstack seems truncated; would you mind giving the entire stack
trace? For the second thread, for instance, we can't see where the
lock is being acquired.

- Patrick

On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
cody.koenin...@mediacrossing.com wrote:
 [quoted text elided]

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Aaron Davidson
The full jstack would still be useful, but our current working theory is
that Configuration#loadDefaults goes through every Configuration object
that was ever created (via Configuration.REGISTRY) and locks it, thus
introducing a dependency from a new Configuration to old, otherwise
unrelated Configuration objects that our locking did not anticipate.

I have created https://github.com/apache/spark/pull/1409 to hopefully fix
this bug.
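
To make that concrete, here is a minimal, self-contained sketch of the
lock-order inversion. Conf and Registry are hypothetical stand-ins for
Configuration and Configuration.REGISTRY; this is not Hadoop or Spark code:

import scala.collection.mutable

object DeadlockSketch {
  // Stands in for Configuration.REGISTRY: every Conf ever created.
  object Registry { val all = mutable.Buffer[Conf]() }

  class Conf {
    Registry.synchronized { Registry.all += this }               // instances self-register
    def reload(): Unit = this.synchronized { Thread.sleep(100) } // like reloadConfiguration
  }

  def main(args: Array[String]): Unit = {
    val oldConf = new Conf
    // Worker-1's order: lock an existing Conf, then need the registry monitor.
    val t1 = new Thread(() => oldConf.synchronized {
      Thread.sleep(50); Registry.synchronized { () }
    })
    // Worker-0's order: hold the registry monitor, then lock every Conf
    // ever created; this is the dependency on old, unrelated instances.
    val t2 = new Thread(() => Registry.synchronized {
      Thread.sleep(50); Registry.all.foreach(_.reload())
    })
    t1.start(); t2.start()
    t1.join(); t2.join() // never returns: each thread waits on the other's monitor
  }
}

t1 acquires Conf-then-Registry while t2 acquires Registry-then-Conf; once
each holds its first monitor, neither can take its second. That is the same
ABBA cycle Cody's thread dump shows.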


On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell pwend...@gmail.com wrote:

  [quoted text elided]

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Nishkam Ravi
Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object would
help. I suspect we will start seeing the ConcurrentModificationException
again. The right fix has gone into Hadoop through HADOOP-10456.
Unfortunately, I don't have any bright ideas on how to synchronize this at
the Spark level without the risk of deadlocks.
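
A hypothetical sketch of the concern (stand-in names, not Hadoop code):
mutual exclusion only holds if every writer takes the same lock, and
Hadoop-internal writers would not take a Spark-side one.

object WhyPrivateLockIsNotEnough {
  val sharedConf = new java.util.HashMap[String, String]() // stand-in for a shared Configuration
  val sparkLock  = new Object                              // an arbitrary Spark-side lock

  def sparkSidePut(k: String, v: String): Unit =
    sparkLock.synchronized { sharedConf.put(k, v) }        // serialized among Spark callers only

  def hadoopInternalPut(k: String, v: String): Unit =
    sharedConf.put(k, v)                                   // ignores sparkLock: the race survives
}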


On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson ilike...@gmail.com wrote:

 [quoted text elided]

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Nishkam,

Aaron's fix should prevent two concurrent accesses to getJobConf (and
the Hadoop code therein). But if there is code elsewhere that tries to
mutate the configuration, then I could see how we might still have the
ConcurrentModificationException.

I looked at your patch for HADOOP-10456 and the only example you give
is of the data being accessed inside of getJobConf. Is it accessed
somewhere else too from Spark that you are aware of?

https://issues.apache.org/jira/browse/HADOOP-10456

- Patrick

On Mon, Jul 14, 2014 at 3:28 PM, Nishkam Ravi nr...@cloudera.com wrote:
 [quoted text elided]

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Gary Malouf
We use the Hadoop configuration inside our code executing on Spark, as we
need to list out files in the path. Maybe that is why it is exposed for us.


On Mon, Jul 14, 2014 at 6:57 PM, Patrick Wendell pwend...@gmail.com wrote:

 [quoted text elided]

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Gary Malouf
We'll try to run a build tomorrow AM.


On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell pwend...@gmail.com wrote:

 Andrew and Gary,

 Would you guys be able to test
 https://github.com/apache/spark/pull/1409/files and see if it solves
 your problem?

 - Patrick

 On Mon, Jul 14, 2014 at 4:18 PM, Andrew Ash and...@andrewash.com wrote:
  I observed a deadlock here when using the AvroInputFormat as well. The
  short of the issue is that there's one Configuration object per JVM, but
  multiple threads, one for each task. If each thread attempts to add a
  configuration option to the Configuration object at once, you get issues
  because HashMap isn't thread safe.
 
  More details to come tonight. Thanks!
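
  As an illustration of what Andrew describes, here is a minimal,
  hypothetical sketch (plain JVM code, not Spark or Hadoop) of two threads
  sharing one non-thread-safe HashMap:

  import java.util.{HashMap => JHashMap}

  object HashMapRace {
    def main(args: Array[String]): Unit = {
      // Stand-in for a Configuration's internal, non-thread-safe properties map.
      val props = new JHashMap[String, String]()

      // A "task" thread continually adding configuration options.
      val writer = new Thread(() => {
        for (i <- 0 until 1000000) props.put(s"key-$i", "value")
      })
      writer.start()

      // Iterating the same map concurrently typically fails fast with
      // ConcurrentModificationException; concurrent puts can also corrupt
      // the map's internal structure silently.
      try {
        while (writer.isAlive) {
          val it = props.entrySet().iterator()
          while (it.hasNext) it.next()
        }
        println("no exception this run (the race is timing-dependent)")
      } catch {
        case e: java.util.ConcurrentModificationException => println("Caught: " + e)
      }
      writer.join()
    }
  }
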
  On Jul 14, 2014 4:11 PM, Nishkam Ravi nr...@cloudera.com wrote:
 
  Hi Patrick, I'm not aware of another place where the access happens, but
  it's possible that it does. The original fix synchronized on the
  broadcastConf object and someone reported the same exception.
 
 
  On Mon, Jul 14, 2014 at 3:57 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
   [quoted text elided]

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Aaron Davidson
The patch won't solve the problem where two threads try to add a
configuration option at the same time, but I think there is currently an
issue where two threads can try to initialize the Configuration at the
same time and still run into a ConcurrentModificationException. This at
least reduces (slightly) the scope of the exception, although eliminating
it entirely may not be possible.
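
As a hedged sketch of that narrowing, assuming the mechanism is a single
Spark-private lock serializing Configuration construction (see PR #1409
itself for the real change):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

// Hypothetical sketch, not the actual PR #1409 diff: all JobConf creation
// goes through one process-wide lock, so two threads can no longer run
// Configuration initialization (loadDefaults) concurrently. Concurrent
// set() calls on an already-shared Configuration remain unprotected,
// which is the residual exposure described above.
object ConfInstantiationGuard {
  private val instantiationLock = new Object

  def newJobConf(base: Configuration): JobConf =
    instantiationLock.synchronized { new JobConf(base) }
}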


On Mon, Jul 14, 2014 at 4:35 PM, Gary Malouf malouf.g...@gmail.com wrote:

 [quoted text elided]

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread scwf

Hi Cody,
I met this issue a few days ago and posted a PR for it
(https://github.com/apache/spark/pull/1385).
It's very strange: if I synchronize on conf it deadlocks, but it is OK when
I synchronize on initLocalJobConfFuncOpt.
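
A minimal sketch of the two locking choices being contrasted, as a
hypothetical simplification rather than the actual PR #1385 diff:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

// Locking the shared Configuration uses the same monitor Hadoop's own code
// acquires (reloadConfiguration, reached when the REGISTRY is walked), so it
// can join a lock cycle with the FileSystem class monitor: the deadlock in
// the dump. Locking a Spark-private object still serializes Spark's callers
// against each other without entering Hadoop's lock ordering, which matches
// the observation above, though as noted elsewhere in the thread it cannot
// exclude Hadoop-internal races.
class JobConfBuilder(conf: Configuration) {
  private val privateLock = new Object // plays the role of initLocalJobConfFuncOpt

  def viaSharedConfLock(): JobConf =   // the variant that deadlocked
    conf.synchronized { new JobConf(conf) }

  def viaPrivateLock(): JobConf =      // the variant that was OK
    privateLock.synchronized { new JobConf(conf) }
}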



Here's the entire jstack output.


On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell pwend...@gmail.com wrote:

 [quoted text elided]

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
I'm not sure either of those PRs will fix the concurrent adds to
Configuration issue I observed. I've got a stack trace and writeup I'll
share in an hour or two (traveling today).
On Jul 14, 2014 9:50 PM, scwf wangf...@huawei.com wrote:

 [quoted text elided]

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Andrew, is your issue also a regression from 1.0.0 to 1.0.1? The
immediate priority is addressing regressions between these two
releases.

On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash and...@andrewash.com wrote:
 [quoted text elided]

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
I don't believe mine is a regression. But it is related to thread safety on
Hadoop Configuration objects. Should I start a new thread?
On Jul 15, 2014 12:55 AM, Patrick Wendell pwend...@gmail.com wrote:

 [quoted text elided]

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Andrew,

Yeah, that would be preferable. Definitely worth investigating both,
but the regression is more pressing at the moment.

- Patrick

On Mon, Jul 14, 2014 at 10:02 PM, Andrew Ash and...@andrewash.com wrote:
 [quoted text elided]