[jira] [Updated] (SPARK-22589) Subscribe to multiple roles in Mesos

2017-11-23 Thread Fabiano Francesconi (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-22589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabiano Francesconi updated SPARK-22589:

Description: 
Mesos offers the capability of [subscribing to multiple roles|http://mesos.apache.org/documentation/latest/roles/].
I believe that Spark could easily be extended to opt in to this specific capability.

From my understanding, this is the [Spark source code|https://github.com/apache/spark/blob/fc45c2c88a838b8f46659ebad2a8f3a9923bc95f/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L94]
that regulates the subscription to the role. I wonder whether just passing a comma-separated
list of roles (hence, splitting on that string) would already be sufficient to leverage this
capability.

Is there any side-effect that this change will cause?

  was:
Mesos offers the capability of [subscribing to multiple roles|http://mesos.apache.org/documentation/latest/roles/].
I believe that Spark could easily be extended to opt in to this specific capability.

From my understanding, this is the [Spark source code|https://github.com/apache/spark/blob/fc45c2c88a838b8f46659ebad2a8f3a9923bc95f/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L94]
that regulates the subscription to the role. I wonder whether just passing a comma-separated
list of roles (hence, splitting on that string) would already be sufficient to leverage this
capability.


> Subscribe to multiple roles in Mesos
> 
>
> Key: SPARK-22589
> URL: https://issues.apache.org/jira/browse/SPARK-22589
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Fabiano Francesconi
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Mesos offers the capability of [subscribing to multiple roles|http://mesos.apache.org/documentation/latest/roles/].
> I believe that Spark could easily be extended to opt in to this specific capability.
> From my understanding, this is the [Spark source code|https://github.com/apache/spark/blob/fc45c2c88a838b8f46659ebad2a8f3a9923bc95f/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L94]
> that regulates the subscription to the role. I wonder whether just passing a comma-separated
> list of roles (hence, splitting on that string) would already be sufficient to leverage this
> capability.
> Is there any side-effect that this change will cause?
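
To make the suggestion concrete, below is a minimal, purely illustrative sketch (not the actual Spark code) of how the comma-split could be wired into the framework registration. It assumes the Mesos protobufs in use expose the multi-role fields added in Mesos 1.2 (the repeated {{roles}} field and the MULTI_ROLE capability); the conf key handling and builder calls would need to be verified against the code linked above.

{code}
// Illustrative sketch only -- not the current Spark implementation.
// Assumes spark.mesos.role may now carry "roleA,roleB" and that the Mesos
// protobuf builder exposes addAllRoles/addCapabilities (Mesos 1.2+).
import scala.collection.JavaConverters._
import org.apache.mesos.Protos.FrameworkInfo

def applyRoles(fwInfoBuilder: FrameworkInfo.Builder,
               roleConf: Option[String]): FrameworkInfo.Builder = {
  roleConf.map(_.split(",").map(_.trim).filter(_.nonEmpty).toSeq) match {
    case Some(Seq(single)) =>
      fwInfoBuilder.setRole(single)          // current single-role behaviour
    case Some(roles) if roles.nonEmpty =>
      fwInfoBuilder
        .addAllRoles(roles.asJava)           // repeated `roles` field
        .addCapabilities(FrameworkInfo.Capability.newBuilder()
          .setType(FrameworkInfo.Capability.Type.MULTI_ROLE))
    case _ =>
      fwInfoBuilder                          // no role configured
  }
}
{code}

Whether a comma is an acceptable delimiter, and how multi-role subscription interacts with role-specific offers and quota, are presumably the side-effects to verify.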





[jira] [Created] (SPARK-22589) Subscribe to multiple roles in Mesos

2017-11-23 Thread Fabiano Francesconi (JIRA)
Fabiano Francesconi created SPARK-22589:
---

 Summary: Subscribe to multiple roles in Mesos
 Key: SPARK-22589
 URL: https://issues.apache.org/jira/browse/SPARK-22589
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Affects Versions: 2.2.0, 2.1.2
Reporter: Fabiano Francesconi


Mesos offers the capability of [subscribing to multiple roles|http://mesos.apache.org/documentation/latest/roles/].
I believe that Spark could easily be extended to opt in to this specific capability.

From my understanding, this is the [Spark source code|https://github.com/apache/spark/blob/fc45c2c88a838b8f46659ebad2a8f3a9923bc95f/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L94]
that regulates the subscription to the role. I wonder whether just passing a comma-separated
list of roles (hence, splitting on that string) would already be sufficient to leverage this
capability.






[jira] [Closed] (SPARK-15558) Deadlock when retrieving shuffled cached data

2016-05-26 Thread Fabiano Francesconi (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-15558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabiano Francesconi closed SPARK-15558.
---
   Resolution: Cannot Reproduce
Fix Version/s: 2.0.0

> Deadlock when retrieving shuffled cached data
> -
>
> Key: SPARK-15558
> URL: https://issues.apache.org/jira/browse/SPARK-15558
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Fabiano Francesconi
> Fix For: 2.0.0
>
> Attachments: screenshot-1.png
>
>
> Spark-1.6.1-bin-hadoop2.6 hangs when trying to retrieve shuffled cached 
> data from another host. The job I am currently executing is fetching data 
> using async actions and persisting these RDDs into main memory (they all 
> fit). Later on, the application hangs while retrieving this cached data. 
> Once the timeout set in the Await.result call is reached, the application 
> crashes.
> This problem is reproducible on every execution, although the point at 
> which it hangs is not.
> I have also tried activating:
> {code}
> spark.memory.useLegacyMode=true
> {code}
> as mentioned in SPARK-13566, guessing a deadlock similar to the one between 
> MemoryStore and BlockManager. Unfortunately, this didn't help.
> The only plausible (albeit debatable) solution would be to use speculation 
> mode.
> Configuration:
> {code}
> /usr/local/tl/spark-latest/bin/spark-submit \
>   --executor-memory 80G \
>   --total-executor-cores 90 \
>   --driver-memory 8G \
> {code}
> Stack trace:
> {code}
> "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-55" #293 
> daemon prio=5 os_prio=0 tid=0x7f99d4004000 nid=0x4e80 waiting on 
> condition [0x7f9946bfb000]
>java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f9b541a6570> (a 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:2135)
> at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2067)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-54" #292 
> daemon prio=5 os_prio=0 tid=0x7f99d4002000 nid=0x4e6d waiting on 
> condition [0x7f98c86b6000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f9b541a6570> (a 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
> at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "Executor task launch worker-43" #236 daemon prio=5 os_prio=0 
> tid=0x7f9950001800 nid=0x4acc waiting on condition [0x7f9a2c4be000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7fab3f081300> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
> at 

[jira] [Closed] (SPARK-15558) Deadlock when retrieving shuffled cached data

2016-05-26 Thread Fabiano Francesconi (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-15558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabiano Francesconi closed SPARK-15558.
---
Resolution: Fixed

Tried with Spark-2.0.0 compiled from the 2.x branch and this problem seems to 
be gone. Thank you!

> Deadlock when retrieving shuffled cached data
> -
>
> Key: SPARK-15558
> URL: https://issues.apache.org/jira/browse/SPARK-15558
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Fabiano Francesconi
> Attachments: screenshot-1.png
>
>
> Spark-1.6.1-bin-hadoop2.6 hangs when trying to retrieve shuffled cached 
> data from another host. The job I am currently executing is fetching data 
> using async actions and persisting these RDDs into main memory (they all 
> fit). Later on, the application hangs while retrieving this cached data. 
> Once the timeout set in the Await.result call is reached, the application 
> crashes.
> This problem is reproducible on every execution, although the point at 
> which it hangs is not.
> I have also tried activating:
> {code}
> spark.memory.useLegacyMode=true
> {code}
> as mentioned in SPARK-13566, guessing a deadlock similar to the one between 
> MemoryStore and BlockManager. Unfortunately, this didn't help.
> The only plausible (albeit debatable) solution would be to use speculation 
> mode.
> Configuration:
> {code}
> /usr/local/tl/spark-latest/bin/spark-submit \
>   --executor-memory 80G \
>   --total-executor-cores 90 \
>   --driver-memory 8G \
> {code}
> Stack trace:
> {code}
> "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-55" #293 
> daemon prio=5 os_prio=0 tid=0x7f99d4004000 nid=0x4e80 waiting on 
> condition [0x7f9946bfb000]
>java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f9b541a6570> (a 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:2135)
> at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2067)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-54" #292 
> daemon prio=5 os_prio=0 tid=0x7f99d4002000 nid=0x4e6d waiting on 
> condition [0x7f98c86b6000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f9b541a6570> (a 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
> at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "Executor task launch worker-43" #236 daemon prio=5 os_prio=0 
> tid=0x7f9950001800 nid=0x4acc waiting on condition [0x7f9a2c4be000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7fab3f081300> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
> at 

[jira] [Reopened] (SPARK-15558) Deadlock when retrieving shuffled cached data

2016-05-26 Thread Fabiano Francesconi (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-15558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabiano Francesconi reopened SPARK-15558:
-

> Deadlock when retrieving shuffled cached data
> -
>
> Key: SPARK-15558
> URL: https://issues.apache.org/jira/browse/SPARK-15558
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Fabiano Francesconi
> Attachments: screenshot-1.png
>
>
> Spark-1.6.1-bin-hadoop2.6 hangs when trying to retrieve shuffled cached 
> data from another host. The job I am currently executing is fetching data 
> using async actions and persisting these RDDs into main memory (they all 
> fit). Later on, the application hangs while retrieving this cached data. 
> Once the timeout set in the Await.result call is reached, the application 
> crashes.
> This problem is reproducible on every execution, although the point at 
> which it hangs is not.
> I have also tried activating:
> {code}
> spark.memory.useLegacyMode=true
> {code}
> as mentioned in SPARK-13566, guessing a deadlock similar to the one between 
> MemoryStore and BlockManager. Unfortunately, this didn't help.
> The only plausible (albeit debatable) solution would be to use speculation 
> mode.
> Configuration:
> {code}
> /usr/local/tl/spark-latest/bin/spark-submit \
>   --executor-memory 80G \
>   --total-executor-cores 90 \
>   --driver-memory 8G \
> {code}
> Stack trace:
> {code}
> "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-55" #293 
> daemon prio=5 os_prio=0 tid=0x7f99d4004000 nid=0x4e80 waiting on 
> condition [0x7f9946bfb000]
>java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f9b541a6570> (a 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:2135)
> at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2067)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-54" #292 
> daemon prio=5 os_prio=0 tid=0x7f99d4002000 nid=0x4e6d waiting on 
> condition [0x7f98c86b6000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f9b541a6570> (a 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
> at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "Executor task launch worker-43" #236 daemon prio=5 os_prio=0 
> tid=0x7f9950001800 nid=0x4acc waiting on condition [0x7f9a2c4be000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7fab3f081300> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
> at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
> at 

[jira] [Updated] (SPARK-15558) Deadlock when retrieving shuffled cached data

2016-05-26 Thread Fabiano Francesconi (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-15558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabiano Francesconi updated SPARK-15558:

Attachment: screenshot-1.png

> Deadlock when retrieving shuffled cached data
> -
>
> Key: SPARK-15558
> URL: https://issues.apache.org/jira/browse/SPARK-15558
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Fabiano Francesconi
> Attachments: screenshot-1.png
>
>
> Spark-1.6.1-bin-hadoop2.6 hangs when trying to retrieve shuffled cached 
> data from another host. The job I am currently executing is fetching data 
> using async actions and persisting these RDDs into main memory (they all 
> fit). Later on, the application hangs while retrieving this cached data. 
> Once the timeout set in the Await.result call is reached, the application 
> crashes.
> This problem is reproducible on every execution, although the point at 
> which it hangs is not.
> I have also tried activating:
> {code}
> spark.memory.useLegacyMode=true
> {code}
> as mentioned in SPARK-13566, guessing a deadlock similar to the one between 
> MemoryStore and BlockManager. Unfortunately, this didn't help.
> The only plausible (albeit debatable) solution would be to use speculation 
> mode.
> Configuration:
> {code}
> /usr/local/tl/spark-latest/bin/spark-submit \
>   --executor-memory 80G \
>   --total-executor-cores 90 \
>   --driver-memory 8G \
> {code}
> Stack trace:
> {code}
> "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-55" #293 
> daemon prio=5 os_prio=0 tid=0x7f99d4004000 nid=0x4e80 waiting on 
> condition [0x7f9946bfb000]
>java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f9b541a6570> (a 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:2135)
> at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2067)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-54" #292 
> daemon prio=5 os_prio=0 tid=0x7f99d4002000 nid=0x4e6d waiting on 
> condition [0x7f98c86b6000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f9b541a6570> (a 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
> at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "Executor task launch worker-43" #236 daemon prio=5 os_prio=0 
> tid=0x7f9950001800 nid=0x4acc waiting on condition [0x7f9a2c4be000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7fab3f081300> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
> at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
> at 

[jira] [Created] (SPARK-15558) Deadlock when retrieving shuffled cached data

2016-05-26 Thread Fabiano Francesconi (JIRA)
Fabiano Francesconi created SPARK-15558:
---

 Summary: Deadlock when retrieving shuffled cached data
 Key: SPARK-15558
 URL: https://issues.apache.org/jira/browse/SPARK-15558
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Fabiano Francesconi


Spark-1.6.1-bin-hadoop2.6 hangs when trying to retrieve shuffled cached data 
from another host. The job I am currently executing is fetching data using 
async actions and persisting these RDDs into main memory (they all fit). Later 
on, the application hangs while retrieving this cached data. Once the timeout 
set in the Await.result call is reached, the application crashes.

This problem is reproducible on every execution, although the point at which it 
hangs is not.

I have also tried activating:
{code}
spark.memory.useLegacyMode=true
{code}

as mentioned in SPARK-13566, guessing a deadlock similar to the one between 
MemoryStore and BlockManager. Unfortunately, this didn't help.

The only plausible (albeit debatable) solution would be to use speculation mode.

Configuration:
{code}
/usr/local/tl/spark-latest/bin/spark-submit \
  --executor-memory 80G \
  --total-executor-cores 90 \
  --driver-memory 8G \
{code}
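
For context, the async-action pattern described above looks roughly like the sketch below (spark-shell style with an assumed {{sc}}; the input path, key function and timeout are placeholders, not taken from the real job):

{code}
import scala.concurrent.Await
import scala.concurrent.duration._
import org.apache.spark.storage.StorageLevel

// Placeholder input and key function; the real job differs.
val cached = sc.textFile("hdfs:///some/input")
  .map(line => (line.length, 1))
  .persist(StorageLevel.MEMORY_ONLY)   // everything fits in memory

// Async action: the returned FutureAction is a scala.concurrent.Future.
val pending = cached.reduceByKey(_ + _).countAsync()

// The shuffle read over the cached partitions (remote fetches through
// BlockManager.doGetRemote, visible in the stack trace below) is where the
// hang appears; once this timeout expires, the job crashes.
val total = Await.result(pending, 30.minutes)
{code}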

Stack trace:
{code}
"sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-55" #293 daemon 
prio=5 os_prio=0 tid=0x7f99d4004000 nid=0x4e80 waiting on condition 
[0x7f9946bfb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x7f9b541a6570> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at 
scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:2135)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2067)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-54" #292 daemon 
prio=5 os_prio=0 tid=0x7f99d4002000 nid=0x4e6d waiting on condition 
[0x7f98c86b6000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x7f9b541a6570> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"Executor task launch worker-43" #236 daemon prio=5 os_prio=0 
tid=0x7f9950001800 nid=0x4acc waiting on condition [0x7f9a2c4be000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x7fab3f081300> (a 
scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 

[jira] [Commented] (SPARK-12583) spark shuffle fails with mesos after 2mins

2016-02-12 Thread Fabiano Francesconi (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144540#comment-15144540 ]

Fabiano Francesconi commented on SPARK-12583:
-

+1

Is there a follow-up on this issue? It looks like, starting from Spark 1.6.0, the 
external shuffle manager is not usable at all on Mesos.

> spark shuffle fails with mesos after 2mins
> --
>
> Key: SPARK-12583
> URL: https://issues.apache.org/jira/browse/SPARK-12583
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.0
>Reporter: Adrian Bridgett
>
> See user mailing list "Executor deregistered after 2mins" for more details.
> As of 1.6, the driver registers with each shuffle manager via 
> MesosExternalShuffleClient. Once this disconnects, the shuffle manager 
> automatically cleans up the data associated with that driver.
> However, the connection is terminated before this happens because it is idle. 
> Looking at a packet trace, after 120 seconds the shuffle manager sends a FIN 
> packet to the driver. The only way to delay this is to increase 
> spark.shuffle.io.connectionTimeout=3600s on the shuffle manager.
> I patched the MesosExternalShuffleClient (and ExternalShuffleClient) with 
> newbie Scala skills to call the TransportContext constructor with 
> closeIdleConnections set to "false", but this didn't help (I hadn't done the 
> network trace first).
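
For anyone hitting this, the workaround mentioned above would look roughly like the 
following on the shuffle-manager host (assuming the external shuffle service is started 
with a spark-defaults.conf that it reads; the exact mechanism depends on how it is 
launched):

{code}
# conf/spark-defaults.conf on the host running the external shuffle service.
# Workaround only: keep the otherwise-idle driver connection open for 1h so
# the shuffle data is not cleaned up prematurely.
spark.shuffle.io.connectionTimeout  3600s
{code}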


