[jira] [Updated] (SPARK-22589) Subscribe to multiple roles in Mesos
[ https://issues.apache.org/jira/browse/SPARK-22589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabiano Francesconi updated SPARK-22589:
----------------------------------------
Description:
Mesos offers the capability of [subscribing to multiple roles|http://mesos.apache.org/documentation/latest/roles/]. I believe that Spark could easily be extended to opt in to this specific capability.

From my understanding, this is the [Spark source code|https://github.com/apache/spark/blob/fc45c2c88a838b8f46659ebad2a8f3a9923bc95f/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L94] that regulates the subscription to the role. I wonder whether just passing a comma-separated list of roles (hence, splitting on that string) would already be sufficient to leverage this capability.

Is there any side-effect that this change would cause?

was:
Mesos offers the capability of [subscribing to multiple roles|http://mesos.apache.org/documentation/latest/roles/]. I believe that Spark could easily be extended to opt in to this specific capability.

From my understanding, this is the [Spark source code|https://github.com/apache/spark/blob/fc45c2c88a838b8f46659ebad2a8f3a9923bc95f/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L94] that regulates the subscription to the role. I wonder whether just passing a comma-separated list of roles (hence, splitting on that string) would already be sufficient to leverage this capability.

> Subscribe to multiple roles in Mesos
> ------------------------------------
>
>                 Key: SPARK-22589
>                 URL: https://issues.apache.org/jira/browse/SPARK-22589
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core
>    Affects Versions: 2.1.2, 2.2.0
>            Reporter: Fabiano Francesconi
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
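The change proposed in the ticket — interpreting the configured role as a comma-separated list and splitting on it — could be sketched as below. This is a hypothetical helper, not Spark's actual patch; the name `parseRoles`, the trimming of whitespace, and the dropping of empty entries are assumptions:

```scala
object RoleParsing {
  // Hypothetical helper (not Spark's actual code): interpret the configured
  // role string as a comma-separated list, trimming whitespace and dropping
  // empty entries, so the existing single-role configuration keeps working
  // unchanged.
  def parseRoles(roleConf: Option[String]): Seq[String] =
    roleConf.toSeq.flatMap(_.split(',')).map(_.trim).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    assert(parseRoles(Some("dev")) == Seq("dev"))            // single role, as today
    assert(parseRoles(Some("dev, prod")) == Seq("dev", "prod")) // new multi-role case
    assert(parseRoles(None).isEmpty)                          // no role configured
    println("ok")
  }
}
```

As the reporter notes, whether Mesos-side accounting has side effects when a framework subscribes under several roles is a separate question that splitting alone does not answer.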
[jira] [Created] (SPARK-22589) Subscribe to multiple roles in Mesos
Fabiano Francesconi created SPARK-22589:
----------------------------------------

             Summary: Subscribe to multiple roles in Mesos
                 Key: SPARK-22589
                 URL: https://issues.apache.org/jira/browse/SPARK-22589
             Project: Spark
          Issue Type: Wish
          Components: Spark Core
    Affects Versions: 2.2.0, 2.1.2
            Reporter: Fabiano Francesconi

Mesos offers the capability of [subscribing to multiple roles|http://mesos.apache.org/documentation/latest/roles/]. I believe that Spark could easily be extended to opt in to this specific capability.

From my understanding, this is the [Spark source code|https://github.com/apache/spark/blob/fc45c2c88a838b8f46659ebad2a8f3a9923bc95f/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L94] that regulates the subscription to the role. I wonder whether just passing a comma-separated list of roles (hence, splitting on that string) would already be sufficient to leverage this capability.
[jira] [Closed] (SPARK-15558) Deadlock when retrieving shuffled cached data
[ https://issues.apache.org/jira/browse/SPARK-15558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabiano Francesconi closed SPARK-15558.
---------------------------------------
    Resolution: Cannot Reproduce
 Fix Version/s: 2.0.0

> Deadlock when retrieving shuffled cached data
> ---------------------------------------------
>
>                 Key: SPARK-15558
>                 URL: https://issues.apache.org/jira/browse/SPARK-15558
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Fabiano Francesconi
>             Fix For: 2.0.0
>
>         Attachments: screenshot-1.png
>
> Spark-1.6.1-bin-hadoop2.6 hangs when trying to retrieve shuffled cached data from another host. The job I am currently executing fetches data using async actions and persists these RDDs in main memory (they all fit). Later on, at the point at which it is currently hanging, the application retrieves this cached data and hangs. Once the timeout set in the Await.result call is reached, the application crashes.
> This problem is reproducible on every execution, although the point at which it hangs is not.
> I have also tried activating:
> {code}
> spark.memory.useLegacyMode=true
> {code}
> as mentioned in SPARK-13566, suspecting a deadlock similar to the one between MemoryStore and BlockManager. Unfortunately, this didn't help.
> The only plausible (albeit debatable) workaround would be to use speculation mode.
> Configuration:
> {code}
> /usr/local/tl/spark-latest/bin/spark-submit \
>   --executor-memory 80G \
>   --total-executor-cores 90 \
>   --driver-memory 8G \
> {code}
> Stack trace:
> {code}
> "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-55" #293 daemon prio=5 os_prio=0 tid=0x7f99d4004000 nid=0x4e80 waiting on condition [0x7f9946bfb000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x7f9b541a6570> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
>         at scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:2135)
>         at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2067)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-54" #292 daemon prio=5 os_prio=0 tid=0x7f99d4002000 nid=0x4e6d waiting on condition [0x7f98c86b6000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x7f9b541a6570> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
>         at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "Executor task launch worker-43" #236 daemon prio=5 os_prio=0 tid=0x7f9950001800 nid=0x4acc waiting on condition [0x7f9a2c4be000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x7fab3f081300> (a scala.concurrent.impl.Promise$CompletionLatch)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>         at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:107)
>         at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
>         at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
>         at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
>         at
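The crash mode described in the report — the task blocks in Await.result inside BlockTransferService.fetchBlockSync and the application dies once the configured timeout elapses — can be reproduced in miniature with plain Scala, without Spark. In this sketch a promise that is never completed stands in for the deadlocked remote block fetch (the function name and 100 ms timeout are illustrative):

```scala
import scala.concurrent.{Await, Promise, TimeoutException}
import scala.concurrent.duration._

object AwaitTimeoutDemo {
  // A promise that is never completed models a remote fetch that has
  // deadlocked: Await.result blocks until the timeout and then throws
  // TimeoutException, which is what takes the whole application down
  // once the configured wait is exceeded.
  def fetchOrTimeout(timeout: FiniteDuration = 100.millis): String =
    try {
      Await.result(Promise[Int]().future, timeout)
      "fetched"
    } catch {
      case _: TimeoutException => "fetch timed out"
    }

  def main(args: Array[String]): Unit =
    println(fetchOrTimeout())
}
```

This also shows why the hang moves around between runs: any task that happens to be the first to block on the never-completed future is the one whose timeout fires.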
[jira] [Closed] (SPARK-15558) Deadlock when retrieving shuffled cached data
[ https://issues.apache.org/jira/browse/SPARK-15558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabiano Francesconi closed SPARK-15558.
---------------------------------------
    Resolution: Fixed

Tried with Spark-2.0.0 compiled from the 2.x branch and this problem seems to be gone. Thank you!

> Deadlock when retrieving shuffled cached data
> ---------------------------------------------
>
>                 Key: SPARK-15558
>                 URL: https://issues.apache.org/jira/browse/SPARK-15558
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Fabiano Francesconi
>         Attachments: screenshot-1.png
>
[jira] [Reopened] (SPARK-15558) Deadlock when retrieving shuffled cached data
[ https://issues.apache.org/jira/browse/SPARK-15558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabiano Francesconi reopened SPARK-15558:
-----------------------------------------

> Deadlock when retrieving shuffled cached data
> ---------------------------------------------
>
>                 Key: SPARK-15558
>                 URL: https://issues.apache.org/jira/browse/SPARK-15558
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Fabiano Francesconi
>         Attachments: screenshot-1.png
>
[jira] [Updated] (SPARK-15558) Deadlock when retrieving shuffled cached data
[ https://issues.apache.org/jira/browse/SPARK-15558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabiano Francesconi updated SPARK-15558:
----------------------------------------
    Attachment: screenshot-1.png

> Deadlock when retrieving shuffled cached data
> ---------------------------------------------
>
>                 Key: SPARK-15558
>                 URL: https://issues.apache.org/jira/browse/SPARK-15558
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Fabiano Francesconi
>         Attachments: screenshot-1.png
>
[jira] [Created] (SPARK-15558) Deadlock when retrieving shuffled cached data
Fabiano Francesconi created SPARK-15558:
----------------------------------------

             Summary: Deadlock when retrieving shuffled cached data
                 Key: SPARK-15558
                 URL: https://issues.apache.org/jira/browse/SPARK-15558
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.6.1
            Reporter: Fabiano Francesconi

Spark-1.6.1-bin-hadoop2.6 hangs when trying to retrieve shuffled cached data from another host. The job I am currently executing fetches data using async actions and persists these RDDs in main memory (they all fit). Later on, at the point at which it is currently hanging, the application retrieves this cached data and hangs. Once the timeout set in the Await.result call is reached, the application crashes.

This problem is reproducible on every execution, although the point at which it hangs is not.

I have also tried activating:
{code}
spark.memory.useLegacyMode=true
{code}
as mentioned in SPARK-13566, suspecting a deadlock similar to the one between MemoryStore and BlockManager. Unfortunately, this didn't help.

The only plausible (albeit debatable) workaround would be to use speculation mode.

Configuration:
{code}
/usr/local/tl/spark-latest/bin/spark-submit \
  --executor-memory 80G \
  --total-executor-cores 90 \
  --driver-memory 8G \
{code}

Stack trace:
{code}
"sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-55" #293 daemon prio=5 os_prio=0 tid=0x7f99d4004000 nid=0x4e80 waiting on condition [0x7f9946bfb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x7f9b541a6570> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
        at scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:2135)
        at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2067)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
"sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-54" #292 daemon prio=5 os_prio=0 tid=0x7f99d4002000 nid=0x4e6d waiting on condition [0x7f98c86b6000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x7f9b541a6570> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
        at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
"Executor task launch worker-43" #236 daemon prio=5 os_prio=0 tid=0x7f9950001800 nid=0x4acc waiting on condition [0x7f9a2c4be000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x7fab3f081300> (a scala.concurrent.impl.Promise$CompletionLatch)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
        at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
        at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
        at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at
[jira] [Commented] (SPARK-12583) spark shuffle fails with mesos after 2mins
[ https://issues.apache.org/jira/browse/SPARK-12583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144540#comment-15144540 ]

Fabiano Francesconi commented on SPARK-12583:
---------------------------------------------

+1 Is there a follow-up on this issue? It looks like, starting from Spark 1.6.0, the external shuffle manager is not usable at all on Mesos.

> spark shuffle fails with mesos after 2mins
> ------------------------------------------
>
>                 Key: SPARK-12583
>                 URL: https://issues.apache.org/jira/browse/SPARK-12583
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.6.0
>            Reporter: Adrian Bridgett
>
> See user mailing list "Executor deregistered after 2mins" for more details.
> As of 1.6, the driver registers with each shuffle manager via MesosExternalShuffleClient. Once this disconnects, the shuffle manager automatically cleans up the data associated with that driver.
> However, the connection is terminated before this happens, as it is idle. Looking at a packet trace, after 120 secs the shuffle manager sends a FIN packet to the driver. The only way to delay this is to increase spark.shuffle.io.connectionTimeout=3600s on the shuffle manager.
> I patched the MesosExternalShuffleClient (and ExternalShuffleClient) with newbie Scala skills to call TransportContext with closeIdleConnections set to "false", and this didn't help (I hadn't done the network trace first).
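The only workaround named in the report is raising spark.shuffle.io.connectionTimeout on the shuffle-manager side. A rough sketch of applying it follows; the config key and the 3600s value come from the report, while the launch script and the use of spark-defaults.conf are assumptions that depend on how the external shuffle service is deployed:

```shell
# Illustrative only: configure the external shuffle service's
# idle-connection timeout before starting it, so the driver's
# registration connection survives longer than the ~120s default.
# Key and value are the ones cited in the ticket; where you set them
# depends on your deployment.
echo "spark.shuffle.io.connectionTimeout=3600s" >> "$SPARK_HOME/conf/spark-defaults.conf"
"$SPARK_HOME/sbin/start-mesos-shuffle-service.sh"
```

Note this only delays the disconnect; it does not address the underlying issue that the shuffle service cleans up a still-live driver's data when the idle connection is closed.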