[jira] [Commented] (SPARK-8119) HeartbeatReceiver should not adjust application executor resources

2016-02-16 Thread Zhen Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150055#comment-15150055
 ] 

Zhen Peng commented on SPARK-8119:
--

Hi [~srowen], I think this is a really serious bug. Is there any reason for 
not back-porting it to 1.4.x?

> HeartbeatReceiver should not adjust application executor resources
> --
>
> Key: SPARK-8119
> URL: https://issues.apache.org/jira/browse/SPARK-8119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: SaintBacchus
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.5.0
>
>
> Dynamic allocation lowers the target total number of executors when it 
> wants to kill some of them.
> But Spark also adjusts this total even when dynamic allocation is disabled.
> This causes the following problem: when an executor dies, Spark never 
> brings up a replacement for it.
> === EDIT by andrewor14 ===
> The issue is that the AM forgets about the original number of executors it 
> wants after calling sc.killExecutor. Even if dynamic allocation is not 
> enabled, this is still possible because of heartbeat timeouts.
> I think the problem is that sc.killExecutor is used incorrectly in 
> HeartbeatReceiver. The intention of the method is to permanently adjust the 
> number of executors the application will get. In HeartbeatReceiver, however, 
> this is used as a best-effort mechanism to ensure that the timed out executor 
> is dead.
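
For illustration, a minimal standalone sketch (hypothetical types, not 
Spark's actual classes) of the two semantics being conflated here: 
permanently lowering the executor target versus killing an executor while 
keeping the target, so the cluster manager refills the slot.

// Hypothetical model, not Spark's actual classes.
object KillSemantics {
  final case class ClusterState(targetTotal: Int, live: Set[String])

  // sc.killExecutor semantics: the target shrinks, so the lost slot is
  // never refilled -- a permanent adjustment to the application's resources.
  def killAndShrinkTarget(s: ClusterState, id: String): ClusterState =
    s.copy(targetTotal = s.targetTotal - 1, live = s.live - id)

  // What HeartbeatReceiver actually needs: remove the timed-out executor
  // but keep the original target, so a replacement gets launched.
  def killAndReplace(s: ClusterState, id: String): ClusterState =
    s.copy(live = s.live - id)

  def main(args: Array[String]): Unit = {
    val s = ClusterState(targetTotal = 3, live = Set("1", "2", "3"))
    println(killAndShrinkTarget(s, "2")) // targetTotal = 2: slot lost for good
    println(killAndReplace(s, "2"))      // targetTotal = 3: slot can be refilled
  }
}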



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2015-08-31 Thread Zhen Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724609#comment-14724609
 ] 

Zhen Peng commented on SPARK-6235:
--

Hi [~rxin], is there any update on this issue?

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> An umbrella ticket to track the various 2G limits we have in Spark, due to 
> the use of byte arrays and ByteBuffers.
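
The root cause is that JVM arrays and ByteBuffers are indexed by Int, so any 
single buffer is capped near Integer.MAX_VALUE (~2G) bytes. A minimal sketch 
of the chunking workaround this umbrella points toward (hypothetical class, 
not Spark's internal implementation):

import java.nio.ByteBuffer

// Chunk a logical block across several ByteBuffers so that no single
// buffer ever needs more than Int.MaxValue bytes.
final class ChunkedBytes(val chunkSize: Int, val totalSize: Long) {
  require(chunkSize > 0)
  private val numChunks = ((totalSize + chunkSize - 1) / chunkSize).toInt
  private val chunks: Array[ByteBuffer] =
    Array.tabulate(numChunks) { i =>
      val size = math.min(chunkSize.toLong, totalSize - i.toLong * chunkSize).toInt
      ByteBuffer.allocate(size)
    }

  // Address a byte by a Long offset, translating to (chunk index, Int offset).
  def put(offset: Long, b: Byte): Unit =
    chunks((offset / chunkSize).toInt).put((offset % chunkSize).toInt, b)

  def get(offset: Long): Byte =
    chunks((offset / chunkSize).toInt).get((offset % chunkSize).toInt)
}

object ChunkedBytesDemo extends App {
  // Tiny sizes for the demo; the same Long indexing works past 2G offsets.
  val buf = new ChunkedBytes(chunkSize = 4, totalSize = 10L)
  buf.put(9L, 42)
  assert(buf.get(9L) == 42)
}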



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-938) OpenStack Swift Storage Support

2014-06-02 Thread Zhen Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016231#comment-14016231
 ] 

Zhen Peng commented on SPARK-938:
-

Hi, is there any follow-up on this issue?


 OpenStack Swift Storage Support
 ---

 Key: SPARK-938
 URL: https://issues.apache.org/jira/browse/SPARK-938
 Project: Spark
  Issue Type: New Feature
  Components: Documentation, Examples, Input/Output, Spark Core
Affects Versions: 0.8.1
Reporter: Murali Raju
Priority: Minor

 This issue tracks OpenStack Swift storage support for Spark (development in 
 progress), in addition to the existing S3 support.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1928) DAGScheduler suspended by local task OOM

2014-05-27 Thread Zhen Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009477#comment-14009477
 ] 

Zhen Peng commented on SPARK-1928:
--

https://github.com/apache/spark/pull/883 

 DAGScheduler suspended by local task OOM
 

 Key: SPARK-1928
 URL: https://issues.apache.org/jira/browse/SPARK-1928
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Zhen Peng
 Fix For: 1.0.0


 DAGScheduler does not handle local task OOM properly, and will wait for the 
 job result forever.
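
A hedged sketch of the failure mode (hypothetical runner, not Spark's actual 
Executor code): an error handler that only catches non-fatal exceptions lets 
OutOfMemoryError escape, so no task-failure message ever reaches the 
DAGScheduler, which then waits for the job result forever.

object LocalTaskRunner {
  sealed trait TaskOutcome
  final case class TaskSucceeded(value: Any) extends TaskOutcome
  final case class TaskFailed(cause: Throwable) extends TaskOutcome

  def runLocalTask(body: () => Any)(report: TaskOutcome => Unit): Unit =
    try report(TaskSucceeded(body()))
    catch {
      // Buggy version: `case NonFatal(e) => report(TaskFailed(e))` -- an
      // OutOfMemoryError is fatal, skips that case, and the scheduler never
      // hears back. Matching Throwable closes the gap:
      case t: Throwable => report(TaskFailed(t))
    }
}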



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1928) DAGScheduler suspended by local task OOM

2014-05-27 Thread Zhen Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009532#comment-14009532
 ] 

Zhen Peng commented on SPARK-1928:
--

[~gq] I met this case in our local-mode Spark Streaming application. 
In the unit tests, I have added a test case to simulate it.
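
For reference, a sketch of such a reproduction (simplified and with assumed 
names, not the exact test from the PR; it presumes a Spark dependency on the 
classpath):

import org.apache.spark.{SparkConf, SparkContext}

object LocalOomRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local").setAppName("oom-repro"))
    try {
      // Before the fix this job hangs forever; after it, it fails promptly.
      sc.parallelize(1 to 2)
        .map { _ => throw new OutOfMemoryError("simulated"); () }
        .collect()
      println("BUG: job should not have succeeded")
    } catch {
      case t: Throwable => println(s"job failed as expected: $t")
    } finally {
      sc.stop()
    }
  }
}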

 DAGScheduler suspended by local task OOM
 

 Key: SPARK-1928
 URL: https://issues.apache.org/jira/browse/SPARK-1928
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Zhen Peng
 Fix For: 1.0.0


 DAGScheduler does not handle local task OOM properly, and will wait for the 
 job result forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1929) DAGScheduler suspended by local task OOM

2014-05-26 Thread Zhen Peng (JIRA)
Zhen Peng created SPARK-1929:


 Summary: DAGScheduler suspended by local task OOM
 Key: SPARK-1929
 URL: https://issues.apache.org/jira/browse/SPARK-1929
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Zhen Peng
 Fix For: 1.0.0


DAGScheduler does not handle local task OOM properly, and will wait for the job 
result forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1901) Standalone worker updates executor's state ahead of executor process exit

2014-05-22 Thread Zhen Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005749#comment-14005749
 ] 

Zhen Peng commented on SPARK-1901:
--

https://github.com/apache/spark/pull/854

 Standalone worker updates executor's state ahead of executor process exit
 ---

 Key: SPARK-1901
 URL: https://issues.apache.org/jira/browse/SPARK-1901
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 0.9.0
 Environment: spark-1.0 rc10
Reporter: Zhen Peng
 Fix For: 1.0.0


 The standalone worker updates an executor's state prematurely, leaving the 
 resource accounting in an inconsistent state until the executor process has 
 really died.
 In our cluster, we found this situation may cause newly submitted 
 applications to be removed by the Master because launching their executors 
 fails.
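
A minimal sketch of the ordering fix (hypothetical code, not Spark's actual 
Worker): report the executor's terminal state only after the OS process has 
really exited, so the Master's resource accounting never runs ahead of 
reality.

object ExecutorKill {
  // `report` would carry the terminal state (e.g. KILLED + exit code)
  // back to the Master in the real worker.
  def killAndReport(process: Process, report: Int => Unit): Unit = {
    process.destroy()
    // The premature version reported KILLED here, while the process could
    // still be holding its cores and memory.
    val exitCode = process.waitFor() // block until the process is really gone
    report(exitCode)                 // only now tell the Master
  }
}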



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1886) Workers keep dying from an uncaught exception when an executor id is not found

2014-05-19 Thread Zhen Peng (JIRA)
Zhen Peng created SPARK-1886:


 Summary: Workers keep dying from an uncaught exception when an executor 
id is not found 
 Key: SPARK-1886
 URL: https://issues.apache.org/jira/browse/SPARK-1886
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 0.9.0
 Environment: spark-1.0-rc8
Reporter: Zhen Peng
 Fix For: 1.0.0


14/05/19 15:43:30 ERROR OneForOneStrategy: key not found: app-20140519154218-0132/6
java.util.NoSuchElementException: key not found: app-20140519154218-0132/6
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:266)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
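
The trace shows Worker.scala:266 indexing a mutable.HashMap with apply(), 
which throws NoSuchElementException for an unknown executor id and takes 
down the whole actor. A sketch of the defensive direction (identifiers here 
are illustrative, not the actual patch):

import scala.collection.mutable

object WorkerPatch {
  // fullId -> some executor bookkeeping record (a String stands in here).
  val executors = mutable.HashMap.empty[String, String]

  def onExecutorStateChanged(fullId: String): Unit =
    // executors(fullId) would throw on a missing key; get() tolerates it.
    executors.get(fullId) match {
      case Some(exec) => println(s"updating state of $exec")
      case None       => println(s"ignoring state change for unknown executor $fullId")
    }
}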



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1886) Workers keep dying from an uncaught exception when an executor id is not found

2014-05-19 Thread Zhen Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002688#comment-14002688
 ] 

Zhen Peng commented on SPARK-1886:
--

https://github.com/apache/spark/pull/827

 Workers keep dying from an uncaught exception when an executor id is not found 
 ---

 Key: SPARK-1886
 URL: https://issues.apache.org/jira/browse/SPARK-1886
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 0.9.0
 Environment: spark-1.0-rc8
Reporter: Zhen Peng
 Fix For: 1.0.0


 14/05/19 15:43:30 ERROR OneForOneStrategy: key not found: app-20140519154218-0132/6
 java.util.NoSuchElementException: key not found: app-20140519154218-0132/6
     at scala.collection.MapLike$class.default(MapLike.scala:228)
     at scala.collection.AbstractMap.default(Map.scala:58)
     at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
     at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:266)
     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



--
This message was sent by Atlassian JIRA
(v6.2#6252)