[jira] [Commented] (SPARK-24818) Ensure all the barrier tasks in the same stage are launched together

2019-02-20 Thread luzengxiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773729#comment-16773729
 ] 

luzengxiang commented on SPARK-24818:
-

"cannot fulfill task locality requirements" keeps happening! 

> Ensure all the barrier tasks in the same stage are launched together
> 
>
> Key: SPARK-24818
> URL: https://issues.apache.org/jira/browse/SPARK-24818
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> When some executors/hosts are blacklisted, it may happen that only some of the 
> tasks in the same barrier stage can be launched. We should detect that case 
> and revert the allocated resource offers.






[jira] [Resolved] (SPARK-26886) Proper termination of external processes launched by the worker

2019-02-18 Thread luzengxiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luzengxiang resolved SPARK-26886.
-
Resolution: Won't Do

> Proper termination of external processes launched by the worker
> ---
>
> Key: SPARK-26886
> URL: https://issues.apache.org/jira/browse/SPARK-26886
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: luzengxiang
>Priority: Minor
>
> When embedding a deep learning framework in Spark, the Spark worker has to 
> launch an external process (e.g. an MPI task) in some cases.
> {quote}val nothing = inputData.barrier().mapPartitions { _ =>
>   val barrierTask = BarrierTaskContext.get()
>   // save data to disk
>   barrierTask.barrier()
>   // launch external process, e.g. MPI task + TensorFlow
> }
> {quote}
> The problem is that the external process keeps running when the Spark task is 
> killed manually. This JIRA is the place to discuss properly terminating 
> external processes launched by the Spark worker when a Spark task is killed or 
> interrupted.






[jira] [Updated] (SPARK-26886) Proper termination of external processes launched by the worker

2019-02-17 Thread luzengxiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luzengxiang updated SPARK-26886:

Issue Type: Story  (was: New JIRA Project)

> Proper termination of external processes launched by the worker
> ---
>
> Key: SPARK-26886
> URL: https://issues.apache.org/jira/browse/SPARK-26886
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: luzengxiang
>Priority: Minor
>
> When embedding a deep learning framework in Spark, the Spark worker has to 
> launch an external process (e.g. an MPI task) in some cases.
> {quote}val nothing = inputData.barrier().mapPartitions { _ =>
>   val barrierTask = BarrierTaskContext.get()
>   // save data to disk
>   barrierTask.barrier()
>   // launch external process, e.g. MPI task + TensorFlow
> }
> {quote}
> The problem is that the external process keeps running when the Spark task is 
> killed manually. This JIRA is the place to discuss properly terminating 
> external processes launched by the Spark worker when a Spark task is killed or 
> interrupted.






[jira] [Commented] (SPARK-26886) Proper termination of external processes launched by the worker

2019-02-15 Thread luzengxiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769165#comment-16769165
 ] 

luzengxiang commented on SPARK-26886:
-

[~mengxr] Let's discuss it here.

> Proper termination of external processes launched by the worker
> ---
>
> Key: SPARK-26886
> URL: https://issues.apache.org/jira/browse/SPARK-26886
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: luzengxiang
>Priority: Minor
>
> When embedding a deep learning framework in Spark, the Spark worker has to 
> launch an external process (e.g. an MPI task) in some cases.
> {quote}val nothing = inputData.barrier().mapPartitions { _ =>
>   val barrierTask = BarrierTaskContext.get()
>   // save data to disk
>   barrierTask.barrier()
>   // launch external process, e.g. MPI task + TensorFlow
> }
> {quote}
> The problem is that the external process keeps running when the Spark task is 
> killed manually. This JIRA is the place to discuss properly terminating 
> external processes launched by the Spark worker when a Spark task is killed or 
> interrupted.






[jira] [Updated] (SPARK-26886) Proper termination of external processes launched by the worker

2019-02-14 Thread luzengxiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luzengxiang updated SPARK-26886:

Description: 
When embedding a deep learning framework in Spark, the Spark worker has to 
launch an external process (e.g. an MPI task) in some cases.
{quote}val nothing = inputData.barrier().mapPartitions { _ =>
  val barrierTask = BarrierTaskContext.get()
  // save data to disk
  barrierTask.barrier()
  // launch external process, e.g. MPI task + TensorFlow
}
{quote}
The problem is that the external process keeps running when the Spark task is 
killed manually. This JIRA is the place to discuss properly terminating 
external processes launched by the Spark worker when a Spark task is killed or 
interrupted.


  was:
When embedding a deep learning framework in Spark, the Spark worker has to 
launch an external process (e.g. an MPI task) in some cases.

{quote}val nothing = inputData.barrier().mapPartitions { _ =>
  val barrierTask = BarrierTaskContext.get()
  // save data to disk
  barrierTask.barrier()
  // launch external process, e.g. MPI task + TensorFlow
}
{quote}
This JIRA discusses properly terminating external processes launched by the 
Spark worker when a Spark task is killed or interrupted.


> Proper termination of external processes launched by the worker
> ---
>
> Key: SPARK-26886
> URL: https://issues.apache.org/jira/browse/SPARK-26886
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: luzengxiang
>Priority: Minor
>
> When embedding a deep learning framework in Spark, the Spark worker has to 
> launch an external process (e.g. an MPI task) in some cases.
> {quote}val nothing = inputData.barrier().mapPartitions { _ =>
>   val barrierTask = BarrierTaskContext.get()
>   // save data to disk
>   barrierTask.barrier()
>   // launch external process, e.g. MPI task + TensorFlow
> }
> {quote}
> The problem is that the external process keeps running when the Spark task is 
> killed manually. This JIRA is the place to discuss properly terminating 
> external processes launched by the Spark worker when a Spark task is killed or 
> interrupted.






[jira] [Updated] (SPARK-26886) Proper termination of external processes launched by the worker

2019-02-14 Thread luzengxiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luzengxiang updated SPARK-26886:

External issue URL:   (was: 
https://issues.apache.org/jira/browse/SPARK-24374)

> Proper termination of external processes launched by the worker
> ---
>
> Key: SPARK-26886
> URL: https://issues.apache.org/jira/browse/SPARK-26886
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: luzengxiang
>Priority: Minor
>
> When embedding a deep learning framework in Spark, the Spark worker has to 
> launch an external process (e.g. an MPI task) in some cases.
> {quote}val nothing = inputData.barrier().mapPartitions { _ =>
>   val barrierTask = BarrierTaskContext.get()
>   // save data to disk
>   barrierTask.barrier()
>   // launch external process, e.g. MPI task + TensorFlow
> }
> {quote}
> This JIRA discusses properly terminating external processes launched by the 
> Spark worker when a Spark task is killed or interrupted.






[jira] [Updated] (SPARK-26886) Proper termination of external processes launched by the worker

2019-02-14 Thread luzengxiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luzengxiang updated SPARK-26886:

External issue URL: https://issues.apache.org/jira/browse/SPARK-24374

> Proper termination of external processes launched by the worker
> ---
>
> Key: SPARK-26886
> URL: https://issues.apache.org/jira/browse/SPARK-26886
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: luzengxiang
>Priority: Minor
>
> When embedding a deep learning framework in Spark, the Spark worker has to 
> launch an external process (e.g. an MPI task) in some cases.
> {quote}val nothing = inputData.barrier().mapPartitions { _ =>
>   val barrierTask = BarrierTaskContext.get()
>   // save data to disk
>   barrierTask.barrier()
>   // launch external process, e.g. MPI task + TensorFlow
> }
> {quote}
> This JIRA discusses properly terminating external processes launched by the 
> Spark worker when a Spark task is killed or interrupted.






[jira] [Created] (SPARK-26886) Proper termination of external processes launched by the worker

2019-02-14 Thread luzengxiang (JIRA)
luzengxiang created SPARK-26886:
---

 Summary: Proper termination of external processes launched by the 
worker
 Key: SPARK-26886
 URL: https://issues.apache.org/jira/browse/SPARK-26886
 Project: Spark
  Issue Type: New JIRA Project
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: luzengxiang


When embedding a deep learning framework in Spark, the Spark worker has to 
launch an external process (e.g. an MPI task) in some cases.

{quote}
val nothing = inputData.barrier().mapPartitions { _ =>
  val barrierTask = BarrierTaskContext.get()
  // save data to disk
  barrierTask.barrier()
  // launch external process, e.g. MPI task + TensorFlow
}
{quote}

This JIRA discusses properly terminating external processes launched by the 
Spark worker when a Spark task is killed or interrupted.
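
One possible direction, sketched here as an illustration rather than a committed 
design: start the external process inside the barrier task and register a task 
completion listener that destroys it, so a killed or interrupted task also takes 
its child process down. The mpirun command line, script name, and process count 
below are assumptions, and a hard-killed executor JVM is not covered.

{code:scala}
import org.apache.spark.BarrierTaskContext

// Sketch: tie the external launcher's lifetime to the Spark task by killing it
// from a task completion listener (fires on task success, failure, and kill).
val exitCodes = inputData.barrier().mapPartitions { _ =>
  val barrierTask = BarrierTaskContext.get()
  // save data to disk (omitted), then wait for all tasks
  barrierTask.barrier()

  // launch the external process, e.g. an MPI launcher driving TensorFlow
  val proc = new ProcessBuilder("mpirun", "-np", "4", "python", "train.py")
    .inheritIO()
    .start()

  barrierTask.addTaskCompletionListener[Unit] { _ =>
    if (proc.isAlive) proc.destroyForcibly()
  }

  Iterator.single(proc.waitFor())
}
// exitCodes.count()  // an action is still needed to actually run the stage
{code}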






[jira] [Updated] (SPARK-26886) Proper termination of external processes launched by the worker

2019-02-14 Thread luzengxiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luzengxiang updated SPARK-26886:

Description: 
When embedding a deep learning framework in Spark, the Spark worker has to 
launch an external process (e.g. an MPI task) in some cases.

{quote}val nothing = inputData.barrier().mapPartitions { _ =>
  val barrierTask = BarrierTaskContext.get()
  // save data to disk
  barrierTask.barrier()
  // launch external process, e.g. MPI task + TensorFlow
}
{quote}
This JIRA discusses properly terminating external processes launched by the 
Spark worker when a Spark task is killed or interrupted.

  was:
When embedding a deep learning framework in Spark, the Spark worker has to 
launch an external process (e.g. an MPI task) in some cases.

val nothing = inputData.barrier().mapPartitions { _ =>
  val barrierTask = BarrierTaskContext.get()
  // save data to disk
  barrierTask.barrier()
  // launch external process, e.g. MPI task + TensorFlow
}

This JIRA discusses properly terminating external processes launched by the 
Spark worker when a Spark task is killed or interrupted.


> Proper termination of external processes launched by the worker
> ---
>
> Key: SPARK-26886
> URL: https://issues.apache.org/jira/browse/SPARK-26886
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: luzengxiang
>Priority: Minor
>
> When embedding a deep learning framework in Spark, the Spark worker has to 
> launch an external process (e.g. an MPI task) in some cases.
> {quote}val nothing = inputData.barrier().mapPartitions { _ =>
>   val barrierTask = BarrierTaskContext.get()
>   // save data to disk
>   barrierTask.barrier()
>   // launch external process, e.g. MPI task + TensorFlow
> }
> {quote}
> This JIRA discusses properly terminating external processes launched by the 
> Spark worker when a Spark task is killed or interrupted.






[jira] [Updated] (SPARK-26886) Proper termination of external processes launched by the worker

2019-02-14 Thread luzengxiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luzengxiang updated SPARK-26886:

Description: 
When embedding a deep learning framework in Spark, the Spark worker has to 
launch an external process (e.g. an MPI task) in some cases.

val nothing = inputData.barrier().mapPartitions { _ =>
  val barrierTask = BarrierTaskContext.get()
  // save data to disk
  barrierTask.barrier()
  // launch external process, e.g. MPI task + TensorFlow
}

This JIRA discusses properly terminating external processes launched by the 
Spark worker when a Spark task is killed or interrupted.

  was:
When embedding a deep learning framework in Spark, the Spark worker has to 
launch an external process (e.g. an MPI task) in some cases.

{quote}
val nothing = inputData.barrier().mapPartitions { _ =>
  val barrierTask = BarrierTaskContext.get()
  // save data to disk
  barrierTask.barrier()
  // launch external process, e.g. MPI task + TensorFlow
}
{quote}

This JIRA discusses properly terminating external processes launched by the 
Spark worker when a Spark task is killed or interrupted.


> Proper termination of external processes launched by the worker
> ---
>
> Key: SPARK-26886
> URL: https://issues.apache.org/jira/browse/SPARK-26886
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: luzengxiang
>Priority: Minor
>
> When embedding a deep learning framework in Spark, the Spark worker has to 
> launch an external process (e.g. an MPI task) in some cases.
> val nothing = inputData.barrier().mapPartitions { _ =>
>   val barrierTask = BarrierTaskContext.get()
>   // save data to disk
>   barrierTask.barrier()
>   // launch external process, e.g. MPI task + TensorFlow
> }
> This JIRA discusses properly terminating external processes launched by the 
> Spark worker when a Spark task is killed or interrupted.






[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2019-02-11 Thread luzengxiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765694#comment-16765694
 ] 

luzengxiang commented on SPARK-24374:
-

Hi [~mengxr], I am using the Scala API. 

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}
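
(As a concrete illustration of the "information and tooling" the SPIP mentions, 
BarrierTaskContext can report the addresses of every task in the stage, which a job 
could use to bootstrap an MPI or distributed TensorFlow cluster. A rough sketch; the 
RDD name and the use made of the host list are assumptions, not part of the SPIP:)

{code:scala}
import org.apache.spark.BarrierTaskContext

// Sketch: discover every peer task in the barrier stage, then hand the host
// list and this task's rank to an external training framework.
val clusterInfo = trainingData.barrier().mapPartitions { _ =>
  val ctx = BarrierTaskContext.get()
  ctx.barrier()                                     // wait until all tasks are up

  val hosts = ctx.getTaskInfos().map(_.address)     // one entry per barrier task
  val rank  = ctx.partitionId()                     // this task's index in the stage

  // e.g. build an MPI hostfile or a TF_CONFIG from `hosts` and `rank` here
  Iterator.single(s"task $rank of ${hosts.length}: ${hosts.mkString(",")}")
}
{code}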






[jira] [Comment Edited] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2019-01-08 Thread luzengxiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737849#comment-16737849
 ] 

luzengxiang edited comment on SPARK-24374 at 1/9/19 4:45 AM:
-

Hi [~mengxr]:

I am trying to embed MPI in barrier mode to support distributed TensorFlow, 
just as described in this SPIP. The problem is that the MPI tasks are still 
running when the Spark task is interrupted, which means that with barrier mode 
I can only start the TensorFlow training task but cannot stop it. How can I 
work this out?


was (Author: luzengxiang):
I am trying to embed MPI in barrier mode to support distributed TensorFlow, 
just as described in this SPIP. The problem is that the MPI tasks are still 
running when the Spark task is interrupted, which means that by using barrier 
mode I can only start the TensorFlow training task but cannot stop it. How can 
I work this out?

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}






[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2019-01-08 Thread luzengxiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737849#comment-16737849
 ] 

luzengxiang commented on SPARK-24374:
-

I am trying to embed MPI in barrier mode to support distributed TensorFlow, 
just as described in this SPIP. The problem is that the MPI tasks are still 
running when the Spark task is interrupted, which means that by using barrier 
mode I can only start the TensorFlow training task but cannot stop it. How can 
I work this out?

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}


