[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints

2017-05-24 Thread Kamal Gurala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022481#comment-16022481
 ] 

Kamal Gurala commented on SPARK-4899:
-

[~mgummelt] I found the discussion here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Mesos-checkpointing-td21293.html

> Support Mesos features: roles and checkpoints
> -
>
> Key: SPARK-4899
> URL: https://issues.apache.org/jira/browse/SPARK-4899
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 1.2.0
>Reporter: Andrew Ash
>
> Inspired by https://github.com/apache/spark/pull/60
> Mesos has two features that would be nice for Spark to take advantage of:
> 1. Roles -- a way to specify ACLs and priorities for users
> 2. Checkpoints -- a way to restart a failed Mesos slave without losing all 
> the work that was happening on the box
> Some of these may require a Mesos upgrade past our current 0.18.1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20562) Support Maintenance by having a threshold for unavailability

2017-05-03 Thread Kamal Gurala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995493#comment-15995493
 ] 

Kamal Gurala commented on SPARK-20562:
--

Spark is not smart about offers from nodes that are scheduled for maintenance, 
i.e. offers that have an unavailability period set. It also cannot estimate how 
long a scheduled task would use an offer.
It is, however, easier for users to estimate how long they expect a task to 
run. They can therefore set a configurable threshold (x) that makes the Spark 
scheduler wary of offers whose nodes might go under maintenance within x 
amount of time.
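
A minimal sketch of the threshold check described above, in Python for brevity. The `Offer` class, its `unavailability_start_ns` field, and `is_safe_to_accept` are hypothetical names used only for illustration; they are not Spark or Mesos APIs. The sketch assumes the offer exposes the start of its maintenance window in nanoseconds since the epoch.

```python
from dataclasses import dataclass
from typing import Optional

NANOS_PER_SECOND = 1_000_000_000


@dataclass
class Offer:
    """Hypothetical stand-in for a Mesos offer; not the real protobuf class."""
    offer_id: str
    # Assumed field: start of the maintenance window in nanoseconds since
    # the epoch, or None when the offer has no unavailability set.
    unavailability_start_ns: Optional[int] = None


def is_safe_to_accept(offer: Offer, now_ns: int, threshold_s: float) -> bool:
    """Return True if no maintenance window starts within threshold_s of now.

    An offer whose window has already started is treated as unsafe, which is
    a deliberately conservative choice for this sketch.
    """
    if offer.unavailability_start_ns is None:
        return True
    seconds_until_maintenance = (
        offer.unavailability_start_ns - now_ns
    ) / NANOS_PER_SECOND
    return seconds_until_maintenance > threshold_s
```

With a threshold of 600 seconds, an offer whose maintenance starts in 5 minutes would be declined, while one whose maintenance starts in an hour would still be accepted.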
 

> Support Maintenance by having a threshold for unavailability
> 
>
> Key: SPARK-20562
> URL: https://issues.apache.org/jira/browse/SPARK-20562
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kamal Gurala
>
> Make Spark aware of offers that have an unavailability period set because of 
> scheduled maintenance on the node.
> Add a configurable threshold that ensures tasks are not scheduled on offers 
> whose maintenance starts within that threshold.






[jira] [Updated] (SPARK-20562) Support Maintenance by having a threshold for unavailability

2017-05-03 Thread Kamal Gurala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamal Gurala updated SPARK-20562:
-
Issue Type: Improvement  (was: Bug)

> Support Maintenance by having a threshold for unavailability
> 
>
> Key: SPARK-20562
> URL: https://issues.apache.org/jira/browse/SPARK-20562
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kamal Gurala
>
> Make Spark aware of offers that have an unavailability period set because of 
> scheduled maintenance on the node.
> Add a configurable threshold that ensures tasks are not scheduled on offers 
> whose maintenance starts within that threshold.






[jira] [Updated] (SPARK-20562) Support Maintenance by having a threshold for unavailability

2017-05-02 Thread Kamal Gurala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamal Gurala updated SPARK-20562:
-
Description: 
Make Spark aware of offers that have an unavailability period set because of 
scheduled maintenance on the node.

Add a configurable threshold that ensures tasks are not scheduled on offers 
whose maintenance starts within that threshold.

  was:
Make Spark aware of offers that have an unavailability period set because of 
scheduled maintenance on the node.
Add a configurable threshold that ensures tasks are not scheduled on offers 
whose maintenance starts within that threshold.


> Support Maintenance by having a threshold for unavailability
> 
>
> Key: SPARK-20562
> URL: https://issues.apache.org/jira/browse/SPARK-20562
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kamal Gurala
>
> Make Spark aware of offers that have an unavailability period set because of 
> scheduled maintenance on the node.
> Add a configurable threshold that ensures tasks are not scheduled on offers 
> whose maintenance starts within that threshold.






[jira] [Created] (SPARK-20562) Support Maintenance by having a threshold for unavailability

2017-05-02 Thread Kamal Gurala (JIRA)
Kamal Gurala created SPARK-20562:


 Summary: Support Maintenance by having a threshold for 
unavailability
 Key: SPARK-20562
 URL: https://issues.apache.org/jira/browse/SPARK-20562
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.1.0
Reporter: Kamal Gurala


Make Spark aware of offers that have an unavailability period set because of 
scheduled maintenance on the node.
Add a configurable threshold that ensures tasks are not scheduled on offers 
whose maintenance starts within that threshold.






[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints

2017-05-01 Thread Kamal Gurala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992255#comment-15992255
 ] 

Kamal Gurala commented on SPARK-4899:
-

Can one of the admins verify this patch?


> Support Mesos features: roles and checkpoints
> -
>
> Key: SPARK-4899
> URL: https://issues.apache.org/jira/browse/SPARK-4899
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 1.2.0
>Reporter: Andrew Ash
>
> Inspired by https://github.com/apache/spark/pull/60
> Mesos has two features that would be nice for Spark to take advantage of:
> 1. Roles -- a way to specify ACLs and priorities for users
> 2. Checkpoints -- a way to restart a failed Mesos slave without losing all 
> the work that was happening on the box
> Some of these may require a Mesos upgrade past our current 0.18.1






[jira] [Issue Comment Deleted] (SPARK-4899) Support Mesos features: roles and checkpoints

2017-05-01 Thread Kamal Gurala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamal Gurala updated SPARK-4899:

Comment: was deleted

(was: Can one of the admins verify this patch?
)

> Support Mesos features: roles and checkpoints
> -
>
> Key: SPARK-4899
> URL: https://issues.apache.org/jira/browse/SPARK-4899
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 1.2.0
>Reporter: Andrew Ash
>
> Inspired by https://github.com/apache/spark/pull/60
> Mesos has two features that would be nice for Spark to take advantage of:
> 1. Roles -- a way to specify ACLs and priorities for users
> 2. Checkpoints -- a way to restart a failed Mesos slave without losing all 
> the work that was happening on the box
> Some of these may require a Mesos upgrade past our current 0.18.1






[jira] [Issue Comment Deleted] (SPARK-20419) Support for Mesos Maintenance primitives

2017-05-01 Thread Kamal Gurala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamal Gurala updated SPARK-20419:
-
Comment: was deleted

(was: Can one of the admins verify this patch?)

> Support for Mesos Maintenance primitives
> 
>
> Key: SPARK-20419
> URL: https://issues.apache.org/jira/browse/SPARK-20419
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Scheduler
>Affects Versions: 2.1.0
>Reporter: Kamal Gurala
>Priority: Minor
>
> With Mesos 0.25.0, maintenance primitives have been added.
> https://mesos.apache.org/documentation/latest/maintenance/
> Based on the documentation it appears frameworks can be maintenance aware.
> Spark should be able to respect the Mesos concepts of maintenance modes, 
> inverse offers, or unavailability and not schedule tasks on resources that 
> will go under maintenance.






[jira] [Commented] (SPARK-20419) Support for Mesos Maintenance primitives

2017-05-01 Thread Kamal Gurala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992229#comment-15992229
 ] 

Kamal Gurala commented on SPARK-20419:
--

Can one of the admins verify this patch?

> Support for Mesos Maintenance primitives
> 
>
> Key: SPARK-20419
> URL: https://issues.apache.org/jira/browse/SPARK-20419
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Scheduler
>Affects Versions: 2.1.0
>Reporter: Kamal Gurala
>Priority: Minor
>
> With Mesos 0.25.0, maintenance primitives have been added.
> https://mesos.apache.org/documentation/latest/maintenance/
> Based on the documentation it appears frameworks can be maintenance aware.
> Spark should be able to respect the Mesos concepts of maintenance modes, 
> inverse offers, or unavailability and not schedule tasks on resources that 
> will go under maintenance.






[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints

2017-04-24 Thread Kamal Gurala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981837#comment-15981837
 ] 

Kamal Gurala commented on SPARK-4899:
-

https://github.com/apache/spark/pull/17750

> Support Mesos features: roles and checkpoints
> -
>
> Key: SPARK-4899
> URL: https://issues.apache.org/jira/browse/SPARK-4899
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 1.2.0
>Reporter: Andrew Ash
>
> Inspired by https://github.com/apache/spark/pull/60
> Mesos has two features that would be nice for Spark to take advantage of:
> 1. Roles -- a way to specify ACLs and priorities for users
> 2. Checkpoints -- a way to restart a failed Mesos slave without losing all 
> the work that was happening on the box
> Some of these may require a Mesos upgrade past our current 0.18.1






[jira] [Updated] (SPARK-20419) Support for Mesos Maintenance primitives

2017-04-20 Thread Kamal Gurala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamal Gurala updated SPARK-20419:
-
Description: 
With Mesos 0.25.0, maintenance primitives have been added.
https://mesos.apache.org/documentation/latest/maintenance/
Based on the documentation it appears frameworks can be maintenance aware.
Spark should be able to respect the Mesos concepts of maintenance modes, 
inverse offers, or unavailability and not schedule tasks on resources that will 
go under maintenance.

  was:
With Mesos 0.25.0, maintenance primitives have been added.
https://mesos.apache.org/documentation/latest/maintenance/
Based on the documentation it appears frameworks can be maintenance aware.
Spark should be able to respect the Mesos concepts of maintenance modes or 
inverse offers and not schedule tasks on resources that will go under 
maintenance.


> Support for Mesos Maintenance primitives
> 
>
> Key: SPARK-20419
> URL: https://issues.apache.org/jira/browse/SPARK-20419
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Scheduler
>Affects Versions: 2.1.0
>Reporter: Kamal Gurala
>Priority: Minor
>
> With Mesos 0.25.0, maintenance primitives have been added.
> https://mesos.apache.org/documentation/latest/maintenance/
> Based on the documentation it appears frameworks can be maintenance aware.
> Spark should be able to respect the Mesos concepts of maintenance modes, 
> inverse offers, or unavailability and not schedule tasks on resources that 
> will go under maintenance.






[jira] [Created] (SPARK-20419) Support for Mesos Maintenance primitives

2017-04-20 Thread Kamal Gurala (JIRA)
Kamal Gurala created SPARK-20419:


 Summary: Support for Mesos Maintenance primitives
 Key: SPARK-20419
 URL: https://issues.apache.org/jira/browse/SPARK-20419
 Project: Spark
  Issue Type: Improvement
  Components: Mesos, Scheduler
Affects Versions: 2.1.0
Reporter: Kamal Gurala
Priority: Minor


With Mesos 0.25.0, maintenance primitives have been added.
https://mesos.apache.org/documentation/latest/maintenance/
Based on the documentation it appears frameworks can be maintenance aware.
Spark should be able to respect the Mesos concepts of maintenance modes or 
inverse offers and not schedule tasks on resources that will go under 
maintenance.






[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints

2017-04-03 Thread Kamal Gurala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954312#comment-15954312
 ] 

Kamal Gurala commented on SPARK-4899:
-

Some performance-related concerns:
https://github.com/apache/spark/pull/60#r16817226

> Support Mesos features: roles and checkpoints
> -
>
> Key: SPARK-4899
> URL: https://issues.apache.org/jira/browse/SPARK-4899
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 1.2.0
>Reporter: Andrew Ash
>
> Inspired by https://github.com/apache/spark/pull/60
> Mesos has two features that would be nice for Spark to take advantage of:
> 1. Roles -- a way to specify ACLs and priorities for users
> 2. Checkpoints -- a way to restart a failed Mesos slave without losing all 
> the work that was happening on the box
> Some of these may require a Mesos upgrade past our current 0.18.1






[jira] [Commented] (SPARK-20054) [Mesos] Detectability for resource starvation

2017-03-22 Thread Kamal Gurala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936759#comment-15936759
 ] 

Kamal Gurala commented on SPARK-20054:
--

Yes, the logs do help detect the issue. 
Do you think a new config option that gives resources back to the cluster if 
`spark.scheduler.minRegisteredResourcesRatio` is not met after a configurable 
amount of time would be of interest?
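
A minimal sketch of how such an option could behave, in Python for brevity. `StarvationGuard`, `release_timeout_s`, and `should_release` are hypothetical names chosen for illustration; none of them exist in Spark. Only `spark.scheduler.minRegisteredResourcesRatio` is an existing setting, and the sketch simply mirrors it as a plain number.

```python
from dataclasses import dataclass


@dataclass
class StarvationGuard:
    """Hypothetical helper; none of these names exist in Spark."""
    # Mirrors spark.scheduler.minRegisteredResourcesRatio.
    min_registered_ratio: float
    # The proposed new option; the name is an assumption.
    release_timeout_s: float
    # Time (in seconds) at which the framework began acquiring resources.
    started_at_s: float

    def should_release(self, registered: int, expected: int, now_s: float) -> bool:
        """True when the ratio is still unmet after the timeout has elapsed."""
        if expected <= 0:
            return False
        ratio_met = registered / expected >= self.min_registered_ratio
        timed_out = now_s - self.started_at_s >= self.release_timeout_s
        return timed_out and not ratio_met
```

A framework holding 2 of 10 expected executors with a ratio of 0.8 and a timeout of 120 seconds would keep its resources at 60 seconds and give them back from 120 seconds onward.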

> [Mesos] Detectability for resource starvation
> -
>
> Key: SPARK-20054
> URL: https://issues.apache.org/jira/browse/SPARK-20054
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Scheduler
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Kamal Gurala
>Priority: Minor
>
> We currently use Mesos 1.1.0 for our Spark cluster in coarse-grained mode. We 
> recently had a production issue in which our Spark frameworks accepted 
> resources from the Mesos master, so executors were started and the Spark 
> driver was aware of them, but the driver didn’t schedule any tasks and 
> nothing happened for a long time because the minimum registered resources 
> threshold was not met. The cluster is usually under-provisioned because not 
> all the jobs need to run at the same time. The held resources were never 
> offered back to the master for re-allocation, which brought the entire 
> cluster to a halt until we intervened manually. 
> We use DRF for Mesos and FIFO for Spark. At any point in time there could be 
> 10-15 Spark frameworks running on Mesos on the under-provisioned cluster. 
> The ask is for better recoverability or detectability in a scenario where 
> individual Spark frameworks hold onto resources but never launch any tasks, 
> or to have these frameworks release the resources after a fixed amount of 
> time.






[jira] [Created] (SPARK-20054) [Mesos] Detectability for resource starvation

2017-03-21 Thread Kamal Gurala (JIRA)
Kamal Gurala created SPARK-20054:


 Summary: [Mesos] Detectability for resource starvation
 Key: SPARK-20054
 URL: https://issues.apache.org/jira/browse/SPARK-20054
 Project: Spark
  Issue Type: Improvement
  Components: Mesos, Scheduler
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: Kamal Gurala
Priority: Minor


We currently use Mesos 1.1.0 for our Spark cluster in coarse-grained mode. We 
recently had a production issue in which our Spark frameworks accepted 
resources from the Mesos master, so executors were started and the Spark driver 
was aware of them, but the driver didn’t schedule any tasks and nothing 
happened for a long time because the minimum registered resources threshold was 
not met. The cluster is usually under-provisioned because not all the jobs need 
to run at the same time. The held resources were never offered back to the 
master for re-allocation, which brought the entire cluster to a halt until we 
intervened manually. 

We use DRF for Mesos and FIFO for Spark. At any point in time there could be 
10-15 Spark frameworks running on Mesos on the under-provisioned cluster. 

The ask is for better recoverability or detectability in a scenario where 
individual Spark frameworks hold onto resources but never launch any tasks, or 
to have these frameworks release the resources after a fixed amount of time.


