[jira] [Closed] (SPARK-24380) argument quoting/escaping broken in mesos cluster scheduler

2018-05-25 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles closed SPARK-24380.


> argument quoting/escaping broken in mesos cluster scheduler
> ---
>
> Key: SPARK-24380
> URL: https://issues.apache.org/jira/browse/SPARK-24380
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Affects Versions: 2.2.0, 2.3.0
>Reporter: paul mackles
>Priority: Critical
> Fix For: 2.4.0
>
>
> When a configuration property contains shell characters that require quoting, 
> the Mesos cluster scheduler generates the spark-submit argument like so:
> {code:java}
> --conf "spark.mesos.executor.docker.parameters="label=logging=|foo|""{code}
> Note the quotes around the property value as well as the key=value pair. When 
> using docker, this breaks the spark-submit command and causes the "|" to be 
> interpreted as an actual shell pipe. Spaces, semicolons, and other shell 
> metacharacters cause similar problems.
> Although I haven't tried it, I suspect this is also a potential security issue 
> in that someone could exploit it to run arbitrary code on the host.
> My patch is pretty minimal and just removes the outer quotes around the 
> key=value pair, resulting in something like:
> {code:java}
> --conf spark.mesos.executor.docker.parameters="label=logging=|foo|"{code}
> A more extensive fix might try wrapping the entire key=value pair in single 
> quotes but I was concerned about backwards compatibility with that change.
>  
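To make the quoting breakage concrete, here is a small Python sketch using `shlex`, which follows POSIX shell tokenization. The helper name is hypothetical and it single-quotes the value (the "more extensive fix" mentioned above) rather than mirroring the minimal patch:

```python
import shlex

def conf_arg(key, value):
    # Hypothetical helper: quote only the value, since the key is a plain
    # dotted property name while the value may contain shell metacharacters
    # such as |, ;, or spaces. shlex.quote wraps it in single quotes.
    return "--conf {}={}".format(key, shlex.quote(value))

arg = conf_arg("spark.mesos.executor.docker.parameters", "label=logging=|foo|")
# A POSIX shell now sees a single argument with the pipe kept literal:
print(arg)  # --conf spark.mesos.executor.docker.parameters='label=logging=|foo|'
```

Tokenizing the result with `shlex.split` confirms the whole key=value pair survives as one argument instead of being cut at the "|".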



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24380) argument quoting/escaping broken in mesos cluster scheduler

2018-05-25 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles resolved SPARK-24380.
--
Resolution: Duplicate

Duplicate of SPARK-23941; the same bug with a different configuration property.







[jira] [Created] (SPARK-24380) argument quoting/escaping broken

2018-05-24 Thread paul mackles (JIRA)
paul mackles created SPARK-24380:


 Summary: argument quoting/escaping broken
 Key: SPARK-24380
 URL: https://issues.apache.org/jira/browse/SPARK-24380
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Mesos
Affects Versions: 2.3.0, 2.2.0
Reporter: paul mackles
 Fix For: 2.4.0


When a configuration property contains shell characters that require quoting, 
the Mesos cluster scheduler generates the spark-submit argument like so:
{code:java}
--conf "spark.mesos.executor.docker.parameters="label=logging=|foo|""{code}
Note the quotes around the property value as well as the key=value pair. When 
using docker, this breaks the spark-submit command and causes the "|" to be 
interpreted as an actual shell pipe. Spaces, semicolons, and other shell 
metacharacters cause similar problems.

Although I haven't tried it, I suspect this is also a potential security issue in 
that someone could exploit it to run arbitrary code on the host.

My patch is pretty minimal and just removes the outer quotes around the 
key=value pair, resulting in something like:
{code:java}
--conf spark.mesos.executor.docker.parameters="label=logging=|foo|"{code}
A more extensive fix might try wrapping the entire key=value pair in single 
quotes but I was concerned about backwards compatibility with that change.

 






[jira] [Updated] (SPARK-24380) argument quoting/escaping broken in mesos cluster scheduler

2018-05-24 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-24380:
-
Summary: argument quoting/escaping broken in mesos cluster scheduler  (was: 
argument quoting/escaping broken)







[jira] [Created] (SPARK-23988) [Mesos] Improve handling of appResource in mesos dispatcher when using Docker

2018-04-15 Thread paul mackles (JIRA)
paul mackles created SPARK-23988:


 Summary: [Mesos] Improve handling of appResource in mesos 
dispatcher when using Docker
 Key: SPARK-23988
 URL: https://issues.apache.org/jira/browse/SPARK-23988
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.3.0, 2.2.1
Reporter: paul mackles


Our organization makes heavy use of Docker containers when running Spark on 
Mesos. The images we use for our containers include Spark along with all of the 
application dependencies. We find this to be a great way to manage our 
artifacts.

When specifying the primary application jar (i.e. appResource), the mesos 
dispatcher insists on adding it to the list of URIs for Mesos to fetch as part 
of launching the driver's container. This leads to confusing behavior where 
paths such as:
 * file:///application.jar
 * local:/application.jar
 * /application.jar

wind up being fetched from the host where the driver is running. Obviously, 
this doesn't work since all of the above examples are referencing the path of 
the jar on the container image itself.

Here is an example that I used for testing:
{code:java}
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://spark-dispatcher \
  --deploy-mode cluster \
  --conf spark.cores.max=4 \
  --conf spark.mesos.executor.docker.image=spark:2.2.1 \
  local:/usr/local/spark/examples/jars/spark-examples_2.11-2.2.1.jar 10{code}
The "spark:2.2.1" image contains an installation of spark under 
"/usr/local/spark". Notice how we reference the appResource using the "local:/" 
scheme.

If you try the above with the current version of the mesos dispatcher, it will 
try to fetch the path 
"/usr/local/spark/examples/jars/spark-examples_2.11-2.2.1.jar" from the host 
filesystem where the driver's container is running. On our systems, this fails 
since we don't have spark installed on the hosts. 

For the PR, all I did was modify the mesos dispatcher so that it does not add the 
appResource to the list of URIs for Mesos to fetch if it uses the "local:/" 
scheme.

For now, I didn't change the behavior of absolute paths or the "file:/" scheme 
because I wanted to leave some form of the old behavior in place for backwards 
compatibility. Does anyone have any opinions on whether these schemes should 
change as well?
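The scheme check described above can be sketched in a few lines of Python (the function name is hypothetical; the real change lives in the dispatcher's Scala code):

```python
from urllib.parse import urlparse

def should_fetch(app_resource):
    # Resources with the "local:" scheme already exist inside the container
    # image, so the dispatcher should not ask Mesos to fetch them onto the
    # host. Other schemes, and bare paths, keep the old fetch behavior for
    # backwards compatibility.
    return urlparse(app_resource).scheme != "local"

print(should_fetch("local:/usr/local/spark/examples/jars/app.jar"))  # False
print(should_fetch("/application.jar"))                              # True
```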

The PR also includes support for using "spark-internal" with Mesos in cluster 
mode which is something we need for another use-case. I can separate them if 
that makes more sense.

 






[jira] [Updated] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher

2018-04-10 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-23943:
-
Description: 
Two changes in this PR:
 * A /health endpoint for a quick binary indication of the health of 
MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a 
marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. 
Returns a 503 status if the server is unhealthy and a 200 if it is healthy
 * A /status endpoint for a more detailed examination of the current state of a 
MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool

For both endpoints, regardless of status code, the following body is returned:

 
{code:java}
{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "iamok",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true,
  "pendingRetryDrivers" : 0
}{code}
Aside from surfacing all of the scheduler metrics, the response also includes 
the status of the Mesos SchedulerDriver. On numerous occasions now, we have 
observed scenarios where the Mesos SchedulerDriver quietly exits due to some 
other failure. When this happens, jobs queue up and the only way to clean 
things up is to restart the service. 

With the above health check, marathon can be configured to automatically 
restart the MesosClusterDispatcher service when the health check fails, 
lessening the need for manual intervention.

  was:
Two changes:

First, a more robust 
[health-check|http://mesosphere.github.io/marathon/docs/health-checks.html] 
for anyone who runs MesosClusterDispatcher as a marathon app. Specifically, 
this check verifies that the MesosSchedulerDriver is still running as we have 
seen certain cases where it stops (rather quietly) and the only way to revive 
it is a restart. With this health check, marathon will restart the dispatcher 
if the MesosSchedulerDriver stops running. The health check lives at the url 
"/health" and returns a 204 when the server is healthy and a 503 when it is not 
(e.g. the MesosSchedulerDriver stopped running).

Second, a server status endpoint that replies with some basic metrics about the 
server. The status endpoint resides at the url "/status" and responds with:
{code:java}
{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "server OK",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true
}{code}
As you can see, it includes a snapshot of the metrics/health of the scheduler. 
Useful for quick debugging/troubleshooting/monitoring. 
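For illustration, a monitor consuming these endpoints could be sketched as follows (a hypothetical client, with field names taken from the sample body above; both the 200 and 204 healthy codes mentioned across revisions are accepted):

```python
import json

# Sample /status body, copied from the description above.
SAMPLE_BODY = """{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "server OK",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true
}"""

def is_healthy(status_code, body):
    # Healthy means the HTTP layer said so AND the Mesos SchedulerDriver is
    # still running; a quietly-stopped driver flips the flag below, which is
    # exactly the failure mode that motivated the health check.
    payload = json.loads(body)
    return status_code in (200, 204) and not payload["schedulerDriverStopped"]

print(is_healthy(200, SAMPLE_BODY))  # True
```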


> Improve observability of MesosRestServer/MesosClusterDispatcher
> ---
>
> Key: SPARK-23943
> URL: https://issues.apache.org/jira/browse/SPARK-23943
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos
>Affects Versions: 2.2.1, 2.3.0
> Environment:  
>  
>Reporter: paul mackles
>Priority: Minor
> Fix For: 2.4.0
>
>






[jira] [Updated] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher

2018-04-09 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-23943:
-
Description: 
Two changes:

First, a more robust 
[health-check|http://mesosphere.github.io/marathon/docs/health-checks.html] 
for anyone who runs MesosClusterDispatcher as a marathon app. Specifically, 
this check verifies that the MesosSchedulerDriver is still running as we have 
seen certain cases where it stops (rather quietly) and the only way to revive 
it is a restart. With this health check, marathon will restart the dispatcher 
if the MesosSchedulerDriver stops running. The health check lives at the url 
"/health" and returns a 204 when the server is healthy and a 503 when it is not 
(e.g. the MesosSchedulerDriver stopped running).

Second, a server status endpoint that replies with some basic metrics about the 
server. The status endpoint resides at the url "/status" and responds with:
{code:java}
{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "server OK",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true
}{code}
As you can see, it includes a snapshot of the metrics/health of the scheduler. 
Useful for quick debugging/troubleshooting/monitoring. 

  was:
Add a more robust health-check to MesosRestServer so that anyone who runs 
MesosClusterDispatcher as a marathon app can use it to check the health of the 
server:

[http://mesosphere.github.io/marathon/docs/health-checks.html]

Specifically, this check verifies that the MesosSchedulerDriver is still 
running as we have seen certain cases where it  dies (rather quietly) and the 
only way to revive it is a restart. With this health check, marathon will 
restart the dispatcher if the MesosSchedulerDriver stops running. 

The health check lives at the url "/health" and returns a 204 when the server 
is healthy and a 503 when it is not (e.g. the MesosSchedulerDriver stopped 
running).








[jira] [Updated] (SPARK-23943) Improve observability of MesosRestServer

2018-04-09 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-23943:
-
Summary: Improve observability of MesosRestServer  (was: Add more specific 
health check to MesosRestServer)







[jira] [Updated] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher

2018-04-09 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-23943:
-
Summary: Improve observability of MesosRestServer/MesosClusterDispatcher  
(was: Improve observability of MesosRestServer)







[jira] [Updated] (SPARK-23943) Add more specific health check to MesosRestServer

2018-04-09 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-23943:
-
Description: 
Add a more robust health-check to MesosRestServer so that anyone who runs 
MesosClusterDispatcher as a marathon app can use it to check the health of the 
server:

[http://mesosphere.github.io/marathon/docs/health-checks.html]

Specifically, this check verifies that the MesosSchedulerDriver is still 
running as we have seen certain cases where it  dies (rather quietly) and the 
only way to revive it is a restart. With this health check, marathon will 
restart the dispatcher if the MesosSchedulerDriver stops running. 

The health check lives at the url "/health" and returns a 204 when the server 
is healthy and a 503 when it is not (e.g. the MesosSchedulerDriver stopped 
running).







[jira] [Updated] (SPARK-23943) Add more specific health check to MesosRestServer

2018-04-09 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-23943:
-
Environment: 
 

 

  was:
Added a more robust health-check to MesosRestServer so that anyone who runs 
MesosClusterDispatcher as a marathon app can use it to check the health of the 
server:

http://mesosphere.github.io/marathon/docs/health-checks.html

Specifically, this check verifies that the MesosSchedulerDriver is still 
running as we have seen certain cases where it  dies (rather quietly) and the 
only way to revive it is a restart. With this health check, marathon will 
restart the dispatcher if the MesosSchedulerDriver stops running. 

The health check lives at the url "/health" and returns a 204 when the server 
is healthy and a 503 when it is not (e.g. the MesosSchedulerDriver stopped 
running).

 









[jira] [Updated] (SPARK-23943) Add more specific health check to MesosRestServer

2018-04-09 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-23943:
-
Summary: Add more specific health check to MesosRestServer  (was: add more 
specific health check to MesosRestServer)








[jira] [Created] (SPARK-23943) add more specific health check to MesosRestServer

2018-04-09 Thread paul mackles (JIRA)
paul mackles created SPARK-23943:


 Summary: add more specific health check to MesosRestServer
 Key: SPARK-23943
 URL: https://issues.apache.org/jira/browse/SPARK-23943
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Mesos
Affects Versions: 2.3.0, 2.2.1
 Environment: Added a more robust health-check to MesosRestServer so 
that anyone who runs MesosClusterDispatcher as a marathon app can use it to 
check the health of the server:

http://mesosphere.github.io/marathon/docs/health-checks.html

Specifically, this check verifies that the MesosSchedulerDriver is still 
running as we have seen certain cases where it  dies (rather quietly) and the 
only way to revive it is a restart. With this health check, marathon will 
restart the dispatcher if the MesosSchedulerDriver stops running. 

The health check lives at the url "/health" and returns a 204 when the server 
is healthy and a 503 when it is not (e.g. the MesosSchedulerDriver stopped 
running).

 
Reporter: paul mackles
 Fix For: 2.4.0









[jira] [Commented] (SPARK-22256) Introduce spark.mesos.driver.memoryOverhead

2018-04-09 Thread paul mackles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430262#comment-16430262
 ] 

paul mackles commented on SPARK-22256:
--

Created a PR on behalf of [~clehene] since he has moved on to other projects: 
https://github.com/apache/spark/pull/21006

> Introduce spark.mesos.driver.memoryOverhead 
> 
>
> Key: SPARK-22256
> URL: https://issues.apache.org/jira/browse/SPARK-22256
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Cosmin Lehene
>Priority: Minor
>  Labels: docker, memory, mesos
> Fix For: 2.3.1, 2.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When running the spark driver in a container, such as when using the Mesos 
> dispatcher service, we need to apply the same memory-overhead rules as for 
> executors in order to avoid the JVM exceeding the allotted limit and being 
> killed. 
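Spark's executor-side default reserves an overhead of max(384 MB, 10% of the heap) on top of the JVM heap. A sketch of applying the same rule to the driver (hypothetical function, not the actual patch; the proposed property would supply the explicit override):

```python
def total_driver_memory_mb(driver_mem_mb, overhead_mb=None):
    # Mirror the executor overhead rule for the driver container: reserve
    # max(384 MB, 10% of the requested heap) unless an explicit override
    # (i.e. spark.mesos.driver.memoryOverhead) is given.
    if overhead_mb is None:
        overhead_mb = max(384, int(driver_mem_mb * 0.10))
    return driver_mem_mb + overhead_mb

print(total_driver_memory_mb(1024))  # 1408 (the 384 MB minimum applies)
print(total_driver_memory_mb(8192))  # 9011 (10% of 8192 = 819)
```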






[jira] [Updated] (SPARK-22256) Introduce spark.mesos.driver.memoryOverhead

2018-04-09 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-22256:
-
Fix Version/s: 2.4.0
   2.3.1







[jira] [Updated] (SPARK-22256) Introduce spark.mesos.driver.memoryOverhead

2018-04-09 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-22256:
-
Affects Version/s: 2.3.0







[jira] [Commented] (SPARK-11499) Spark History Server UI should respect protocol when doing redirection

2018-01-19 Thread paul mackles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332811#comment-16332811
 ] 

paul mackles commented on SPARK-11499:
--

We ran into this issue running the spark-history server as a Marathon app on a 
Mesos cluster. As is typical for this kind of setup, there is a reverse proxy 
that users go through to access the app. In our case, we are also offloading 
SSL to the reverse proxy, so communications between the reverse proxy and 
spark-history are plain old HTTP. I experimented with two different fixes:
 # Making sure that the SparkUI and History components look at 
APPLICATION_WEB_PROXY_BASE when generating redirect URLs. In order for it to 
honor the protocol, APPLICATION_WEB_PROXY_BASE must include the desired 
protocol (e.g. APPLICATION_WEB_PROXY_BASE=https://example.com)
 # Using Jetty's built-in ForwardedRequestCustomizer class to process the 
"X-Forwarded-*" headers defined in RFC 7239. 

Both changes worked in our environment and both are fairly simple. I am 
looking for feedback on whether one solution is preferable to the other. For 
our environment, #2 is preferable because:
 * The reverse proxy we use is already sending these headers. 
 * It allows the spark-history server to see the actual client info as opposed 
to that of the proxy

If there are no strong feelings one way or the other, I'll submit a PR for 
solution #2. 

References:
 * [https://tools.ietf.org/html/rfc7239]
 * 
[http://download.eclipse.org/jetty/stable-9/apidocs/org/eclipse/jetty/server/ForwardedRequestCustomizer.html]
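A minimal sketch of what option #2 buys: the redirect scheme is taken from the proxy's X-Forwarded-* headers when present. The header names are the de-facto ones that RFC 7239 standardizes, and in the real fix Jetty's ForwardedRequestCustomizer does this parsing; the function below is only an illustration:

```python
def redirect_location(headers, host, path):
    # Mimic of what the ForwardedRequestCustomizer arranges: trust the
    # reverse proxy's X-Forwarded-* headers when building absolute URLs,
    # so an SSL-offloading proxy yields https redirects.
    scheme = headers.get("X-Forwarded-Proto", "http")
    host = headers.get("X-Forwarded-Host", host)
    return f"{scheme}://{host}{path}"

# Behind an SSL-offloading proxy, the redirect now stays on https:
print(redirect_location({"X-Forwarded-Proto": "https"},
                        "xxx.yyy.com:8775", "/history/app-1234"))
```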

 

 

> Spark History Server UI should respect protocol when doing redirection
> --
>
> Key: SPARK-11499
> URL: https://issues.apache.org/jira/browse/SPARK-11499
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Lukasz Jastrzebski
>Priority: Major
>
> Use case:
> The Spark history server is behind a load balancer secured with an SSL 
> certificate. Unfortunately, clicking on the application link redirects to the 
> http protocol, which may not be exposed by the load balancer. Example flow:
> *   Trying 52.22.220.1...
> * Connected to xxx.yyy.com (52.22.220.1) port 8775 (#0)
> * WARNING: SSL: Certificate type not set, assuming PKCS#12 format.
> * Client certificate: u...@yyy.com
> * TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
> * Server certificate: *.yyy.com
> * Server certificate: Entrust Certification Authority - L1K
> * Server certificate: Entrust Root Certification Authority - G2
> > GET /history/20151030-160604-3039174572-5951-22401-0004 HTTP/1.1
> > Host: xxx.yyy.com:8775
> > User-Agent: curl/7.43.0
> > Accept: */*
> >
> < HTTP/1.1 302 Found
> < Location: 
> http://xxx.yyy.com:8775/history/20151030-160604-3039174572-5951-22401-0004
> < Connection: close
> < Server: Jetty(8.y.z-SNAPSHOT)
> <
> * Closing connection 0






[jira] [Created] (SPARK-23088) History server not showing incomplete/running applications

2018-01-16 Thread paul mackles (JIRA)
paul mackles created SPARK-23088:


 Summary: History server not showing incomplete/running applications
 Key: SPARK-23088
 URL: https://issues.apache.org/jira/browse/SPARK-23088
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 2.2.1, 2.1.2
Reporter: paul mackles


The history server does not show incomplete/running applications when the 
_spark.history.ui.maxApplications_ property is set to a value that is smaller 
than the total number of applications.

I believe this is because any applications where completed=false wind up at the 
end of the list of apps returned by the /applications endpoint; when 
_spark.history.ui.maxApplications_ is set, that list gets truncated and the 
running apps are never returned.

The fix I have in mind is to modify the history template to start passing the 
_status_ parameter when calling the /applications endpoint (status=completed is 
the default).

I am running Spark in a Mesos environment, but I don't think that is relevant 
to this issue.
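The fix described above amounts to having the UI request running apps explicitly. The endpoint path and status values below are the history server's REST API; the host is a placeholder:

```python
from urllib.parse import urlencode

def applications_url(base, status="completed"):
    # /api/v1/applications accepts status=completed|running. The history
    # template currently omits it, so with maxApplications set, the
    # truncated list (running apps sort to the end) can drop running apps.
    return f"{base}/api/v1/applications?{urlencode({'status': status})}"

print(applications_url("http://history.example.com:18080", "running"))
```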






[jira] [Commented] (SPARK-22528) History service and non-HDFS filesystems

2017-11-21 Thread paul mackles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260727#comment-16260727
 ] 

paul mackles commented on SPARK-22528:
--

In case anyone else bumps into this, I received some feedback from the 
data-lake team at MSFT:

This is expected behavior since Hadoop supports Kerberos-based identity whereas 
Data Lake supports OAuth2 – Azure Active Directory (AAD). The bridge/mapping 
between Kerberos and AAD OAuth2 is supported only in the Azure HDInsight 
cluster today.

OAuth2 support in Hadoop is a non-trivial task and is in progress: 
https://issues.apache.org/jira/browse/HADOOP-11744

The workaround for the limitation (specific to Data Lake) is to add the 
following to core-site.xml:
{code:xml}
<property>
  <name>adl.debug.override.localuserasfileowner</name>
  <value>true</value>
</property>
{code}

What does this configuration do?
FileStatus contains the user/group information associated with the object ID 
from AAD. The Hadoop driver replaces the object ID with the local Hadoop user 
under the context of the Hadoop process. The actual file information in Data 
Lake remains unchanged, only shadowed behind the local Hadoop user.


> History service and non-HDFS filesystems
> 
>
> Key: SPARK-22528
> URL: https://issues.apache.org/jira/browse/SPARK-22528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: paul mackles
>Priority: Minor
>
> We are using Azure Data Lake (ADL) to store our event logs. This worked fine 
> in 2.1.x but in 2.2.0, the event logs are no longer visible to the history 
> server. I tracked it down to the call to:
> {code}
> SparkHadoopUtil.get.checkAccessPermission()
> {code}
> which was added to "FSHistoryProvider" in 2.2.0.
> I was able to work around it by:
> * setting the files on ADL to world readable
> * setting HADOOP_PROXY to the Azure objectId of the service principal that 
> owns the file
> Neither of those workarounds is particularly desirable in our environment. 
> That said, I am not sure how this should be addressed:
> * Is this an issue with the Azure/Hadoop bindings not setting up the user 
> context correctly so that the "checkAccessPermission()" call succeeds w/out 
> having to use the username under which the process is running?
> * Is this an issue with "checkAccessPermission()" not really accounting for 
> all of the possible FileSystem implementations? If so, I would imagine that 
> there are similar issues when using S3.
> In spite of this check, I know the files are accessible through the 
> underlying FileSystem object so it feels like the latter but I don't think 
> that the FileSystem object alone could be used to implement this check.
> Any thoughts [~jerryshao]?






[jira] [Updated] (SPARK-22528) History service and non-HDFS filesystems

2017-11-15 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-22528:
-
Description: 
We are using Azure Data Lake (ADL) to store our event logs. This worked fine in 
2.1.x but in 2.2.0, the event logs are no longer visible to the history server. 
I tracked it down to the call to:

{code}
SparkHadoopUtil.get.checkAccessPermission()
{code}

which was added to "FSHistoryProvider" in 2.2.0.

I was able to work around it by:
* setting the files on ADL to world readable
* setting HADOOP_PROXY to the Azure objectId of the service principal that owns 
the file

Neither of those workarounds is particularly desirable in our environment. That 
said, I am not sure how this should be addressed:
* Is this an issue with the Azure/Hadoop bindings not setting up the user 
context correctly so that the "checkAccessPermission()" call succeeds w/out 
having to use the username under which the process is running?
* Is this an issue with "checkAccessPermission()" not really accounting for all 
of the possible FileSystem implementations? If so, I would imagine that there 
are similar issues when using S3.

In spite of this check, I know the files are accessible through the underlying 
FileSystem object so it feels like the latter but I don't think that the 
FileSystem object alone could be used to implement this check.

Any thoughts [~jerryshao]?


  was:
We are using Azure Data Lake (ADL) to store our event logs. This worked fine in 
2.1.x but in 2.2.0, the event logs are no longer visible to the history server. 
I tracked it down to the call to:

{code}
SparkHadoopUtil.get.checkAccessPermission()
{code}

which was added to "FSHistoryProvider" in 2.2.0.

I was able to work around it by:
* setting the files to world readable
* setting HADOOP_PROXY to the Azure objectId of the service principal that owns 
the file

Neither of those workarounds is particularly desirable in our environment. That 
said, I am not sure how this should be addressed:
* Is this an issue with the Azure/Hadoop bindings not setting up the user 
context correctly so that the "checkAccessPermission()" call succeeds w/out 
having to use the username under which the process is running?
* Is this an issue with "checkAccessPermission()" not really accounting for all 
of the possible FileSystem implementations? If so, I would imagine that there 
are similar issues with using S3.

In spite of this check, I know the files are accessible through the underlying 
FileSystem object so it feels like the latter, but I don't think that the 
FileSystem object alone could be used to implement this check.



> History service and non-HDFS filesystems
> 
>
> Key: SPARK-22528
> URL: https://issues.apache.org/jira/browse/SPARK-22528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: paul mackles
>Priority: Minor
>
> We are using Azure Data Lake (ADL) to store our event logs. This worked fine 
> in 2.1.x but in 2.2.0, the event logs are no longer visible to the history 
> server. I tracked it down to the call to:
> {code}
> SparkHadoopUtil.get.checkAccessPermission()
> {code}
> which was added to "FSHistoryProvider" in 2.2.0.
> I was able to work around it by:
> * setting the files on ADL to world readable
> * setting HADOOP_PROXY to the Azure objectId of the service principal that 
> owns the file
> Neither of those workarounds is particularly desirable in our environment. 
> That said, I am not sure how this should be addressed:
> * Is this an issue with the Azure/Hadoop bindings not setting up the user 
> context correctly so that the "checkAccessPermission()" call succeeds w/out 
> having to use the username under which the process is running?
> * Is this an issue with "checkAccessPermission()" not really accounting for 
> all of the possible FileSystem implementations? If so, I would imagine that 
> there are similar issues when using S3.
> In spite of this check, I know the files are accessible through the 
> underlying FileSystem object so it feels like the latter but I don't think 
> that the FileSystem object alone could be used to implement this check.
> Any thoughts [~jerryshao]?






[jira] [Created] (SPARK-22528) History service and non-HDFS filesystems

2017-11-15 Thread paul mackles (JIRA)
paul mackles created SPARK-22528:


 Summary: History service and non-HDFS filesystems
 Key: SPARK-22528
 URL: https://issues.apache.org/jira/browse/SPARK-22528
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: paul mackles
Priority: Minor


We are using Azure Data Lake (ADL) to store our event logs. This worked fine in 
2.1.x but in 2.2.0, the event logs are no longer visible to the history server. 
I tracked it down to the call to:

{code}
SparkHadoopUtil.get.checkAccessPermission()
{code}

which was added to "FSHistoryProvider" in 2.2.0.

I was able to work around it by:
* setting the files to world readable
* setting HADOOP_PROXY to the Azure objectId of the service principal that owns 
the file

Neither of those workarounds is particularly desirable in our environment. That 
said, I am not sure how this should be addressed:
* Is this an issue with the Azure/Hadoop bindings not setting up the user 
context correctly so that the "checkAccessPermission()" call succeeds w/out 
having to use the username under which the process is running?
* Is this an issue with "checkAccessPermission()" not really accounting for all 
of the possible FileSystem implementations? If so, I would imagine that there 
are similar issues with using S3.

In spite of this check, I know the files are accessible through the underlying 
FileSystem object so it feels like the latter, but I don't think that the 
FileSystem object alone could be used to implement this check.







[jira] [Created] (SPARK-22287) [MESOS] SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher

2017-10-16 Thread paul mackles (JIRA)
paul mackles created SPARK-22287:


 Summary: [MESOS] SPARK_DAEMON_MEMORY not honored by 
MesosClusterDispatcher
 Key: SPARK-22287
 URL: https://issues.apache.org/jira/browse/SPARK-22287
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.2.0, 2.1.1, 2.3.0
Reporter: paul mackles
Priority: Minor


There does not appear to be a way to control the heap size used by 
MesosClusterDispatcher as the SPARK_DAEMON_MEMORY environment variable is not 
honored for that particular daemon.






[jira] [Updated] (SPARK-22287) SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher

2017-10-16 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-22287:
-
Summary: SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher  (was: 
[MESOS] SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher)

> SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
> -
>
> Key: SPARK-22287
> URL: https://issues.apache.org/jira/browse/SPARK-22287
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.1, 2.2.0, 2.3.0
>Reporter: paul mackles
>Priority: Minor
>
> There does not appear to be a way to control the heap size used by 
> MesosClusterDispatcher as the SPARK_DAEMON_MEMORY environment variable is not 
> honored for that particular daemon.






[jira] [Commented] (SPARK-12832) Mesos cluster mode should handle constraints

2017-09-30 Thread paul mackles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187059#comment-16187059
 ] 

paul mackles commented on SPARK-12832:
--

I am thinking this one should be closed. It appears to be a duplicate of 
SPARK-19606.

> Mesos cluster mode should handle constraints
> 
>
> Key: SPARK-12832
> URL: https://issues.apache.org/jira/browse/SPARK-12832
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: astralidea
>
> On Mesos, some machines are used for Spark and others for other workloads, 
> e.g. an ELK cluster. If the driver lands on a machine without the Spark 
> runtime, the job will fail. CoarseMesosSchedulerBackend has a constraints 
> feature, but the dispatcher deploys through MesosClusterScheduler, which is 
> a different code path.






[jira] [Commented] (SPARK-19606) Support constraints in spark-dispatcher

2017-09-30 Thread paul mackles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187058#comment-16187058
 ] 

paul mackles commented on SPARK-19606:
--

+1 to being able to constrain drivers and another +1 to [~pgillet]'s suggestion 
for allowing drivers to be constrained to different resources than the 
executors. 

However, given my understanding, I don't think that using 
"spark.mesos.dispatcher.driverDefault.spark.mesos.constraints" will work. If 
"spark.mesos.constraints" is passed with the job then it will wind up 
overriding the value specified in the "driverDefault" property. If 
"spark.mesos.constraints" is not passed with the job, then the value specified 
in the "driverDefault" property will get passed to the executors - which we 
definitely don't want.

To maintain backwards compatibility while allowing drivers/executors to be 
constrained to either the same or different resources, I propose an additional 
property:

spark.mesos.constraints.driver

The new property could be set per job or for all jobs using 
"spark.mesos.dispatcher.driverDefault.*". The existing property 
"spark.mesos.constraints" would continue to apply to executors only.

If we can come to a consensus on this, I am happy to work on the PR.
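For reference, Mesos-style constraint specs as Spark parses them look like `attr:v1,v2;attr2`, where an empty value list means the offer must merely have the attribute. A rough sketch of the matching that the proposed spark.mesos.constraints.driver property would reuse (illustrative only, not Spark's actual code):

```python
def parse_constraints(spec):
    # "rack:us-east;ssd" -> {"rack": {"us-east"}, "ssd": set()}
    # An empty value set means "offer must merely have the attribute".
    out = {}
    for clause in filter(None, spec.split(";")):
        attr, _, values = clause.partition(":")
        out[attr] = set(values.split(",")) if values else set()
    return out

def offer_matches(offer_attrs, constraints):
    # Every constrained attribute must be present on the offer and, when
    # values were given, the offer's value must be one of them.
    return all(attr in offer_attrs and (not values or offer_attrs[attr] in values)
               for attr, values in constraints.items())

print(offer_matches({"rack": "us-east"},
                    parse_constraints("rack:us-east,us-west")))  # True
```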

> Support constraints in spark-dispatcher
> ---
>
> Key: SPARK-19606
> URL: https://issues.apache.org/jira/browse/SPARK-19606
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Philipp Hoffmann
>
> The `spark.mesos.constraints` configuration is ignored by the 
> spark-dispatcher. The constraints need to be passed in the Framework 
> information when registering with Mesos.






[jira] [Commented] (SPARK-22135) metrics in spark-dispatcher not being registered properly

2017-09-26 Thread paul mackles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16181804#comment-16181804
 ] 

paul mackles commented on SPARK-22135:
--

here is the PR: https://github.com/apache/spark/pull/19358

> metrics in spark-dispatcher not being registered properly
> -
>
> Key: SPARK-22135
> URL: https://issues.apache.org/jira/browse/SPARK-22135
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Affects Versions: 2.1.0, 2.2.0
>Reporter: paul mackles
>Priority: Minor
>
> There is a bug in the way that the metrics in 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterSchedulerSource are 
> initialized such that they are never registered with the underlying registry. 
> Basically, each call to the overridden "metricRegistry" function results in 
> the creation of a new registry. PR is forthcoming.






[jira] [Updated] (SPARK-22135) metrics in spark-dispatcher not being registered properly

2017-09-26 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-22135:
-
Description: There is a bug in the way that the metrics in 
org.apache.spark.scheduler.cluster.mesos.MesosClusterSchedulerSource are 
initialized such that they are never registered with the underlying registry. 
Basically, each call to the overridden "metricRegistry" function results in the 
creation of a new registry. PR is forthcoming.  (was: There is a bug in the way 
that the metrics in 
org.apache.spark.scheduler.cluster.mesos.MesosClusterSchedulerSource are 
initialized such that they are never registered with the underlying registry. 
Basically, each call to the overridden "metricRegistry" function results in the 
creation of a new registry. Patch is forthcoming.)

> metrics in spark-dispatcher not being registered properly
> -
>
> Key: SPARK-22135
> URL: https://issues.apache.org/jira/browse/SPARK-22135
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Affects Versions: 2.1.0, 2.2.0
>Reporter: paul mackles
>Priority: Minor
>
> There is a bug in the way that the metrics in 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterSchedulerSource are 
> initialized such that they are never registered with the underlying registry. 
> Basically, each call to the overridden "metricRegistry" function results in 
> the creation of a new registry. PR is forthcoming.






[jira] [Created] (SPARK-22135) metrics in spark-dispatcher not being registered properly

2017-09-26 Thread paul mackles (JIRA)
paul mackles created SPARK-22135:


 Summary: metrics in spark-dispatcher not being registered properly
 Key: SPARK-22135
 URL: https://issues.apache.org/jira/browse/SPARK-22135
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Mesos
Affects Versions: 2.2.0, 2.1.0
Reporter: paul mackles
Priority: Minor


There is a bug in the way that the metrics in 
org.apache.spark.scheduler.cluster.mesos.MesosClusterSchedulerSource are 
initialized such that they are never registered with the underlying registry. 
Basically, each call to the overridden "metricRegistry" function results in the 
creation of a new registry. Patch is forthcoming.
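The bug pattern described above is language-agnostic: an accessor that builds a fresh registry on every call silently discards registrations. A Python analogue (class and metric names are illustrative; the real code is the Scala MesosClusterSchedulerSource):

```python
class BuggySource:
    @property
    def metric_registry(self):
        # A new registry on every access: gauges registered on one copy
        # are invisible to the reporter reading another copy.
        return {}

class FixedSource:
    def __init__(self):
        self._registry = {}

    @property
    def metric_registry(self):
        # One registry instance, created once and always returned.
        return self._registry

buggy, fixed = BuggySource(), FixedSource()
buggy.metric_registry["waitingDrivers"] = lambda: 0
fixed.metric_registry["waitingDrivers"] = lambda: 0
print(len(buggy.metric_registry), len(fixed.metric_registry))  # 0 1
```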


