[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640086#comment-16640086 ]

Stavros Kontopoulos edited comment on SPARK-23153 at 10/5/18 5:18 PM:
----------------------------------------------------------------------

The question is what you can do when you don't have a distributed cache, as in 
the YARN case. Do we need to upload artifacts in the first place, or fetch them 
remotely (e.g. in cluster mode)? Mesos has the same issue AFAIK and assumes that 
artifacts are available to all agents via a URL 
([http://mesos.apache.org/documentation/latest/fetcher]).
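For reference, the "fetch from a URL" model already works on K8s when 
everything is remote; a minimal sketch (the API server address, image name and 
artifact URLs below are placeholders):

{code}
# Cluster-mode submission where all dependencies are reachable via remote
# URLs, so nothing needs to be uploaded from the submission client.
bin/spark-submit \
  --master k8s://https://my-apiserver:6443 \
  --deploy-mode cluster \
  --class com.example.Main \
  --conf spark.kubernetes.container.image=my-repo/spark:2.4.0 \
  --jars https://artifacts.example.com/libs/extra.jar \
  https://artifacts.example.com/apps/my-app.jar
{code}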

Having pre-populated PVs is, to me, no different as a mechanism from baking 
artifacts into images, since no uploading takes place from the submission side 
to the driver via spark-submit. Someone has to approve the PV contents as well 
when it comes to security. If we can do it in Spark, without going down the 
path of K8s constructs like init containers and without performance issues, 
then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch 
jars from the driver when they update their dependencies, and that contradicts 
the third point.
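To illustrate the pre-populated-PV mechanism, a rough sketch assuming the 2.4 
volume-mount configs (the claim name, mount path and jar path are made up); 
artifacts are then referenced with the {{local://}} scheme instead of being 
uploaded:

{code}
# Mount a pre-populated PVC into driver and executors and point at the
# artifacts with local://; names and paths are placeholders. The claim
# would need a ReadOnlyMany/ReadWriteMany access mode to be shared.
bin/spark-submit \
  --master k8s://https://my-apiserver:6443 \
  --deploy-mode cluster \
  --class com.example.Main \
  --conf spark.kubernetes.container.image=my-repo/spark:2.4.0 \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.deps.mount.path=/opt/deps \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.deps.options.claimName=deps-pvc \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.deps.mount.path=/opt/deps \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.deps.options.claimName=deps-pvc \
  local:///opt/deps/my-app.jar
{code}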

But what do you do when you need driver HA (many people use that)? Then you 
need checkpointing, and you need to store artifacts in some storage like PVs, 
custom images, or HDFS (distributed storage in general, via the Hadoop API). 
If we omit the last two, the only option I see is PVs, with client artifacts 
uploaded to pre-provisioned volumes. On the other hand, PVs can be hard to 
manage from an administration perspective, and that hurts the UX a bit (users 
are lazy; they would say "just let me point to my artifact from the 
spark-submit side").
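To make the HA point concrete: the checkpoint data, and the artifacts 
themselves, have to live somewhere that outlives the driver pod. A sketch with 
placeholder names and a placeholder HDFS URI:

{code}
# Point Structured Streaming checkpointing at distributed storage so a
# restarted driver can recover; the application jar itself must equally
# live in durable storage (image, PV or HDFS), not on the client.
bin/spark-submit \
  --master k8s://https://my-apiserver:6443 \
  --deploy-mode cluster \
  --class com.example.StreamingMain \
  --conf spark.kubernetes.container.image=my-repo/spark:2.4.0 \
  --conf spark.sql.streaming.checkpointLocation=hdfs://namenode:8020/checkpoints/my-app \
  local:///opt/deps/my-streaming-app.jar
{code}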

At the same time, you can't really expose the driver's internal file server, 
because it is not persistent, unless you make it store its artifacts on a PV 
instead of the container's tmp dir. In that scenario we could upload artifacts 
to the driver directly and allow restarts.
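A very rough sketch of that scenario; it assumes (unverified) that backing the 
driver's scratch directory with a PVC via {{spark.local.dir}} would be enough 
for the served artifacts to survive a restart:

{code}
# Hypothetical: mount a PVC at the driver's scratch dir so the internal
# file server's artifacts persist across driver restarts. Whether
# spark.local.dir covers the file server's staging is an assumption here.
bin/spark-submit \
  --master k8s://https://my-apiserver:6443 \
  --deploy-mode cluster \
  --class com.example.Main \
  --conf spark.kubernetes.container.image=my-repo/spark:2.4.0 \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.stage.mount.path=/opt/spark-stage \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.stage.options.claimName=stage-pvc \
  --conf spark.local.dir=/opt/spark-stage \
  https://artifacts.example.com/apps/my-app.jar
{code}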

One last option, for K8s mode only, would be the Spark operator, as it could 
also behave as a staging server. Some thoughts...



> Support application dependencies in submission client's local file system
> -------------------------------------------------------------------------
>
>                 Key: SPARK-23153
>                 URL: https://issues.apache.org/jira/browse/SPARK-23153
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 2.4.0
>            Reporter: Yinan Li
>            Priority: Major
>
> Currently, local dependencies are not supported with Spark on K8S, i.e. if 
> the user has code or dependencies only on the client where they run 
> {{spark-submit}}, then the current implementation has no way to make those 
> visible to the Spark application running inside the K8S pods that get 
> launched. This limits users to running only applications where the code and 
> dependencies are either baked into the Docker images used or available via 
> some external, globally accessible file system, e.g. HDFS, which are not 
> viable options for many users and environments.



