[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416846#comment-17416846 ] Stavros Kontopoulos edited comment on SPARK-23153 at 9/17/21, 6:25 PM:
---
[~xuzhoyin] Sorry for the late reply. The local scheme in the past meant local to the container, i.e. it had a different meaning (https://github.com/apache/spark/pull/21378), so this was intentional. Not sure about the status now. Btw, regarding the S3 prefix, if I remember correctly the idea was not to download files from a remote location locally and then store them again, e.g. to S3; this was intended for local files only. Feel free to add any other capabilities.

was (Author: skonto):
[~xuzhoyin] Sorry for the late reply. The local scheme in the past meant local to the container, i.e. it had a different meaning (https://github.com/apache/spark/pull/21378), so this was intentional. Not sure about the status now. Btw, if I remember correctly the idea was not to download files from a remote location locally and then store them again, e.g. to S3.

> Support application dependencies in submission client's local file system
> --------------------------------------------------------------------------
>
> Key: SPARK-23153
> URL: https://issues.apache.org/jira/browse/SPARK-23153
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes, Spark Core
> Affects Versions: 2.4.0
> Reporter: Yinan Li
> Assignee: Stavros Kontopoulos
> Priority: Major
> Fix For: 3.0.0
>
> Currently local dependencies are not supported with Spark on K8S, i.e. if the user has code or dependencies only on the client where they run {{spark-submit}}, then the current implementation has no way to make those visible to the Spark application running inside the K8S pods that get launched. This limits users to only running applications where the code and dependencies are either baked into the Docker images used, or where those are available via some external and globally accessible file system, e.g. HDFS, which are not viable options for many users and environments.
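A minimal sketch of the scheme distinction discussed above (illustrative only: the paths and bucket name are hypothetical, and the assumption is that only submission-client files, i.e. file:// or no scheme, are candidates for upload to the Spark 3.x {{spark.kubernetes.file.upload.path}} location, while local:// means "already inside the container" and remote schemes such as s3a:// are fetched as-is):

{code:scala}
import java.net.URI

// Illustrative only, with hypothetical paths: classifies dependency URIs the way
// the schemes are described in the comment above.
val deps = Seq(
  "local:///opt/spark/extra/1.jar",  // already inside the container image: used as-is
  "s3a://some-bucket/path/2.jar",    // remote HCFS location: fetched remotely, not re-uploaded
  "file:///home/me/3.jar",           // present only on the submission client: candidate for upload
  "/home/me/4.jar"                   // no scheme: treated like file://
)

deps.foreach { dep =>
  Option(new URI(dep).getScheme).getOrElse("file") match {
    case "file"  => println(s"$dep -> upload to spark.kubernetes.file.upload.path, then rewrite")
    case "local" => println(s"$dep -> resolve inside the driver/executor containers")
    case other   => println(s"$dep -> leave for the '$other' filesystem to serve")
  }
}
{code}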
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416846#comment-17416846 ] Stavros Kontopoulos edited comment on SPARK-23153 at 9/17/21, 6:23 PM:
---
[~xuzhoyin] Sorry for the late reply. The local scheme in the past meant local to the container, i.e. it had a different meaning (https://github.com/apache/spark/pull/21378), so this was intentional. Not sure about the status now. Btw, if I remember correctly the idea was not to download files from a remote location locally and then store them again, e.g. to S3.

was (Author: skonto):
[~xuzhoyin] Sorry for the late reply. The local scheme in the past meant local to the container, i.e. it had a different meaning (https://github.com/apache/spark/pull/21378), so this was intentional. Not sure about the status now.
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416846#comment-17416846 ] Stavros Kontopoulos edited comment on SPARK-23153 at 9/17/21, 6:21 PM:
---
[~xuzhoyin] Sorry for the late reply. The local scheme in the past meant local to the container, i.e. it had a different meaning (https://github.com/apache/spark/pull/21378), so this was intentional. Not sure about the status now.

was (Author: skonto):
[~xuzhoyin] Sorry for the late reply. The local scheme in the past meant local to the container, i.e. it had a different meaning (https://github.com/apache/spark/pull/21378). Not sure about the status now.
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188222#comment-17188222 ] Xuzhou Yin edited comment on SPARK-23153 at 9/1/20, 7:37 AM:
-
Hi guys, I have looked through the pull request for this change, and there is one part which I don't quite understand; it would be awesome if someone could explain it a little bit. At this line: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L161, Spark filters out all paths which are not local (i.e. no scheme or the file:// scheme). Does that mean it will ignore all other paths which are not local? For example, when starting a Spark job with spark.jars=local:///local/path/1.jar,s3://s3/path/2.jar,file:///local/path/3.jar, it seems like this logic will upload file:///local/path/3.jar to S3 and reset spark.jars to only s3://upload/path/3.jar, while completely ignoring local:///local/path/1.jar and s3://s3/path/2.jar. Is this expected behavior? If so, what should we do if we want to specify dependencies which are in an HCFS such as S3, or on the driver's local file system (i.e. local://), instead of file://? If this is a bug, is there a Jira issue for it? Thanks a lot! [~skonto]

was (Author: xuzhoyin):
Hi guys, I have looked through the pull request for this change, and there is one part which I don't quite understand; it would be awesome if someone could explain it a little bit. At this line: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L161, Spark filters out all paths which are not local (i.e. no scheme or the file:// scheme). Does that mean it will ignore all other paths which are not local? For example, when starting a Spark job with spark.jars=local:///local/path/1.jar,s3://s3/path/2.jar,file:///local/path/3.jar, it seems like this logic will upload file:///local/path/3.jar to S3 and reset spark.jars to only s3://upload/path/3.jar, while completely ignoring local:///local/path/1.jar and s3://s3/path/2.jar. Is this expected behavior? If so, what should we do if we want to specify dependencies which are in an HCFS such as S3, or on the driver's local file system (i.e. local://), instead of file://? If this is a bug, is there a Jira issue for it? Thanks a lot!
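To make the behaviour being asked about concrete, here is a rough, non-authoritative paraphrase in Scala of the filter-then-upload pattern described above. The helper and parameter names are invented for illustration and are not Spark's actual internals; see BasicDriverFeatureStep/KubernetesUtils for the real code.

{code:scala}
import java.net.URI

// Invented names; this only paraphrases the behaviour described in the question above.
def looksClientLocal(uri: String): Boolean =
  Option(new URI(uri).getScheme).getOrElse("file") == "file"  // no scheme or file://

def rewriteJars(jars: Seq[String], upload: String => String): Seq[String] = {
  // Only submission-client-local entries are picked up and uploaded...
  val uploaded = jars.filter(looksClientLocal).map(upload)
  // ...and, per the reading above, spark.jars ends up set to just the uploaded copies,
  // i.e. the local:// and s3:// entries are not merged back in at this point.
  uploaded
}

// The spark.jars value from the question:
val jars = Seq("local:///local/path/1.jar", "s3://s3/path/2.jar", "file:///local/path/3.jar")
println(rewriteJars(jars, _ => "s3://upload/path/3.jar").mkString(","))  // prints only the uploaded copy
{code}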
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188222#comment-17188222 ] Xuzhou Yin edited comment on SPARK-23153 at 9/1/20, 7:33 AM:
-
Hi guys, I have looked through the pull request for this change, and there is one part which I don't quite understand; it would be awesome if someone could explain it a little bit. At this line: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L161, Spark filters out all paths which are not local (i.e. no scheme or the file:// scheme). Does that mean it will ignore all other paths which are not local? For example, when starting a Spark job with spark.jars=local:///local/path/1.jar,s3://s3/path/2.jar,file:///local/path/3.jar, it seems like this logic will upload file:///local/path/3.jar to S3 and reset spark.jars to only s3://upload/path/3.jar, while completely ignoring local:///local/path/1.jar and s3://s3/path/2.jar. Is this expected behavior? If so, what should we do if we want to specify dependencies which are in an HCFS such as S3, or on the driver's local file system (i.e. local://), instead of file://? If this is a bug, is there a Jira issue for it? Thanks a lot!

was (Author: xuzhoyin):
Hi guys, I have looked through the pull request for this change, and there is one part I don't quite understand; it would be awesome if someone could explain it a little bit. At this line: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L161, it filters out all paths which are not local (i.e. no scheme or the file:// scheme). Does it ignore all other paths which are not local? For example, when starting a Spark job with spark.jars=local:///local/path/1.jar,s3://s3/path/2.jar,file:///local/path/3.jar, it seems like this logic will upload file:///local/path/3.jar to S3 and reset spark.jars to only s3://upload/path/3.jar, while completely ignoring local:///local/path/1.jar and s3://s3/path/2.jar. Is this expected behavior? If so, what should we do if we want to specify dependencies which are in an HCFS such as S3 instead of local? If this is a bug, is there a Jira issue for it? Thanks a lot!
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876192#comment-16876192 ] Stavros Kontopoulos edited comment on SPARK-23153 at 7/1/19 1:41 PM:
-
[~ejblanco] Nope. [~cloud_fan] Is there going to be another 2.4.x release? Does it make sense to backport?

was (Author: skonto):
[~cloud_fan] Is there going to be another 2.4.x release? Does it make sense to backport?
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700356#comment-16700356 ] Stavros Kontopoulos edited comment on SPARK-23153 at 11/27/18 12:40 PM:
-
[~eje] [~rvesse] [~liyinan926] Working on a document to capture the options here: https://docs.google.com/document/d/1peg_qVhLaAl4weo5C51jQicPwLclApBsdR1To2fgc48

was (Author: skonto):
[~eje] [~rvesse] [~liyinan926] Working on the document to capture the options here: https://docs.google.com/document/d/1peg_qVhLaAl4weo5C51jQicPwLclApBsdR1To2fgc48
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640086#comment-16640086 ] Stavros Kontopoulos edited comment on SPARK-23153 at 10/5/18 5:12 PM:
--
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK. Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs. On the other hand, PVs can be hard to manage in general from an administration perspective, and that breaks the UX a bit (users are lazy; they would say "just let me point to my artifact from the spark-submit side"). Then again, you can't really expose the driver's internal file server, because it is not persistent unless you make it store its artifacts on a PV. In that scenario we could upload artifacts to the driver directly.

was (Author: skonto):
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK. Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs. On the other hand, PVs can be hard to manage in general from an administration perspective, and that breaks the UX a bit. Then again, you can't really expose the driver's internal file server, because it is not persistent unless you make it store its artifacts on a PV. In that scenario we could upload artifacts to the driver directly.
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640086#comment-16640086 ] Stavros Kontopoulos edited comment on SPARK-23153 at 10/5/18 5:10 PM:
--
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK. Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs. On the other hand, PVs can be hard to manage in general from an administration perspective, and that breaks the UX a bit. Then again, you can't really expose the driver's internal file server, because it is not persistent unless you make it store its artifacts on a PV.

was (Author: skonto):
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK. Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs.
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640086#comment-16640086 ] Stavros Kontopoulos edited comment on SPARK-23153 at 10/5/18 5:13 PM:
--
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK. Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs. On the other hand, PVs can be hard to manage in general from an administration perspective, and that breaks the UX a bit (users are lazy; they would say "just let me point to my artifact from the spark-submit side"). Then again, you can't really expose the driver's internal file server, because it is not persistent unless you make it store its artifacts on a PV instead of the container's tmp dir. In that scenario we could upload artifacts to the driver directly and allow restarts.

was (Author: skonto):
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK. Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs. On the other hand, PVs can be hard to manage in general from an administration perspective, and that breaks the UX a bit (users are lazy; they would say "just let me point to my artifact from the spark-submit side"). Then again, you can't really expose the driver's internal file server, because it is not persistent unless you make it store its artifacts on a PV. In that scenario we could upload artifacts to the driver directly.
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640086#comment-16640086 ] Stavros Kontopoulos edited comment on SPARK-23153 at 10/5/18 5:08 PM:
--
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK. Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs.

was (Author: skonto):
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK. Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs.
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640086#comment-16640086 ] Stavros Kontopoulos edited comment on SPARK-23153 at 10/5/18 5:18 PM:
--
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK and assumes that artifacts are available to all slaves via a URL (http://mesos.apache.org/documentation/latest/fetcher). Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general, via the Hadoop API). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs. On the other hand, PVs can be hard to manage in general from an administration perspective, and that breaks the UX a bit (users are lazy; they would say "just let me point to my artifact from the spark-submit side"). Then again, you can't really expose the driver's internal file server, because it is not persistent unless you make it store its artifacts on a PV instead of the container's tmp dir. In that scenario we could upload artifacts to the driver directly and allow restarts. One last option, for K8s-only mode, would be the Spark operator, as it could also behave as a staging server. Some thoughts...

was (Author: skonto):
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK and assumes that artifacts are available to all slaves via a URL (http://mesos.apache.org/documentation/latest/fetcher). Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general, via the Hadoop API). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs. On the other hand, PVs can be hard to manage in general from an administration perspective, and that breaks the UX a bit (users are lazy; they would say "just let me point to my artifact from the spark-submit side"). Then again, you can't really expose the driver's internal file server, because it is not persistent unless you make it store its artifacts on a PV instead of the container's tmp dir. In that scenario we could upload artifacts to the driver directly and allow restarts.
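As a purely illustrative sketch of the pre-populated-PV option discussed above: the volume configuration keys follow the documented spark.kubernetes.*.volumes.* pattern, while the claim name, mount path, and jar name are hypothetical.

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: assumes a PVC named "spark-deps" that has been pre-populated with the
// application's extra jars. Claim/volume names and paths are hypothetical.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.volumes.persistentVolumeClaim.deps.options.claimName", "spark-deps")
  .set("spark.kubernetes.driver.volumes.persistentVolumeClaim.deps.mount.path", "/opt/deps")
  .set("spark.kubernetes.driver.volumes.persistentVolumeClaim.deps.mount.readOnly", "true")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.deps.options.claimName", "spark-deps")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.deps.mount.path", "/opt/deps")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.deps.mount.readOnly", "true")
  // With the volume mounted, dependencies can be referenced as container-local paths:
  .set("spark.jars", "local:///opt/deps/app-extra.jar")
{code}

Whether one claim can be mounted read-only by the driver and many executors at once depends on the PV's access mode, which is part of the administration overhead mentioned above.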
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640086#comment-16640086 ] Stavros Kontopoulos edited comment on SPARK-23153 at 10/5/18 5:07 PM:
--
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK. Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs.

was (Author: skonto):
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK. Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs.
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640086#comment-16640086 ] Stavros Kontopoulos edited comment on SPARK-23153 at 10/5/18 5:16 PM:
--
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK and assumes that artifacts are available to all slaves via a URL (http://mesos.apache.org/documentation/latest/fetcher). Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general, via the Hadoop API). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs. On the other hand, PVs can be hard to manage in general from an administration perspective, and that breaks the UX a bit (users are lazy; they would say "just let me point to my artifact from the spark-submit side"). Then again, you can't really expose the driver's internal file server, because it is not persistent unless you make it store its artifacts on a PV instead of the container's tmp dir. In that scenario we could upload artifacts to the driver directly and allow restarts.

was (Author: skonto):
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK and assumes that artifacts are available to all slaves via a URL (http://mesos.apache.org/documentation/latest/fetcher). Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs. On the other hand, PVs can be hard to manage in general from an administration perspective, and that breaks the UX a bit (users are lazy; they would say "just let me point to my artifact from the spark-submit side"). Then again, you can't really expose the driver's internal file server, because it is not persistent unless you make it store its artifacts on a PV instead of the container's tmp dir. In that scenario we could upload artifacts to the driver directly and allow restarts.
[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640086#comment-16640086 ] Stavros Kontopoulos edited comment on SPARK-23153 at 10/5/18 5:14 PM:
--
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK and assumes that artifacts are available to all slaves via a URL (http://mesos.apache.org/documentation/latest/fetcher). Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs. On the other hand, PVs can be hard to manage in general from an administration perspective, and that breaks the UX a bit (users are lazy; they would say "just let me point to my artifact from the spark-submit side"). Then again, you can't really expose the driver's internal file server, because it is not persistent unless you make it store its artifacts on a PV instead of the container's tmp dir. In that scenario we could upload artifacts to the driver directly and allow restarts.

was (Author: skonto):
The question is what you can do when you don't have a distributed cache like in the YARN case. Do we need to upload artifacts in the first place, or fetch them remotely (e.g. cluster mode)? Mesos has the same issue AFAIK. Having pre-populated PVs is, to me, no different as a mechanism from images, since no uploading takes place from the submission side to the driver via spark-submit. Someone has to approve the PVs' contents as well when it comes to security. If we can do it in Spark, without going down the path of using K8s constructs like init containers and without performance issues, then we should be OK. Even now, if I'm not mistaken, executors on K8s fetch jars from the driver when they update their dependencies, and that contradicts the third point. But what do you do when you need driver HA (many people use that)? Then you need checkpointing, and you need to store artifacts on some storage like PVs, custom images, or HDFS (distributed storage in general). If we omit the last two, then the only option I see is PVs, where client artifacts are uploaded via PVs. On the other hand, PVs can be hard to manage in general from an administration perspective, and that breaks the UX a bit (users are lazy; they would say "just let me point to my artifact from the spark-submit side"). Then again, you can't really expose the driver's internal file server, because it is not persistent unless you make it store its artifacts on a PV instead of the container's tmp dir. In that scenario we could upload artifacts to the driver directly and allow restarts.