> I’d like to point out the output of “git show --stat” for that diff:
> 29 files changed, 130 insertions(+), 1560 deletions(-)

+1 for that and generally for the idea of leveraging spark-submit.

> You can argue that executors downloading from
> external servers would be faster than downloading from the driver, but
> I’m not sure I’d agree - it can go both ways.

On a tangentially related note, one of the main reasons spark-ec2
<https://github.com/amplab/spark-ec2> is so slow to launch clusters is that
it distributes files like the Spark binaries to all the workers via the
master. Because of that, the launch time scaled with the number of workers
requested <https://issues.apache.org/jira/browse/SPARK-5189>.

When I wrote Flintrock <https://github.com/nchammas/flintrock>, I got a
large improvement in launch time over spark-ec2 simply by having all the
workers download the installation files in parallel from an external host
(typically S3 or an Apache mirror). And launch time became largely
independent of the cluster size.
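
As a rough back-of-the-envelope (the numbers are purely illustrative): if
each of n workers needs S bytes of installation files and the master’s
uplink can push B bytes/sec, then

    time via master  ≈ n * S / B
    time via mirror  ≈ S / B    (independent of n)

assuming the external host can serve all the workers concurrently, which S3
and the Apache mirrors are generally able to do at these cluster sizes.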

That may or may not say anything about the driver distributing application
files vs. having init containers do it in parallel, but I’d be curious to
hear more.

Nick

On Tue, Jan 9, 2018 at 9:08 PM Anirudh Ramanathan
<ramanath...@google.com.invalid> wrote:

> We were running a change in our fork which was similar to this at one
> point early on. My biggest concerns off the top of my head with this change
> would be localization performance with large numbers of executors, and what
> we lose in terms of separation of concerns. Init containers are a standard
> construct in k8s for resource localization. It would also be interesting to
> see how this approach affects the HDFS work.
>
> +matt +kimoon
> Still thinking about the potential trade-offs here. Adding Matt and Kimoon
> who would remember more about our reasoning at the time.
>
>
> On Jan 9, 2018 5:22 PM, "Marcelo Vanzin" <van...@cloudera.com> wrote:
>
>> Hello,
>>
>> Me again. I was playing some more with the kubernetes backend and the
>> whole init container thing seemed unnecessary to me.
>>
>> Currently it's used to download remote jars and files, mount the
>> volume into the driver / executor, and place those jars in the
>> classpath / move the files to the working directory. This is all stuff
>> that spark-submit already does without needing extra help.
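>>
>> For illustration, that localization step boils down to something like the
>> sketch below (the names and plain-HTTP handling are mine, not Spark's
>> actual internals, which also deal with hdfs:/s3: URIs, caching and
>> timestamps):
>>
>>   import java.io.File
>>   import java.net.URL
>>   import java.nio.file.{Files, StandardCopyOption}
>>
>>   object LocalizeSketch {
>>     // Illustrative only: download each remote URI into workDir and
>>     // return the local files; jars then go on the classpath, while
>>     // plain files just stay in the working directory.
>>     def localize(uris: Seq[String], workDir: File): Seq[File] =
>>       uris.map { u =>
>>         val dest = new File(workDir, u.substring(u.lastIndexOf('/') + 1))
>>         val in = new URL(u).openStream()
>>         try Files.copy(in, dest.toPath, StandardCopyOption.REPLACE_EXISTING)
>>         finally in.close()
>>         dest
>>       }
>>
>>     def main(args: Array[String]): Unit = {
>>       val workDir = Files.createTempDirectory("deps").toFile
>>       // Placeholder URI standing in for a --jars entry.
>>       localize(Seq("https://example.com/app.jar"), workDir)
>>         .foreach(f => println(s"would add ${f.getPath} to the classpath"))
>>     }
>>   }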
>>
>> So I spent some time hacking stuff and removing the init container
>> code, and launching the driver inside kubernetes using spark-submit
>> (similar to how standalone and mesos cluster modes work):
>>
>> https://github.com/vanzin/spark/commit/k8s-no-init
>>
>> I'd like to point out the output of "git show --stat" for that diff:
>>  29 files changed, 130 insertions(+), 1560 deletions(-)
>>
>> You get massive code reuse by simply using spark-submit. The remote
>> dependencies are downloaded in the driver, and the driver does the job
>> of serving them to executors.
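>>
>> Conceptually the driver is then just a file server for the artifacts it
>> localized. Spark's real mechanism is its internal RPC file server; this
>> plain-HTTP sketch only shows the shape of it:
>>
>>   import com.sun.net.httpserver.{HttpExchange, HttpServer}
>>   import java.net.InetSocketAddress
>>   import java.nio.file.{Files, Paths}
>>
>>   object DriverFileServerSketch {
>>     def main(args: Array[String]): Unit = {
>>       val server = HttpServer.create(new InetSocketAddress(0), 0)
>>       server.createContext("/jars/", (exchange: HttpExchange) => {
>>         // The path after /jars/ names a dependency the driver has
>>         // already downloaded into its local "deps" directory.
>>         val name  = exchange.getRequestURI.getPath.stripPrefix("/jars/")
>>         val bytes = Files.readAllBytes(Paths.get("deps", name))
>>         exchange.sendResponseHeaders(200, bytes.length)
>>         exchange.getResponseBody.write(bytes)
>>         exchange.close()
>>       })
>>       server.start()
>>       println(s"executors would fetch from port ${server.getAddress.getPort}")
>>     }
>>   }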
>>
>> So I guess my question is: is there any advantage in using an init
>> container?
>>
>> The current init container code can download stuff in parallel, but
>> that's an easy improvement to make in spark-submit, and one that would
>> benefit everybody. You can argue that executors downloading from
>> external servers would be faster than downloading from the driver, but
>> I'm not sure I'd agree - it can go both ways.
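>>
>> In spark-submit that change could be as small as wrapping today's
>> sequential per-URI download in a fan-out; a minimal sketch, where
>> fetchOne stands in for the existing single-URI download logic:
>>
>>   import java.io.File
>>   import scala.concurrent.{Await, ExecutionContext, Future}
>>   import scala.concurrent.duration._
>>
>>   object ParallelFetchSketch {
>>     // Kick off every download at once and wait for them all; wall-clock
>>     // time becomes roughly max(per-file time) instead of the sum.
>>     def fetchAll(uris: Seq[String], workDir: File)
>>                 (fetchOne: (String, File) => File): Seq[File] = {
>>       implicit val ec: ExecutionContext = ExecutionContext.global
>>       val all = Future.traverse(uris)(u => Future(fetchOne(u, workDir)))
>>       Await.result(all, 10.minutes)
>>     }
>>   }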
>>
>> Also the same idea could probably be applied to starting executors;
>> Mesos starts executors using "spark-class" already, so doing that
>> would both improve code sharing and potentially simplify some code in
>> the k8s backend.
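>>
>> For reference, "starting executors via spark-class" amounts to building a
>> child-process command along these lines (CoarseGrainedExecutorBackend is
>> the real class name, but treat the exact flags as assumptions - they vary
>> across Spark versions):
>>
>>   object ExecutorCommandSketch {
>>     // Mirrors the shape of what the Mesos backend hands to the agent:
>>     // bin/spark-class plus the executor backend class and its arguments.
>>     def build(sparkHome: String, driverUrl: String, execId: String,
>>               hostname: String, cores: Int, appId: String): Seq[String] =
>>       Seq(s"$sparkHome/bin/spark-class",
>>           "org.apache.spark.executor.CoarseGrainedExecutorBackend",
>>           "--driver-url", driverUrl,
>>           "--executor-id", execId,
>>           "--hostname", hostname,
>>           "--cores", cores.toString,
>>           "--app-id", appId)
>>
>>     def main(args: Array[String]): Unit =
>>       println(build("/opt/spark", "spark://CoarseGrainedScheduler@driver:7078",
>>                     "1", "executor-1", 4, "app-20180109").mkString(" "))
>>   }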
>>
>> --
>> Marcelo
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
