On Tue, Jan 9, 2018 at 6:25 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> You can argue that executors downloading from
> external servers would be faster than downloading from the driver, but
> I’m not sure I’d agree - it can go both ways.
>
> On a tangentially related note, one of the main reasons spark-ec2 is so slow
> to launch clusters is that it distributes files like the Spark binaries to
> all the workers via the master. Because of that, the launch time scaled with
> the number of workers requested.

It's true that there are side effects. But there are two things that
can be used to mitigate this:

- k8s uses docker images. Users can create docker images with all the
dependencies their app needs, and submit the app using that image.
Spark doesn't yet have documentation on how to create these customized
images, but I'd rather invest time in that than in supporting this
init container approach.

- The original spark-on-k8s spec mentioned a "dependency server"
approach, which sounded like a more generic version of the YARN
distributed cache and which I hope can be a different way of
mitigating that issue. With that work, we could build this
functionality into spark-submit itself and have other backends also
benefit.

In general, forcing the download of dependencies on every invocation
of an app should be avoided.


Anirudh:
> what we lose in terms of separation of concerns

1500 fewer lines of code lower my level of concern a lot more.

-- 
Marcelo
