Marcelo, I can see that we might be misunderstanding what this change
implies for performance and some of the deeper implementation details here.
We have a community meeting tomorrow (at 10am PT), and we'll be sure to
explore this idea in detail, understand the implications, and then get
back to you.

Thanks for the detailed responses here, and for spending time with the idea.
(Also, you're more than welcome to attend the meeting - the link is here
<https://github.com/kubernetes/community/tree/master/sig-big-data> if
you're around.)

Cheers,
Anirudh


On Jan 9, 2018 8:05 PM, "Marcelo Vanzin" <van...@cloudera.com> wrote:

One thing I forgot in my previous e-mail is that if a resource is
remote, I'm pretty sure (but haven't double-checked the code) that
executors will download it directly from the remote server, and not
from the driver. So there, distributed download without an init
container.
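
To make that concrete, here is a minimal sketch of what I mean, assuming
a reachable HTTP server (the URL and names below are purely illustrative,
not from any real setup):

    import org.apache.spark.sql.SparkSession

    object RemoteJarExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("remote-jar-example").getOrCreate()

        // Because this URI is remote, each executor fetches the jar straight
        // from the remote server instead of from the driver's file server.
        spark.sparkContext.addJar("https://repo.example.com/libs/extra-dep.jar")

        // ... task code can then load classes from the added jar ...
        spark.stop()
      }
    }

The same should hold for remote URIs passed through --jars on spark-submit.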

On Tue, Jan 9, 2018 at 7:15 PM, Yinan Li <liyinan...@gmail.com> wrote:
> The init-container is required for use with the resource staging server
> (https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server).

If the staging server *requires* an init container, you already have a
design problem right there.

> Additionally, the init-container is a Kubernetes
> native way of making sure that the dependencies are localized

Sorry, but the init container does not do anything by itself. You had
to add a whole bunch of code to execute the existing Spark code in an
init container, when not doing it would have achieved the exact same
goal much more easily, in a way that is consistent with how Spark
already does things.

Matt:
> the executors wouldn’t receive the jars on their class loader until
> after the executor starts

I actually consider that a benefit. It means spark-on-k8s applications
will behave more like all the other backends, where that is also true
(application jars live in a separate class loader).
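
To illustrate the class loader point, a rough sketch of the pattern (the
jar path and class name are made up, and this is not the actual executor
code):

    import java.net.{URL, URLClassLoader}

    object ClassLoaderSketch {
      def main(args: Array[String]): Unit = {
        // The application jar lives in a child class loader, so the executor
        // JVM's own classpath does not need to contain it at process start.
        val appJars = Array(new URL("file:/tmp/spark-work/my-app.jar"))
        val userLoader = new URLClassLoader(appJars, getClass.getClassLoader)

        // User classes are resolved through this loader, not the system one.
        val userClass = userLoader.loadClass("com.example.MyJob")
        println(userClass.getName)
      }
    }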

> traditionally meant to prepare the environment for the application that
> is to be run

You guys are forcing this argument when it all depends on where you
draw the line. Spark can be launched without downloading any of those
dependencies, because Spark will download them for you. Forcing the
"kubernetes way" just means you're writing a lot more code, and
breaking the Spark app initialization into multiple container
invocations, to achieve the same thing.

> would make the SparkSubmit code inadvertently allow running client mode
> Kubernetes applications as well

Not necessarily. I have that in my patch; it doesn't allow client mode
unless a property that only the cluster mode submission code sets is
present. If some user wants to hack their way around that, more power
to them; users can also compile their own Spark without the checks if
they want to try out client mode in some way.
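
To illustrate the kind of guard I mean, here is a rough sketch (the
property name is invented for the example; it is not the one the patch
actually uses):

    import org.apache.spark.SparkConf

    object ClientModeGuardSketch {
      // Hypothetical property, set only by the cluster mode submission code.
      private val SubmittedInClusterMode =
        "spark.kubernetes.example.submittedInClusterMode"

      def checkClientMode(conf: SparkConf, master: String, deployMode: String): Unit = {
        if (master.startsWith("k8s://") && deployMode == "client" &&
            !conf.getBoolean(SubmittedInClusterMode, false)) {
          throw new IllegalArgumentException(
            "Client mode is currently not supported for Kubernetes.")
        }
      }
    }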

Anirudh:
> Telling users that they must rebuild images ... every time seems less
> than convincing to me.

Sure, I'm not proposing people use the docker image approach all the
time. It would be a hassle while developing an app, just as it's kind of
a hassle today that the code doesn't upload local files to the k8s
cluster.

But it's perfectly reasonable for people to optimize a production app
by bundling the app into a pre-built docker image to avoid
re-downloading resources every time, just like they'd probably place the
jar + dependencies on HDFS today with YARN to get the benefits of the
YARN cache.

--
Marcelo
