To execute code on Spark we have a long-lived process that periodically runs
jobs programmatically on a Spark cluster, i.e. without spark-submit. The jobs
it executes have varying memory requirements, so we want the Spark Driver to
run in the cluster.

As far as we understand Spark, this kind of architecture does not work very
well. The issue is that there is no way to run with deployMode=cluster: that
setting is ignored when jobs are launched programmatically (why is it not an
exception?). This in turn means that our launching application has to run on
a machine big enough for the worst-case Spark Driver, which is impractical
for our use case (a generic, always-on Machine Learning server).
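To make that concrete, the sketch below is roughly what our launcher does
today (the master URL, memory value, and job body are placeholders). The
SparkSession is created inside our long-lived process, so that process itself
becomes the driver, and the cluster-related settings shown have no effect:

import org.apache.spark.sql.SparkSession

object InProcessLaunchSketch {
  def runJob(jobName: String): Long = {
    val spark = SparkSession.builder()
      .appName(jobName)
      .master("spark://spark-master:7077")   // placeholder master URL
      // Ignored when the context is created in-process; deploy mode is
      // only honoured by spark-submit.
      .config("spark.submit.deployMode", "cluster")
      // Also ineffective: this JVM is already running and is the driver.
      .config("spark.driver.memory", "16g")
      .getOrCreate()
    try spark.range(1000000L).count()        // stand-in for the real job body
    finally spark.stop()
  }
}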

What we would rather do is have the Scala closure that has access to the
Spark Context be treated as the Spark Driver and run in the cluster. There
seems to be no way to do this with off-the-shelf Spark.

This seems like a very common use case, but maybe we are too close to it. We
are aware of Spark Job Server and Apache Livy, which seem to give us what we
need.
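For reference, using Livy's programmatic client (livy-client-http) would look
roughly like the sketch below; the endpoint, jar path, and job body are
placeholders, so treat this as an illustration rather than something we have
in production. The body of call() runs inside a driver process that Livy
manages on the cluster side, which is more or less the "run the closure next
to the Spark Context on the cluster" model we described above:

import java.io.File
import java.net.URI
import org.apache.livy.{Job, JobContext, LivyClientBuilder}

// The body of call() executes in Livy's remote driver, not in our server process.
class CountJob(n: Long) extends Job[Long] {
  override def call(ctx: JobContext): Long =
    ctx.sc().sc.range(0L, n).count()
}

object LivySubmitSketch {
  def main(args: Array[String]): Unit = {
    val client = new LivyClientBuilder()
      .setURI(new URI("http://livy-host:8998"))   // placeholder Livy endpoint
      .build()
    try {
      // Make the jar containing CountJob visible to the remote driver.
      client.uploadJar(new File("target/our-jobs.jar")).get()
      val rows: Long = client.submit(new CountJob(1000000L)).get()
      println(s"count = $rows")
    } finally {
      client.stop(true)   // also shuts down the remote Spark context
    }
  }
}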

Are these the best solutions? Is there a way to do what we want without
spark-submit? Have others here solved this in some other way?
