Hi Ashwin,

Let me try to answer to the best of my knowledge.

On Wed, Oct 22, 2014 at 11:47 AM, Ashwin Shankar
<ashwinshanka...@gmail.com> wrote:
> Here are my questions :
> 1. Sharing spark context : How exactly multiple users can share the cluster
> using same spark
>     context ?

That's not something you'd usually want to do. In general, a
SparkContext maps to a single user application, so each user would
submit their own job, which would create its own SparkContext.
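As a minimal sketch of that model (the object name, app name, and
numbers below are just placeholders I made up):

    import org.apache.spark.{SparkConf, SparkContext}

    object MyUserApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("my-user-app") // placeholder name
        val sc = new SparkContext(conf) // this context belongs to this app only

        // run this user's jobs against sc
        val count = sc.parallelize(1 to 100).count()
        println(s"count = $count")

        sc.stop() // the context goes away with the app
      }
    }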

If you want to go outside of Spark, there are projects that allow you
to manage SparkContext instances outside of applications and
potentially share them, such as
https://github.com/spark-jobserver/spark-jobserver. But make sure you
actually need it; since you haven't really explained your use case,
it's hard to tell.

> 2. Different spark context in YARN: assuming I have a YARN cluster with
> queues and preemption
>     configured. Are there problems if executors/containers of a spark app
> are preempted to allow a
>     high priority spark app to execute ?

As far as I understand, preemption will cause those executors to be
killed, which means Spark will retry the affected tasks and rebuild
the data those executors held when it's needed. YARN mode does have a
configurable upper limit on the number of executor failures, so if
your job keeps getting preempted it will eventually fail (unless you
tweak that setting).
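For illustration, something like the following should raise that
limit. I believe the key is "spark.yarn.max.executor.failures" in the
YARN-mode docs for recent versions, but double-check the name for the
version you're running; the app name and value are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("preemptible-app")                  // placeholder name
      .set("spark.yarn.max.executor.failures", "100") // default is tied to executor count
    val sc = new SparkContext(conf)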

I don't recall whether YARN has an API that lets clients stop
executors cleanly when they're preempted, but even if it does, I don't
think Spark supports that at the moment.

> How are user names passed on from spark to yarn(say I'm
> using nested user queues feature in fair scheduler) ?

Spark will try to run the job as the requesting user. If you're not
using Kerberos, the processes themselves will run as whatever user
runs the YARN daemons, but the Spark app will run inside a
"UserGroupInformation.doAs()" call as the requesting user, so nested
queues should work as expected.
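Roughly the idea, heavily simplified (this is not Spark's actual code,
and the user name is made up), using the Hadoop UserGroupInformation
API:

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    val requestingUser = "ashwin" // hypothetical user name
    val ugi = UserGroupInformation.createRemoteUser(requestingUser)

    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        // submit / run the app here; work done in this block is
        // attributed to requestingUser, which is what the fair
        // scheduler sees when picking a queue
      }
    })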

> 3. Sharing RDDs in 1 and 2 above ?

I'll assume you don't mean sharing RDDs within the same context, but
between different SparkContext instances. You might (big might here)
be able to checkpoint an RDD from one context and load it from
another; that's actually how some HA-like features for Spark drivers
are being addressed.
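To illustrate the general idea of handing data between contexts
through durable storage (this sketch uses saveAsObjectFile/objectFile
rather than the checkpoint machinery itself, and the HDFS path and app
names are made up):

    import org.apache.spark.{SparkConf, SparkContext}

    // In application A (its own SparkContext):
    val scA = new SparkContext(new SparkConf().setAppName("producer"))
    val rdd = scA.parallelize(1 to 1000).map(i => (i, i * i))
    rdd.saveAsObjectFile("hdfs:///tmp/shared-rdd") // hypothetical path
    scA.stop()

    // Later, in application B (a different SparkContext):
    val scB = new SparkContext(new SparkConf().setAppName("consumer"))
    val reloaded = scB.objectFile[(Int, Int)]("hdfs:///tmp/shared-rdd")
    println(reloaded.count())
    scB.stop()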

The job server I mentioned above, which allows different apps to share
the same SparkContext, also has a feature to share RDDs by name,
without having to resort to checkpointing.

Hope this helps!

-- 
Marcelo
