[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

Sean Owen (JIRA) Mon, 26 Jan 2015 05:20:11 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291804#comment-14291804
 ]


Sean Owen commented on SPARK-2389:
----------------------------------

Yes, makes sense. Maxing out one driver isn't an issue since you can have many 
drivers (or push work into the cluster). The issue is really that each driver 
then has its own RDDs, and if you need 100s of drivers to keep up, that just 
won't work. (Although then I'd question how so much work is being done on the 
Spark driver?)

The redundancy of all those RDDs is what HDFS caching and Tachyon 
could in theory help with, although those help share data outside Spark. Whether 
that works for a particular use case right now is a different question, 
although I suspect it makes more sense to make those work than to start yet 
another solution.

What you are describing -- mutating lots of shared in-memory state -- doesn't 
sound like a problem Spark helps solve per se. That is, it doesn't sound like 
work that has to live in a Spark driver program, even if it needs to ask a 
Spark driver-based service for some results. Naturally you know your problem 
better than I, but I am wondering if the answer here isn't just using Spark 
differently, for what it's for.

> globally shared SparkContext / shared Spark "application"
> ---------------------------------------------------------
>
>                 Key: SPARK-2389
>                 URL: https://issues.apache.org/jira/browse/SPARK-2389
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good at caching/persisting data in 
> memory / on disk thus removing load from backend data stores.
> That is, it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would, however, require some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> NoSQL nodes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
