[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291755#comment-14291755 ]
Murat Eken commented on SPARK-2389:
-----------------------------------

Yes [~sowen], it's about HA for the driver. Our approach is to have a single app that's responsible for initializing the cache at startup (quite expensive) and then serving queries on that cached data (very fast). When you mention "N front-ends talking to a process built around one long running Spark app that can be done right now", are you referring to something like the spark-jobserver (or an alternative) that I mentioned? If so, the problem with that is the single point of failure: we'd just be moving it from the driver to the jobserver instance. Or is there something else we've missed?

> globally shared SparkContext / shared Spark "application"
> ---------------------------------------------------------
>
>                 Key: SPARK-2389
>                 URL: https://issues.apache.org/jira/browse/SPARK-2389
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) states:
> bq. Each application gets its own executor processes, which *stay up for the duration of the whole application* and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that *data cannot be shared* across different Spark applications (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to allow any number of --driver-- client processes to share executors and to share (persistent / cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web app servers) that want to use Spark as a _big computing machine_.
> Most important is the fact that Spark is quite good at caching/persisting data in memory / on disk, thus removing load from backend data stores.
> In other words: it would be really great to let different --driver-- client JVMs operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however require some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of CPUs, etc.) on the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its materialized state.
> With such a feature the overall performance of today's web applications could then be increased by adding more web app servers, more Spark nodes, more NoSQL nodes, etc.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
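[Editor's sketch] The architecture under discussion -- one long-running app that pays an expensive cache-initialization cost once at startup and then serves many fast queries from N front-ends -- can be illustrated without Spark. The following plain-Python sketch stands in for a Spark driver with persisted RDDs; all names are illustrative, and the shared `app` object is exactly the single point of failure Murat describes:

```python
import threading

class LongRunningApp:
    """Stands in for a single long-running Spark driver: the cache is
    built once at startup (expensive), then every query is served from
    the warm in-memory cache (fast)."""

    def __init__(self):
        # Expensive one-time initialization, analogous to loading and
        # persisting an RDD when the driver starts.
        self._cache = {n: n * n for n in range(1000)}

    def query(self, key):
        # Fast lookup against the already-materialized cache.
        return self._cache.get(key)

# N front-ends all talk to the one shared app instance. If this
# process dies, the cache dies with it -- moving it from the driver
# into a jobserver only relocates the single point of failure.
app = LongRunningApp()
results = []
lock = threading.Lock()

def frontend(key):
    """Simulates one front-end server issuing a query."""
    value = app.query(key)
    with lock:
        results.append((key, value))

threads = [threading.Thread(target=frontend, args=(k,)) for k in (2, 3, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [(2, 4), (3, 9), (4, 16)]
```

A globally shared SparkContext, as the issue proposes, would let each front-end be (or talk to) an independent client process while the cached data lives in shared executors, removing this single-process bottleneck.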