[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-08-19 Thread Ashish Rawat (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703595#comment-14703595
 ] 

Ashish Rawat commented on SPARK-2389:
-

Hi [~pwendell],

We are facing exactly the problem you described, and we are looking for 
exactly the solution you mentioned, i.e. driver HA :)

I have a few questions/comments on the perspective you shared:
1. "If we started to go down this path, we'd need to do things like define a 
standard serialization format for the RDD data, a global namespace for RDD's, 
persistence, etc. And then you're building a filesystem."
You only need to preserve the RDD metainfo and not the actual RDDs, so the 
serialization format for RDD data should not add any complexity.
2. Although the cached data is recoverable, how can we reduce the latency of 
rebuilding a cache of several TBs for a live application?
3. Can we not just prevent the executors from shutting down, preserve some 
important driver info and reconnect the driver? Or provide a hot standby for 
the driver?
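
To make point 1 concrete, here is a minimal sketch of persisting only a dataset's recipe (its source and the names of its transformation steps) and recomputing the data from it. It is a toy model in plain Python, not Spark's API: `STEPS`, `load_source`, `describe` and `rebuild` are hypothetical names, and it assumes every step can be looked up by name after a restart, which arbitrary driver code does not guarantee.

```python
import json

# Hypothetical registry of named, side-effect-free steps. Only the names are
# persisted, so each step must be resolvable by name after a restart.
STEPS = {
    "double": lambda xs: [x * 2 for x in xs],
    "keep_gt_4": lambda xs: [x for x in xs if x > 4],
}


def load_source(source):
    """Stand-in for reading a durable input; here the source describes a range."""
    return list(range(source["start"], source["stop"]))


def describe(source, step_names):
    """Serialize only the recipe (source plus step names), never the data."""
    return json.dumps({"source": source, "steps": step_names})


def rebuild(metadata_json):
    """Recompute the dataset from its persisted recipe."""
    meta = json.loads(metadata_json)
    data = load_source(meta["source"])
    for name in meta["steps"]:
        data = STEPS[name](data)
    return data


def demo():
    meta = describe({"start": 0, "stop": 6}, ["double", "keep_gt_4"])
    return rebuild(meta)
```

Rebuilding from the recipe still pays the full recomputation cost, which is the latency concern raised in point 2.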

> globally shared SparkContext / shared Spark "application"
> -
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good at caching/persisting data in 
> memory / on disk, thus removing load from backend data stores.
> That means it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> nosql nodes etc



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-02-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327903#comment-14327903
 ] 

Patrick Wendell commented on SPARK-2389:


I've seen some variants of this question over time. Usually the set of 
requirements from the user is like this:

1. We are building a long lived application that uses a single Spark Context 
containing cached RDD's, and it dispatches requests from multiple users.
2. We want it to be fault tolerant, so using a single point of failure like the 
shared job server or single dispatcher isn't acceptable.
3. We don't want to pay the serialization cost of going to a filesystem like 
Tachyon, due to latency.

The "request" is then whether we can make the driver program fault tolerant 
or somehow have the state of the active RDD's and execution in an ongoing Spark 
context stored persistently. Unfortunately, the RDD metadata and execution 
state in the driver is arbitrary state (a driver is just a Java process), and 
it's not possible to take any user program in Spark and make this state 
entirely recoverable on process failure. If we started to go down this path, 
we'd need to do things like define a standard serialization format for the RDD 
data, a global namespace for RDD's, persistence, etc. And then you're building 
a filesystem.

The real solution here is that applications need to provide resiliency 
themselves by architecting in a way where they either entirely keep state in a 
filesystem (and dispatch requests by reading from persistent storage), or they 
use caching in a way where that cache is soft state and can be recovered from 
some persistent storage if there is a failure, maybe with some temporary 
performance degradation. The Spark ecosystem already has H/A for some 
components, such as Streaming, and we achieved that by exploiting specifics of 
the architecture of a streaming program and allowing them to recover from 
checkpoints, etc.
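
The "cache as soft state" approach described above can be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not Spark code: `SoftStateCache` is a hypothetical class, a JSON file stands in for the persistent storage, and `simulate_failure` stands in for losing the process that held the cache.

```python
import json
import os
import tempfile


class SoftStateCache:
    """In-memory cache whose contents can always be rebuilt from durable storage."""

    def __init__(self, path):
        self.path = path    # durable source of truth (a JSON file in this sketch)
        self._data = None   # soft state: may vanish, is rebuilt on demand
        self.reloads = 0    # how many times the cache had to be (re)built

    def _load(self):
        with open(self.path) as f:
            self._data = json.load(f)
        self.reloads += 1

    def get(self, key):
        if self._data is None:  # cold start, or state lost after a failure
            self._load()
        return self._data.get(key)

    def put(self, key, value):
        # Write-through: update memory, then replace the durable copy.
        if self._data is None:
            self._load()
        self._data[key] = value
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self._data, f)
        os.replace(tmp, self.path)

    def simulate_failure(self):
        self._data = None  # drop the soft state, keep the durable copy


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "store.json")
        with open(path, "w") as f:
            json.dump({"LHR-JFK": 420}, f)
        cache = SoftStateCache(path)
        first = cache.get("LHR-JFK")   # cold read triggers a load
        cache.put("JFK-SFO", 310)      # written through to durable storage
        cache.simulate_failure()       # soft state is gone
        second = cache.get("JFK-SFO")  # rebuilt from durable storage
        return first, second, cache.reloads
```

The reload after the failure is where the "temporary performance degradation" shows up: the first read pays for rebuilding the cache.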

In the future there will be a few major changes in Spark that make this whole 
thing much easier. The first is that we'll likely write extremely fast 
serializers for RDD's that have structure (SchemaRDD/DataFrame). Along with 
in-memory filesystems and formats that provide predicate pushdown and other 
optimizations, this will likely close the gap substantially between latency 
experienced for on-heap RDD's and those in persistent storage. Second, we may 
add H/A in other specific components of Spark, such as the JDBC server, where 
we can exploit specifics of the user-facing interface to allow fast 
recoverability. Then applications that write against those API's do not need to 
reason about H/A at all.

Hopefully that was a helpful perspective!


[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291818#comment-14291818
 ] 

Robert Stupp commented on SPARK-2389:
-

[~srowen] yes, the problem is that drivers cannot share RDDs.
IMHO there are a lot of valid scenarios that can benefit from multiple drivers 
using shared RDDs.




[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291804#comment-14291804
 ] 

Sean Owen commented on SPARK-2389:
--

Yes, makes sense. Maxing out one driver isn't an issue since you can have many 
drivers (or push work into the cluster). The issue is really that each driver 
then has its own RDDs, and if you need 100s of drivers to keep up, that just 
won't work. (Although then I'd question how so much work is being done on the 
Spark driver?)

The redundancy of all those RDDs is what HDFS caching and Tachyon could in 
theory help with, although those help share data outside Spark. Whether 
that works for a particular use case right now is a different question, 
although I suspect it makes more sense to make those work than to start yet 
another solution.

What you are describing -- mutating lots of shared in-memory state -- doesn't 
sound like a problem Spark helps solve per se. That is, it doesn't sound like 
work that has to live in a Spark driver program, even if it needs to ask a 
Spark driver-based service for some results. Naturally you know your problem 
better than I, but I am wondering if the answer here isn't just using Spark 
differently, for what it's for.




[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291798#comment-14291798
 ] 

Robert Stupp commented on SPARK-2389:
-

bq. fault tolerance when he mentions scalability

both play well together in a stateless application ;)




[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291797#comment-14291797
 ] 

Robert Stupp commented on SPARK-2389:
-

bq. That aside, why doesn't it scale?

Simply because it's just a single Spark client. If that machine's at its limit 
for whatever reason (VM memory, OS resources, CPU, network, ...), that's it.

Sure, you can run multiple drivers - but each has its own, private set of data.

IMO separate preloading is nice for some applications. But data is usually not 
immutable. For example:
* Imagine an application that provides offers for flights worldwide. It's a 
huge amount of data and a huge amount of processing. It cannot simply be 
preloaded - ticket prices vary from minute to minute based on booking 
status etc.
* Overall data set is quite big
* Overall load is too big for a single driver to handle - imagine thousands of 
offer requests per second
* Failure of a single driver is an absolute no-go
* All clients have to access the same set of data
* Preloading is just impossible during runtime (just at initial deployment)

So - a suitable approach would be to have:
* a Spark cluster holding all the RDDs and doing all offer and booking related 
operations
* a set of Spark clients to "abstract" Spark from the rest of the application
* a huge number of non-uniform frontend clients (could be web app servers, rich 
clients, SOAP / REST frontends)
* everything (except the data) stateless
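
The layering proposed above can be sketched as a front-end that holds no application data and fails over between several interchangeable clients. This is a plain-Python illustration only: `StatelessDispatcher`, `BackendDown` and `make_client` are hypothetical names, and a shared dict stands in for the data held by the Spark cluster. The only thing the dispatcher keeps is a round-robin cursor.

```python
import itertools


class BackendDown(Exception):
    """Raised by a client that cannot serve the request."""


class StatelessDispatcher:
    """Front-end that holds no data and spreads requests over several clients."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = itertools.cycle(range(len(self.backends)))

    def handle(self, request):
        # Try each client at most once, starting at the round-robin position.
        start = next(self._next)
        for offset in range(len(self.backends)):
            backend = self.backends[(start + offset) % len(self.backends)]
            try:
                return backend(request)
            except BackendDown:
                continue  # fail over to the next client
        raise BackendDown("no backend available")


def make_client(name, shared_offers, healthy=True):
    """Build a client that answers from the shared data set, or fails if unhealthy."""
    def client(route):
        if not healthy:
            raise BackendDown(name)
        return name, shared_offers[route]
    return client


def demo():
    offers = {"LHR-JFK": 420}  # stands in for data held by the cluster
    dispatcher = StatelessDispatcher([
        make_client("client-a", offers),
        make_client("client-b", offers, healthy=False),
        make_client("client-c", offers),
    ])
    return [dispatcher.handle("LHR-JFK") for _ in range(3)]
```

Because every client reads the same data set, losing one client only shifts its requests to the others. This is exactly the property that per-driver private RDDs do not provide.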




[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Murat Eken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291795#comment-14291795
 ] 

Murat Eken commented on SPARK-2389:
---

[~sowen], I think Robert is talking about fault tolerance when he mentions 
scalability. Anyway, as I mentioned in my original comment, Tachyon is not an 
option, at least for us, due to interprocess serialization/deserialization 
costs. We haven't tried HDFS, but I would be surprised if that 
performed differently.
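
The serialization cost mentioned above comes from the process boundary itself. In the rough sketch below, `pickle` stands in for whatever serializer an external store would use (it is not Tachyon's or HDFS's actual protocol), and the function names are hypothetical. In-process access hands back a reference, while cross-process access has to produce a full copy.

```python
import pickle


def share_in_process(cache, key):
    """Same-process access: the caller gets a reference, no copy is made."""
    return cache[key]


def share_across_processes(cache, key):
    """Cross-process access modelled as a serialize/deserialize round trip."""
    wire_bytes = pickle.dumps(cache[key])   # serialization on the owning side
    copy = pickle.loads(wire_bytes)         # deserialization on the reading side
    return copy, len(wire_bytes)


def demo():
    cache = {"prices": list(range(1000))}
    local = share_in_process(cache, "prices")
    remote, size = share_across_processes(cache, "prices")
    return (
        local is cache["prices"],    # same object, nothing copied
        remote == cache["prices"],   # equal content...
        remote is cache["prices"],   # ...but a distinct copy
        size > 0,                    # bytes that had to cross the boundary
    )
```

Both the byte count and the copy grow with the data, so the overhead is paid on every read that crosses the boundary.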




[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291788#comment-14291788
 ] 

Sean Owen commented on SPARK-2389:
--

Yes, the SPOF problem makes sense. It doesn't seem to be what this JIRA was 
about though, which seems to be what the jobserver-style approach addresses.

That aside, why doesn't it scale? Because of work that needs to be done on the 
driver? You can of course still run a bunch of drivers, just not one per client.

The preloading cache issue is what off-heap caching in Tachyon or HDFS is 
supposed to ameliorate.




[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291764#comment-14291764
 ] 

Robert Stupp commented on SPARK-2389:
-

[~srowen] that *one long-running* Spark app is the problem. It's a SPOF and it 
does not scale.

It would be great to have *some* Spark apps sharing the same data set (thus 
reducing load on backend data stores and benefiting from RDD caching).
The "real" clients could then talk (via REST or whatever the "real" application 
does) to these Spark apps.
I don't have anything in mind regarding "shared sessions" - I'd like to 
"just" have multiple Spark clients access the same RDDs.

As [~meken] points out, you have to preload the cache first (which is 
expensive) before you can use it.
(I don't mind having that cached data considered immutable.)




[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Murat Eken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291755#comment-14291755
 ] 

Murat Eken commented on SPARK-2389:
---

Yes [~sowen], it's about HA for the driver. Our approach is to have a single 
app that's responsible for initializing the cache at start up (quite expensive) 
and then serve queries on that cached data (very fast).

When you mention "N front-ends talking to a process built around one long 
running Spark app that can be done right now", are you referring to something 
like the spark-jobserver (or any alternative) I mentioned? If yes, the problem 
with that is the single point of failure, as we're moving that from the driver 
to the jobserver instance. Or is there something else we've missed?




[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291737#comment-14291737
 ] 

Sean Owen commented on SPARK-2389:
--

Why can't N front-ends talk to a process built around one long-running Spark 
app? I think that's what the OP is talking about, and it can be done right now. 
One Spark app having many contexts doesn't quite make sense, as an app is a 
SparkContext.

But [~meken] are you really talking about HA for the driver?




[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Murat Eken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291724#comment-14291724
 ] 

Murat Eken commented on SPARK-2389:
---

+1. We're using a Spark cluster as a real-time query engine, and unfortunately 
we're running into the same issues Robert mentions. Although Spark provides 
plenty of ways to make its cluster fault-tolerant and resilient, we need the 
same resilience for the front layer from which the Spark cluster is accessed. 
That means multiple instances of Spark clients, and hence multiple 
SparkContexts from those clients, connecting to the same cluster with the same 
computing power.

Performance is crucial for us, hence our choice of caching the data in memory 
and utilizing the full hardware resources in the executors. Alternative 
solutions, such as using Tachyon for the data and restarting executors for 
each query, just don't give the same performance. We're looking into using 
https://github.com/spark-jobserver/spark-jobserver, but that's not a proper 
solution, as we would still have the jobserver as a single point of failure in 
our setup.

I'd appreciate it if a Spark developer could give some information about the 
feasibility of this change request; if this turns out to be difficult or even 
impossible due to the choices made in the architecture, it would be good to 
know that so that we can consider our alternatives.
