Hi,

I have a couple of use cases for Apache Spark applications/scripts,
generally of the following form:

   1. *General ETL use case* - more specifically, a transformation of a
Cassandra column family containing many events (think event sourcing) into
various aggregated column families.

   2. *Streaming use case* - realtime analysis of the events as they arrive
in the system.

For *(1)*, I'll need to kick off the Spark application periodically.

For *(2)*, I'll just kick off the long-running Spark Streaming process at
boot time and let it run.

/(Note - I'm using Spark Standalone as the cluster manager, so no YARN or
Mesos.)/

I'm trying to figure out the most common / best practice deployment
strategies for Spark applications.

So far the options I can see are:

1. *Deploying my program as a jar, and running the various tasks with
spark-submit* - which seems to be the approach recommended in the Spark
docs. Some thoughts about this strategy:

   * how do you start/stop tasks - just with simple bash scripts?
   * how is scheduling managed - simply with cron?
   * is there any resilience? (e.g. who reschedules the jobs if the driver
server dies?)
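
For reference, here's a minimal sketch of what I imagine the wrapper script
for (1) might look like (the master URL, class name, and jar path are all
made-up placeholders; the `--supervise` flag is the standalone-mode option
that asks the master to restart the driver if it dies, which might partly
answer the resilience question):

```shell
#!/usr/bin/env bash
# Sketch of a launcher for the periodic ETL job in (1).
# MASTER, MAIN_CLASS and APP_JAR are illustrative placeholders.
MASTER="spark://spark-master:7077"
MAIN_CLASS="com.example.EventAggregationJob"
APP_JAR="/opt/jobs/etl-assembly.jar"

CMD=(spark-submit
  --class "$MAIN_CLASS"
  --master "$MASTER"
  --deploy-mode cluster  # run the driver inside the cluster, not on this box
  --supervise            # standalone mode: restart the driver if it fails
  "$APP_JAR")

# Print the full command; drop the echo to actually submit the job.
echo "${CMD[@]}"
```

A crontab entry could then handle the scheduling, e.g. hourly (assuming the
script above is saved as /opt/jobs/run-etl.sh):
`0 * * * * /opt/jobs/run-etl.sh >> /var/log/etl.log 2>&1`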
   
2. *Creating a separate webapp as the driver program.*

   * creates a spark context programmatically to talk to the spark cluster
   * allowing users to kick off tasks through the http interface
   * using Quartz (for example) to manage scheduling; the webapp could run
as a cluster with ZooKeeper leader election for resilience

3. *Spark job server (https://github.com/ooyala/spark-jobserver)*

   * I don't think there's much benefit over (2) for me, as I don't (yet)
have many teams and projects talking to Spark, and would still need some app
to talk to the job server anyway
   * no scheduling built in, as far as I can see

I'd like to understand the general consensus on a simple but robust
deployment strategy; I haven't been able to determine one by trawling the
web so far.

Thanks very much!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-application-deployment-best-practices-tp23036.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
