Greetings,

We have an analytics workflow system in production.  The system is built in
Java and uses other services (including Apache Solr).  It works fine under a
moderate data/processing load.  However, once the load goes beyond a certain
limit (e.g., more than 10 million messages/documents), delays start to show
up.  This is clearly a scalability issue, and the Hadoop ecosystem,
especially Spark, can help in this situation.  The simplest approach would be
to rebuild the entire workflow using Spark, Kafka and other components.
However, we decided to tackle the problem in a couple of phases.  In the
first phase we identified a few pain points (the areas where performance
suffers most) and have started building corresponding mini Spark applications
to take advantage of parallelism.

For now, my question is: how can we instantiate/start our mini Spark jobs
programmatically (e.g., from a Java application)?  The only option I see in
the documentation is to run the jobs from the command line using
spark-submit.  Any insight in this area would be highly appreciated.
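
To make the requirement concrete, here is a rough sketch of what we would
like to write from our Java application.  I am assuming an API along the
lines of org.apache.spark.launcher.SparkLauncher (in the spark-launcher
module); the method and constant names may not be exact, and the paths, jar
and class names below are just placeholders for one of our mini jobs:

    import org.apache.spark.launcher.SparkLauncher;

    public class MiniJobRunner {
        public static void main(String[] args) throws Exception {
            // Launch one of our mini Spark applications as a child process,
            // the same way spark-submit would, but driven from Java code.
            Process job = new SparkLauncher()
                .setSparkHome("/opt/spark")                        // placeholder path
                .setAppResource("/opt/jobs/solr-indexer-job.jar")  // placeholder jar
                .setMainClass("com.example.jobs.SolrIndexerJob")   // placeholder class
                .setMaster("yarn-client")                          // or "local[*]", "spark://..."
                .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
                .addAppArgs("--input", "/data/messages")           // placeholder args
                .launch();

            int exitCode = job.waitFor();  // block until the mini job finishes
            System.out.println("Mini job finished with exit code " + exitCode);
        }
    }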

In the longer term, I want to build a collection of mini Spark applications
(each performing one specific task, similar to web services) and
architect/design bigger Spark-based applications which in turn will call
these mini Spark applications programmatically.  It is possible that the
Spark community has already started building such a collection of services.
Can you please provide some information/tips/best practices in this regard?
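
For illustration, the kind of orchestration I have in mind looks roughly
like the sketch below (again only a sketch, reusing the hypothetical
SparkLauncher approach above; the jar and class names are placeholders):

    import org.apache.spark.launcher.SparkLauncher;

    public class WorkflowOrchestrator {
        public static void main(String[] args) throws Exception {
            // Start two independent mini jobs in parallel, then wait for
            // both, so the bigger application can compose them like services.
            Process dedup = new SparkLauncher()
                .setAppResource("/opt/jobs/dedup-job.jar")     // placeholder
                .setMainClass("com.example.jobs.DedupJob")     // placeholder
                .setMaster("yarn-client")
                .launch();

            Process enrich = new SparkLauncher()
                .setAppResource("/opt/jobs/enrich-job.jar")    // placeholder
                .setMainClass("com.example.jobs.EnrichJob")    // placeholder
                .setMaster("yarn-client")
                .launch();

            // The calling application decides how to combine/sequence results.
            System.out.println("dedup exit code: " + dedup.waitFor());
            System.out.println("enrich exit code: " + enrich.waitFor());
        }
    }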

Cheers!
Ajay