Greetings,

We have an analytics workflow system in production. The system is built in Java and uses other services, including Apache Solr. It works fine under a moderate data/processing load, but once the load goes beyond a certain limit (e.g., more than 10 million messages/documents), delays start to show up. This is clearly a scalability issue, and the Hadoop ecosystem, especially Spark, can be handy in this situation. The simplest approach would be to rebuild the entire workflow using Spark, Kafka, and other components, but we decided instead to tackle the problem in a couple of phases. In the first phase we identified a few pain points (the areas where performance suffers most) and have started building corresponding mini Spark applications, so as to take advantage of parallelism.
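To give a flavour of what I mean by a "mini" application, here is a simplified sketch of one (the class name, paths, and per-document logic are placeholders for illustration, not our actual code):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class EnrichDocumentsJob {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("EnrichDocumentsJob");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // One serialized document per line (placeholder input path).
            JavaRDD<String> docs = sc.textFile("hdfs:///analytics/documents");

            // Placeholder for the real per-document work; the point is that
            // it runs in parallel across the cluster.
            JavaRDD<String> processed = docs.map(doc -> doc.trim());

            processed.saveAsTextFile("hdfs:///analytics/documents-processed");
            sc.stop();
        }
    }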
For now my question is: how can we instantiate/start these mini Spark jobs programmatically (e.g., from our Java applications)? The only option I see in the documentation is to run jobs from the command line using spark-submit. Any insight in this area would be highly appreciated.

In the longer term, I want to build up a collection of mini Spark applications (each performing one specific task, similar to web services) and then architect/design bigger Spark-based applications which in turn call these mini applications programmatically. There is a possibility that the Spark community has already started building such a collection of services. Can you please share any information/tips/best practices in this regard?

Cheers!
Ajay
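P.S. To make the question concrete, this is roughly the kind of launch code we are hoping for, e.g., via the org.apache.spark.launcher.SparkLauncher API if our Spark version supports it (a minimal sketch; the Spark home, jar path, main class, master, and argument below are all placeholders):

    import org.apache.spark.launcher.SparkLauncher;

    public class MiniJobRunner {
        public static void main(String[] args) throws Exception {
            // Launch the packaged mini application as a child process,
            // roughly equivalent to invoking spark-submit by hand.
            Process job = new SparkLauncher()
                .setSparkHome("/opt/spark")                   // placeholder
                .setAppResource("/apps/enrich-documents.jar") // placeholder jar
                .setMainClass("EnrichDocumentsJob")           // placeholder class
                .setMaster("yarn-cluster")                    // or spark://..., local[*]
                .addAppArgs("hdfs:///analytics/documents")    // placeholder arg
                .launch();

            int exitCode = job.waitFor();
            System.out.println("Mini Spark job exited with code " + exitCode);
        }
    }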