Hello Spark community, I would like to write an application in Scala that is a model server. It should hold an MLlib Linear Regression model that has already been trained on a big data set, and then be able to call myLinearRegressionModel.predict() over and over and return the results.
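To make that concrete, here is roughly what I would like the client side to look like once it has the model in hand. This is only a sketch: fetchModelBytes() is a placeholder for however the serialized model actually arrives over the wire, and the feature vector is just an example.

import java.io.{ByteArrayInputStream, ObjectInputStream}

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

object ModelServer {
  // Placeholder: however the serialized model bytes actually arrive
  // (HTTP, socket, message queue, ...).
  def fetchModelBytes(): Array[Byte] = ???

  def main(args: Array[String]): Unit = {
    // Deserialize the model object that the Spark job produced.
    val in = new ObjectInputStream(new ByteArrayInputStream(fetchModelBytes()))
    val model = in.readObject().asInstanceOf[LinearRegressionModel]
    in.close()

    // No SparkContext needed here: predict() on a single Vector runs locally.
    // Example feature vector; its dimension would match the trained model.
    val features = Vectors.dense(1.0, 2.0, 3.0)
    val prediction: Double = model.predict(features)
    println(s"prediction = $prediction")
  }
}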
Now, I want this client application to submit a job to Spark and have the Spark job:

1) train its particular MLlib model, which produces a LinearRegressionModel;
2) take the resulting Scala org.apache.spark.mllib.regression.LinearRegressionModel *object*, serialize it, and return the serialized object over the wire to my calling application;
3) so that my client application receives the serialized Scala model object and can call .predict() on it over and over.

I am separating the heavy lifting of training the model from doing predictions: the client application will only do predictions, using the MLlib model it received from the Spark application.

My confusion is that the only way I know how to "submit jobs to Spark" is with the bin/spark-submit script, and then the only output I receive is stdout (as in, text). I would like my Scala application to submit the Spark model-training job programmatically, and I want the Spark application to return a *serialized MLlib model object*, not just some stdout text (a rough sketch of the job I have in mind is in the P.S. below). How can I do this?

I think my use case, separating the long-running training job out to Spark and using its libraries in another application, should be a pretty common design pattern.

Thanks!

--
Άρης Βλασακάκης
Aris Vlasakakis
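P.S. In case it helps, here is a rough sketch of the Spark-side job I have in mind. The input path and sendToClient() are placeholders; sendToClient() is exactly the part I do not know how to do, since spark-submit only gives me stdout back.

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object TrainAndShipModel {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TrainAndShipModel"))

    // Parse the big training set into LabeledPoints (path is a placeholder;
    // assume CSV lines of "label,feature1,feature2,...").
    val data = sc.textFile("hdfs:///path/to/training/data")
    val parsed = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts.head.toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
    }.cache()

    // Heavy lifting: train the model on the cluster (100 iterations here).
    val model = LinearRegressionWithSGD.train(parsed, 100)

    // Serialize the plain Scala model object with Java serialization.
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(model)
    out.close()

    // Placeholder: ship the bytes back to the calling application
    // instead of just printing text to stdout.
    sendToClient(bytes.toByteArray)

    sc.stop()
  }

  def sendToClient(payload: Array[Byte]): Unit = ???
}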