Hi Prem,

How large is your dataset? Does it fit on a single node?

If no, Spark MLlib provides CrossValidator, which can run multiple machine learning algorithms in parallel on a distributed dataset and do parameter search. FYI:
https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation

If yes, you can also try spark-sklearn, which can distribute the training of multiple models (single-node training with sklearn) across a cluster and do parameter search. FYI:
https://github.com/databricks/spark-sklearn
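To illustrate why the single-node case parallelizes so easily: every (algorithm, fold) pair is an independent task, so all of them can be dispatched at once. Below is a rough sketch in plain Python, not Spark or spark-sklearn code. The names `k_fold_indices`, `fit_majority`, `fit_threshold`, `evaluate_fold` and `cross_validate_all` are made up for illustration, the two "algorithms" are toy stand-ins for Naive Bayes, Random Forest, etc., and the thread pool only stands in for cluster workers (Python threads will not speed up CPU-bound training).

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size


# Hypothetical toy learners: each takes (x, label) rows and returns a predictor.
def fit_majority(train):
    labels = [y for _, y in train]
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda x: majority


def fit_threshold(train):
    # Cut halfway between the two class means; assumes both classes are present.
    cut = (mean(x for x, y in train if y == 0)
           + mean(x for x, y in train if y == 1)) / 2
    return lambda x: 1 if x > cut else 0


def evaluate_fold(fit, rows, train_idx, test_idx):
    """Train on one fold's training rows and return accuracy on its test rows."""
    model = fit([rows[i] for i in train_idx])
    hits = sum(model(rows[i][0]) == rows[i][1] for i in test_idx)
    return hits / len(test_idx)


def cross_validate_all(algorithms, rows, k=10, max_workers=4):
    """Submit every (algorithm, fold) pair as its own task, then average per algorithm."""
    folds = list(k_fold_indices(len(rows), k))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: [pool.submit(evaluate_fold, fit, rows, tr, te) for tr, te in folds]
            for name, fit in algorithms.items()
        }
        return {name: mean([f.result() for f in fs]) for name, fs in futures.items()}


if __name__ == "__main__":
    # Toy data: label is 1 when x >= 10.
    rows = [(x, 1 if x >= 10 else 0) for x in range(20)]
    algorithms = {"majority": fit_majority, "threshold": fit_threshold}
    print(cross_validate_all(algorithms, rows, k=10))
```

The per-algorithm scores (or the out-of-fold predictions, if you collect those instead of accuracy) are what you would feed into the second-layer model in your step B.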
Thanks
Yanbo

On Tue, Sep 5, 2017 at 9:56 PM, Patrick McCarthy <pmccar...@dstillery.com> wrote:

> You might benefit from watching this JIRA issue -
> https://issues.apache.org/jira/browse/SPARK-19071
>
> On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem <prem.tims...@mssm.edu>
> wrote:
>
>> Is there a way to parallelize multiple ML algorithms in Spark. My use
>> case is something like this:
>>
>> A) Run multiple machine learning algorithm (Naive Bayes, ANN, Random
>> Forest, etc.) in parallel.
>>
>> 1) Validate each algorithm using 10-fold cross-validation
>>
>> B) Feed the output of step A) in second layer machine learning algorithm.
>>
>> My question is:
>>
>> Can we run multiple machine learning algorithm in step A in parallel?
>>
>> Can we do cross-validation in parallel? Like, run 10 iterations of Naive
>> Bayes training in parallel?
>>
>> I was not able to find any way to run the different algorithm in
>> parallel. And it seems cross-validation also can not be done in parallel.
>>
>> I appreciate any suggestion to parallelize this use case.
>>
>> Prem
>>
>