Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm
Hi Prem,

Spark does support, to some extent, trying different algorithms in a single CrossValidator, but it's not really obvious. You basically need to make a Pipeline and build a ParamGrid with the different algorithms as stages. Here is a simple example:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val pipeline = new Pipeline()

    // Each grid point swaps in a different single-stage pipeline.
    val paramGrid = new ParamGridBuilder()
      .addGrid(pipeline.stages, Array(Array[PipelineStage](dt), Array[PipelineStage](lr)))
      .build()  // setEstimatorParamMaps expects the built Array[ParamMap]

    // An evaluator must also be set (setEvaluator) before calling cv.fit.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)

Adding more params to the grid can get a little complicated, though. I discuss it in detail here: https://bryancutler.github.io/cv-pipelines/

As Patrick McCarthy mentioned, you might want to follow SPARK-19071, specifically https://issues.apache.org/jira/browse/SPARK-19357, which parallelizes model evaluation.

Bryan

On Tue, Sep 5, 2017 at 8:02 AM, Yanbo Liang wrote:
> You are right, the native Spark MLlib CrossValidator can't run different
> algorithms in parallel.
>
> Thanks
> Yanbo
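A rough sketch of how the pipeline-stages grid in the example above might be extended so that each algorithm gets its own hyperparameters: build one grid per algorithm, pin `pipeline.stages` to that algorithm with `baseOn`, and concatenate the resulting arrays. This assumes Spark 2.3 or later on the classpath, since `setParallelism` only exists from that release (it came out of SPARK-19357). The object name, the toy data, the grid values, and the fold count are made up for illustration.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object MultiAlgorithmCv {
  // One grid per algorithm; each grid pins pipeline.stages to a single estimator,
  // so an algorithm's hyperparameters are only combined with that algorithm.
  def buildGrid(pipeline: Pipeline,
                dt: DecisionTreeClassifier,
                lr: LogisticRegression): Array[ParamMap] = {
    val dtGrid = new ParamGridBuilder()
      .baseOn(pipeline.stages -> Array[PipelineStage](dt))
      .addGrid(dt.maxDepth, Array(2, 5))            // illustrative values
      .build()
    val lrGrid = new ParamGridBuilder()
      .baseOn(pipeline.stages -> Array[PipelineStage](lr))
      .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))  // illustrative values
      .build()
    dtGrid ++ lrGrid
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multi-algorithm-cv")
      .getOrCreate()

    // Toy binary-classification data, invented for this sketch.
    val rows = (0 until 200).map { i =>
      val label = (i % 2).toDouble
      (label, Vectors.dense(label + (i % 7) * 0.1, (i % 5).toDouble))
    }
    val data = spark.createDataFrame(rows).toDF("label", "features")

    val pipeline = new Pipeline()
    val grid = buildGrid(pipeline, new DecisionTreeClassifier(), new LogisticRegression())

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(grid)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setNumFolds(3)
      .setParallelism(2)  // Spark 2.3+: fit up to 2 candidate models at once

    val model = cv.fit(data)
    // One averaged metric per grid point, in the same order as `grid`.
    println(model.avgMetrics.mkString(", "))
    spark.stop()
  }
}
```

Keeping the grids separate avoids the cross product that a single `ParamGridBuilder` would generate, where decision-tree depths would be paired with logistic-regression candidates that ignore them.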
Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm
You are right, the native Spark MLlib CrossValidator can't run different algorithms in parallel.

Thanks
Yanbo

On Tue, Sep 5, 2017 at 10:56 PM, Timsina, Prem wrote:
> If I understand correctly, it can run parameter search for
> cross-validation in parallel.
>
> However, currently Spark does not support running multiple algorithms
> (like Naïve Bayes, Random Forest, etc.) in parallel. Am I correct?
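One workaround that may be worth considering when a single tuning run does not cover several algorithms: start each algorithm's fit from its own thread on the driver, since Spark's scheduler is thread-safe and accepts jobs submitted concurrently. The sketch below shows only that driver-side pattern with plain Scala Futures and no Spark dependency. `trainAndScore`, the algorithm names, and the returned scores are hypothetical stand-ins for real fit-and-evaluate calls.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelAlgorithms {
  // Hypothetical stand-in: a real version would fit the named estimator on a
  // DataFrame and return its cross-validated metric.
  def trainAndScore(name: String): Double = name.length.toDouble

  // Runs one task per algorithm on a fixed-size thread pool and returns
  // (name, score) pairs in the same order as the input.
  def scoreAll(names: Seq[String], threads: Int): Seq[(String, Double)] = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = names.map(n => Future(n -> trainAndScore(n)))
      Await.result(Future.sequence(futures), 10.minutes)
    } finally pool.shutdown()
  }

  def main(args: Array[String]): Unit =
    scoreAll(Seq("NaiveBayes", "RandomForest", "ANN"), threads = 3)
      .foreach { case (name, score) => println(s"$name -> $score") }
}
```

The pool size caps how many fits are in flight at once. Whether the concurrent jobs actually overlap on the cluster still depends on the resources available to the application.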
Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm
Hi Yanbo,

Thank you, I very much appreciate your help.

For the current use case, the data can fit on a single node, so spark-sklearn seems to be a good choice.

I have one question regarding this:

“If not, Spark MLlib provides CrossValidator, which can run multiple machine learning algorithms in parallel on a distributed dataset and do parameter search. FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation”

If I understand correctly, it can run the parameter search for cross-validation in parallel. However, Spark currently does not support running multiple algorithms (like Naïve Bayes, Random Forest, etc.) in parallel. Am I correct?

If not, could you please point me to some resources where multiple algorithms have been run in parallel?

Thank you very much. It is a great help; I will try spark-sklearn.

Prem

From: Yanbo Liang
Date: Tuesday, September 5, 2017 at 10:40 AM
To: Patrick McCarthy
Cc: "Timsina, Prem", "user@spark.apache.org"
Subject: Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm

Hi Prem,

How large is your dataset? Can it fit on a single node?

If not, Spark MLlib provides CrossValidator, which can run multiple machine learning algorithms in parallel on a distributed dataset and do parameter search. FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation

If yes, you can also try spark-sklearn, which can distribute the training of multiple models (single-node training with sklearn) across a cluster and do parameter search. FYI: https://github.com/databricks/spark-sklearn

Thanks
Yanbo
Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm
Hi Prem,

How large is your dataset? Can it fit on a single node?

If not, Spark MLlib provides CrossValidator, which can run multiple machine learning algorithms in parallel on a distributed dataset and do parameter search. FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation

If yes, you can also try spark-sklearn, which can distribute the training of multiple models (single-node training with sklearn) across a cluster and do parameter search. FYI: https://github.com/databricks/spark-sklearn

Thanks
Yanbo

On Tue, Sep 5, 2017 at 9:56 PM, Patrick McCarthy wrote:
> You might benefit from watching this JIRA issue -
> https://issues.apache.org/jira/browse/SPARK-19071
Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm
You might benefit from watching this JIRA issue - https://issues.apache.org/jira/browse/SPARK-19071

On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem wrote:
> Is there a way to parallelize multiple ML algorithms in Spark? My use case
> is something like this:
>
> A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random
>    Forest, etc.) in parallel.
>    1) Validate each algorithm using 10-fold cross-validation.
> B) Feed the output of step A) into a second-layer machine learning
>    algorithm.
>
> My questions are:
>
> Can we run the multiple machine learning algorithms in step A in parallel?
>
> Can we do cross-validation in parallel? Like, run 10 iterations of Naive
> Bayes training in parallel?
>
> I was not able to find any way to run the different algorithms in parallel.
> And it seems cross-validation also cannot be done in parallel.
>
> I appreciate any suggestions for parallelizing this use case.
>
> Prem
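To make the 10-fold step in the question above concrete: k-fold cross-validation trains k models, each on k-1 folds and scored on the remaining fold. Those k fits do not depend on each other, which is why they can in principle run in parallel. Below is a small sketch in plain Scala, with row indices standing in for data. The `kFold` helper and its round-robin fold assignment are illustrative assumptions, not how Spark splits a dataset.

```scala
object KFoldSketch {
  // Splits the indices 0 until n into k (train, validation) pairs.
  // Index i lands in validation fold i % k (round-robin, no shuffling).
  def kFold(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    require(k >= 2 && k <= n, "need 2 <= k <= n")
    (0 until k).map { fold =>
      val (validation, train) = (0 until n).partition(_ % k == fold)
      (train, validation)
    }
  }

  def main(args: Array[String]): Unit =
    kFold(n = 20, k = 10).zipWithIndex.foreach { case ((train, validation), i) =>
      println(s"fold $i: train on ${train.size} rows, validate on ${validation.mkString(",")}")
    }
}
```

Each `(train, validation)` pair is a self-contained unit of work, so the folds could be handed to separate threads or jobs without coordinating between them.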