Hi Yanbo,
Thank you, I very much appreciate your help.
For the current use case, the data can fit on a single node, so 
spark-sklearn seems to be a good choice.

I have one question regarding this:
“If not, Spark MLlib provides CrossValidation, which can run multiple machine 
learning algorithms in parallel on a distributed dataset and do parameter search. 
FYI: 
https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation”
If I understand correctly, it can run the parameter search for cross-validation 
in parallel.
However, Spark does not currently support running multiple algorithms (like 
Naïve Bayes, Random Forest, etc.) in parallel. Am I correct?
If not, could you please point me to some resources where multiple algorithms 
have been run in parallel?
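
To make the question concrete, here is a minimal sketch of what "multiple algorithms in parallel" could look like as a driver-side thread pool. It uses only the Python standard library; the two toy "algorithms" (fit_majority, fit_threshold), the data, and all other names are hypothetical stand-ins, not Spark MLlib or spark-sklearn APIs. In plain Python the threads mainly illustrate the pattern; the assumption is that on a Spark driver each thread would submit its own training job.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data: (feature, label) pairs standing in for a real training set.
rows = [(x, 1 if x >= 5 else 0) for x in range(10)]

def fit_majority(train):
    """Hypothetical baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    winner = max(set(labels), key=labels.count)
    return lambda x: winner

def fit_threshold(train):
    """Hypothetical model: predict 1 when x is above the mean training feature."""
    cutoff = sum(x for x, _ in train) / len(train)
    return lambda x: 1 if x > cutoff else 0

def accuracy(model, data):
    """Fraction of rows in data that the model labels correctly."""
    return sum(1 for x, y in data if model(x) == y) / len(data)

def train_and_score(fit, train, test):
    """Fit one algorithm on train and return its accuracy on test."""
    return accuracy(fit(train), test)

algorithms = {"majority": fit_majority, "threshold": fit_threshold}
train, test = rows[::2], rows[1::2]

# Submit one independent training task per algorithm, then collect the scores.
with ThreadPoolExecutor(max_workers=len(algorithms)) as pool:
    futures = {name: pool.submit(train_and_score, fit, train, test)
               for name, fit in algorithms.items()}
    scores = {name: future.result() for name, future in futures.items()}
```

The scores dictionary maps each algorithm name to its held-out accuracy, which is the kind of per-algorithm output that a second-layer model could then consume.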

Thank you very much. This is a great help; I will try spark-sklearn.
Prem




From: Yanbo Liang <[email protected]>
Date: Tuesday, September 5, 2017 at 10:40 AM
To: Patrick McCarthy <[email protected]>
Cc: "Timsina, Prem" <[email protected]>, "[email protected]" 
<[email protected]>
Subject: Re: Apache Spark: Parallelization of Multiple Machine Learning 
ALgorithm

Hi Prem,

How large is your dataset? Can it fit on a single node?
If not, Spark MLlib provides CrossValidation, which can run multiple machine 
learning algorithms in parallel on a distributed dataset and do parameter search. 
FYI: 
https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
If yes, you can also try spark-sklearn, which can distribute the training of 
multiple models (single-node training with sklearn) across a cluster and do 
parameter search. FYI: 
https://github.com/databricks/spark-sklearn
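
As a rough illustration of the "parameter search" idea mentioned above, the sketch below evaluates every combination in a small grid concurrently and keeps the best one. It is plain standard-library Python, not the CrossValidation or spark-sklearn API; the toy scoring rule, the two hyperparameters (scale, cutoff), and the validation data are all assumptions made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical validation set of (feature, label) pairs.
validation = [(x, 1 if x >= 6 else 0) for x in range(12)]

def score(params):
    """Accuracy of a toy rule: predict 1 when scale * x exceeds cutoff."""
    scale, cutoff = params
    hits = sum(1 for x, y in validation
               if (1 if scale * x > cutoff else 0) == y)
    return hits / len(validation)

# Every combination of the two hypothetical hyperparameters.
grid = list(product([1, 2], [3, 5, 9]))

# Score all grid points concurrently; pool.map preserves the grid order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(zip(grid, pool.map(score, grid)))

# Keep the parameter combination with the highest accuracy.
best_params, best_score = max(results, key=lambda item: item[1])
```

Each grid point is independent of the others, which is what makes this kind of search easy to spread across workers.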

Thanks
Yanbo

On Tue, Sep 5, 2017 at 9:56 PM, Patrick McCarthy 
<[email protected]> wrote:
You might benefit from watching this JIRA issue: 
https://issues.apache.org/jira/browse/SPARK-19071

On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem 
<[email protected]> wrote:
Is there a way to parallelize multiple ML algorithms in Spark? My use case is 
something like this:
A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random Forest, 
etc.) in parallel.
1) Validate each algorithm using 10-fold cross-validation.
B) Feed the output of step A) into a second-layer machine learning algorithm.
My questions are:
Can we run the multiple machine learning algorithms in step A in parallel?
Can we do cross-validation in parallel? For example, can we run the 10 
iterations of Naive Bayes training in parallel?

I was not able to find any way to run the different algorithms in parallel, and 
it seems cross-validation also cannot be done in parallel.
I would appreciate any suggestions for parallelizing this use case.
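
For reference, the 10-fold part of this use case can be sketched as follows: split the rows into 10 folds, evaluate each fold as an independent task, and average the scores. This is a standard-library Python sketch under stated assumptions, not Spark code; the majority-class "algorithm" is a hypothetical stand-in for Naive Bayes, and the function names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k (train, test) index pairs."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    pairs = []
    for i, test in enumerate(folds):
        # Training rows are all rows that are not in the held-out fold.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        pairs.append((train, test))
    return pairs

def majority_class_accuracy(labels, train_idx, test_idx):
    """Toy stand-in for a real algorithm: predict the most common train label."""
    train_labels = [labels[i] for i in train_idx]
    prediction = max(set(train_labels), key=train_labels.count)
    hits = sum(1 for i in test_idx if labels[i] == prediction)
    return hits / len(test_idx)

def cross_validate_parallel(labels, k=10, workers=4):
    """Evaluate all k folds as concurrent tasks and return the mean accuracy."""
    pairs = k_fold_indices(len(labels), k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(
            lambda pair: majority_class_accuracy(labels, *pair), pairs))
    return sum(scores) / len(scores)

# Hypothetical label column: 70 positive rows followed by 30 negative rows.
labels = [1] * 70 + [0] * 30
mean_accuracy = cross_validate_parallel(labels, k=10)
```

The folds do not depend on each other, so each (train, test) pair can in principle be handed to a separate worker.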

Prem

