Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm

Yanbo Liang Tue, 05 Sep 2017 08:04:07 -0700

You are right, native Spark MLlib CrossValidation can't run *different*
algorithms in parallel.
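
One workaround is to submit each algorithm from its own driver-side thread,
since Spark's scheduler accepts jobs submitted concurrently from separate
threads of one application. The sketch below shows only that pattern in plain
Python and makes no Spark calls. The two toy classifiers, `k_fold_accuracy`,
and `run_in_parallel` are hypothetical names used for illustration, standing
in for real estimators such as Naive Bayes or Random Forest.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority_classifier(train):
    """Toy stand-in for a real algorithm: always predict the most common label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def threshold_classifier(train):
    """Toy stand-in: split at the midpoint between the two class means (labels 0/1)."""
    means = {}
    for cls in (0, 1):
        xs = [x for x, y in train if y == cls]
        means[cls] = sum(xs) / len(xs) if xs else 0.0
    mid = (means[0] + means[1]) / 2
    high = 1 if means[1] >= means[0] else 0
    return lambda x: high if x >= mid else 1 - high


def k_fold_accuracy(fit, data, k=10):
    """Mean accuracy of `fit` over k folds; fold i holds out rows i, i+k, i+2k, ..."""
    scores = []
    for i in range(k):
        test = data[i::k]
        train = [row for j, row in enumerate(data) if j % k != i]
        predict = fit(train)
        scores.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(scores) / k


def run_in_parallel(algorithms, data, k=10):
    """Cross-validate every algorithm concurrently, one thread per algorithm."""
    with ThreadPoolExecutor(max_workers=len(algorithms)) as pool:
        futures = {name: pool.submit(k_fold_accuracy, fit, data, k)
                   for name, fit in algorithms.items()}
        return {name: future.result() for name, future in futures.items()}


# Synthetic data (an assumption for the demo): x in 0..99, label 1 when x >= 50.
data = [(float(x), int(x >= 50)) for x in range(100)]
results = run_in_parallel(
    {"majority": majority_classifier, "threshold": threshold_classifier}, data)
```

In a real Spark job, each submitted task would instead fit and evaluate an
MLlib estimator on a shared, cached DataFrame, so the threads mostly wait on
the cluster rather than compete for the Python interpreter. That part is left
out here and would need to be checked against your Spark version.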


Thanks
Yanbo

On Tue, Sep 5, 2017 at 10:56 PM, Timsina, Prem <[email protected]>
wrote:

> Hi Yanbo,
>
> Thank you, I very much appreciate your help.
>
> For the current use case, the data can fit on a single node, so
> spark-sklearn seems to be a good choice.
>
>
>
> *I have one question regarding this:*
>
> *“If no, Spark MLlib provide CrossValidation which can run multiple
> machine learning algorithms parallel on distributed dataset and do
> parameter search.
> FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation”*
>
> If I understand correctly, it can run parameter search for
> cross-validation in parallel.
>
> However, Spark currently does not support running multiple algorithms
> (like Naïve Bayes, Random Forest, etc.) in parallel. Am I correct?
>
> If not, could you please point me to some resources where multiple
> algorithms have been run in parallel?
>
>
>
> Thank you very much. It is a great help; I will try spark-sklearn.
>
> Prem
>
> *From: *Yanbo Liang <[email protected]>
> *Date: *Tuesday, September 5, 2017 at 10:40 AM
> *To: *Patrick McCarthy <[email protected]>
> *Cc: *"Timsina, Prem" <[email protected]>, "[email protected]" <
> [email protected]>
> *Subject: *Re: Apache Spark: Parallelization of Multiple Machine Learning
> Algorithm
>
>
>
> Hi Prem,
>
>
>
> How large is your dataset? Can it fit on a single node?
>
> If no, Spark MLlib provides CrossValidation, which can run multiple machine
> learning algorithms in parallel on a distributed dataset and do parameter
> search. FYI:
> https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
>
> If yes, you can also try spark-sklearn, which can distribute the training
> of multiple models (single-node training with sklearn) across a
> cluster and do parameter search. FYI:
> https://github.com/databricks/spark-sklearn
>
>
>
> Thanks
>
> Yanbo
>
>
>
> On Tue, Sep 5, 2017 at 9:56 PM, Patrick McCarthy <[email protected]>
> wrote:
>
> You might benefit from watching this JIRA issue -
> https://issues.apache.org/jira/browse/SPARK-19071
>
>
>
> On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem <[email protected]>
> wrote:
>
> Is there a way to parallelize multiple ML algorithms in Spark? My use case
> is something like this:
>
> A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random
> Forest, etc.) in parallel.
>
> 1) Validate each algorithm using 10-fold cross-validation.
>
> B) Feed the output of step A) into a second-layer machine learning
> algorithm.
>
> My question is:
>
> Can we run the multiple machine learning algorithms in step A in parallel?
>
> Can we do cross-validation in parallel? Like, run 10 iterations of Naive
> Bayes training in parallel?
>
>
>
> I was not able to find any way to run different algorithms in parallel,
> and it seems cross-validation also cannot be done in parallel.
>
> I appreciate any suggestion to parallelize this use case.
>
>
>
> Prem
>
>
>
>
>
