Re: Spark 2.1 ml library scalability

Nick Pentreath Fri, 07 Apr 2017 05:23:44 -0700

It's true that CrossValidator is not parallel currently - see
https://issues.apache.org/jira/browse/SPARK-19357 and feel free to help
review.
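
To illustrate why the grid search looks like a candidate for parallelism:
each (regularization value, fold) pair is an independent fit. The sketch
below is plain Python, not Spark code and not how CrossValidator is
implemented. `fit_and_score` is a hypothetical stand-in for training and
evaluating one model, its metric is made up, and the fold count of 3 is an
assumption (it is not stated in this thread). The regularization values are
the ones from the quoted message.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Regularization values taken from the thread.
REG_PARAMS = [0.0001, 0.001, 0.005, 0.01, 0.05, 0.1]
NUM_FOLDS = 3  # assumption: the thread does not state the number of folds


def fit_and_score(reg_param, fold):
    """Hypothetical stand-in for fitting one model on one fold.

    Returns a made-up, deterministic metric so the sketch is reproducible.
    """
    return 1.0 - abs(reg_param - 0.01) - 0.001 * fold


def cross_validate(parallelism):
    """Score every (reg_param, fold) pair, up to `parallelism` at a time."""
    tasks = list(product(REG_PARAMS, range(NUM_FOLDS)))
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map preserves task order, so scores line up with tasks.
        scores = list(pool.map(lambda task: fit_and_score(*task), tasks))

    # Average the metric over folds for each regularization value.
    avg = {}
    for (reg_param, _fold), score in zip(tasks, scores):
        avg[reg_param] = avg.get(reg_param, 0.0) + score / NUM_FOLDS

    best = max(avg, key=avg.get)
    return best, len(tasks)


if __name__ == "__main__":
    best, num_fits = cross_validate(parallelism=4)
    print("fits:", num_fits, "best regParam:", best)
```

With 6 grid values and the assumed 3 folds, that is 18 independent fits.
Run one after another, each fit has to keep the cluster busy on its own,
which is hard on a 100,000-record dataset.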


On Fri, 7 Apr 2017 at 14:18 Aseem Bansal <[email protected]> wrote:

>
>    - Limited the data to 100,000 records.
>    - 6 categorical features, which go through imputation, string indexing,
>    and one-hot encoding. The maximum number of classes for a feature is 100.
>    As the data is imputed, it becomes dense.
>    - 1 numerical feature.
>    - Training LogisticRegression through CrossValidator with a grid to
>    optimize its regularization parameter over the values 0.0001, 0.001, 0.005,
>    0.01, 0.05, 0.1
>    - Using Spark's launcher API to launch it on a YARN cluster in Amazon
>    AWS.
>
> I was thinking that as CrossValidator is finding the best parameters, it
> should be able to evaluate each parameter setting independently. That sounds
> like something which could be run in parallel.
>
>
> On Fri, Apr 7, 2017 at 5:20 PM, Nick Pentreath <[email protected]>
> wrote:
>
> What is the size of training data (number examples, number features)?
> Dense or sparse features? How many classes?
>
> What commands are you using to submit your job via spark-submit?
>
> On Fri, 7 Apr 2017 at 13:12 Aseem Bansal <[email protected]> wrote:
>
> When using Spark ML's LogisticRegression, RandomForest, CrossValidator,
> etc., do we need to do anything special in our code to make it scale
> with more CPUs, or does it scale automatically?
>
> I am reading some data from S3 and using a pipeline to train a model. I am
> running the job on a Spark cluster with 36 cores and 60GB RAM, and I cannot
> see much usage. It is running, but I was expecting Spark to use all the RAM
> available and run faster. So I am wondering whether there is something in
> particular we need to take into consideration, or whether my expectations
> are wrong.
>
>
>
