It's true that CrossValidator does not currently fit the models in its parameter grid in parallel; see https://issues.apache.org/jira/browse/SPARK-19357, and feel free to help review.
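
To make the idea from the quoted message concrete: each (regularization value, fold) pair is an independent fit, so in principle they can be scheduled concurrently. The following is only a rough sketch of that idea in plain Python, not Spark's implementation and not the patch in the JIRA. `fit_and_score` is a hypothetical stand-in that returns a made-up score, and the fold count of 3 is an assumption, since the thread does not state it.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Regularization values from the grid described in the thread.
REG_PARAMS = [0.0001, 0.001, 0.005, 0.01, 0.05, 0.1]
NUM_FOLDS = 3  # assumption: the thread does not state the fold count


def fit_and_score(reg_param, fold):
    """Hypothetical stand-in for fitting one model on one fold and scoring it.

    In a real job this would train and evaluate a model; here it returns a
    made-up score so that the sketch runs without a cluster.
    """
    return 1.0 - abs(reg_param - 0.01) - 0.001 * fold


def grid_search(max_workers):
    """Score every (reg_param, fold) pair and return the best reg_param."""
    tasks = list(product(REG_PARAMS, range(NUM_FOLDS)))
    # Each task is independent of the others, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda t: fit_and_score(*t), tasks))
    # Average the fold scores for each regularization value.
    totals = {}
    for (reg_param, _), score in zip(tasks, scores):
        totals[reg_param] = totals.get(reg_param, 0.0) + score
    averages = {reg: total / NUM_FOLDS for reg, total in totals.items()}
    return max(averages, key=averages.get)


if __name__ == "__main__":
    # max_workers=1 runs the fits one after another; a larger value
    # overlaps them. The selected parameter is the same either way.
    print(grid_search(max_workers=1), grid_search(max_workers=4))
```

With 6 grid values and 3 folds that is 18 independent fits. Each individual fit is still a distributed Spark job, so the cluster is not idle in the sequential case; running fits concurrently would mainly help when a single fit cannot keep all the cores busy.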

On Fri, 7 Apr 2017 at 14:18 Aseem Bansal <[email protected]> wrote:

> - Limited the data to 100,000 records.
> - 6 categorical features, which go through imputation, string indexing,
>   and one-hot encoding. The maximum number of classes for a feature is
>   100. As the data is imputed, it becomes dense.
> - 1 numerical feature.
> - Training Logistic Regression through CrossValidation with a grid to
>   optimize its regularization parameter over the values 0.0001, 0.001,
>   0.005, 0.01, 0.05, 0.1.
> - Using Spark's launcher API to launch it on a YARN cluster in Amazon
>   AWS.
>
> I was thinking that as CrossValidator is finding the best parameters, it
> should be able to run them independently. That sounds like something which
> could be run in parallel.
>
> On Fri, Apr 7, 2017 at 5:20 PM, Nick Pentreath <[email protected]>
> wrote:
>
>> What is the size of the training data (number of examples, number of
>> features)? Dense or sparse features? How many classes?
>>
>> What commands are you using to submit your job via spark-submit?
>>
>> On Fri, 7 Apr 2017 at 13:12 Aseem Bansal <[email protected]> wrote:
>>
>>> When using Spark ML's LogisticRegression, RandomForest, CrossValidator
>>> etc., do we need to do anything in particular in our code to make it
>>> scale with more CPUs, or does it scale automatically?
>>>
>>> I am reading some data from S3 and using a pipeline to train a model. I
>>> am running the job on a Spark cluster with 36 cores and 60GB RAM, and I
>>> cannot see much usage. It is running, but I was expecting Spark to use
>>> all the available RAM to make it faster. So that's why I was wondering
>>> whether we need to take something particular into consideration, or
>>> whether my expectations are wrong.
