[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-21 Thread petro-rudenko
Github user petro-rudenko closed the pull request at: https://github.com/apache/spark/pull/5510 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature

[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-20 Thread petro-rudenko
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/5510#issuecomment-94419332 For my case it means: ```scala (new ParamGridBuilder).addGrid(lr.regParam, Array(0.1)) == (lr.regParam=0.1 && new ParamGridBuild

[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-20 Thread petro-rudenko
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/5510#issuecomment-94418041 For my case I can live with the default behaviour. It's just not intuitive that an empty ParamGridBuilder returns an array of size 1, and it is also not clear how to handle j

[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-15 Thread petro-rudenko
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/5510#issuecomment-93412249 Ideally CrossValidator should handle the following cases: 1) No parameters at all: just run est.fit(dataset, new ParamMap) 2) 1 param: set this param to estimator

[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-15 Thread petro-rudenko
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/5510#issuecomment-93373411 Maybe in Crossvalidator handle empty estimatorParamMap? ```scala /** @group setParam */ def setEstimatorParamMaps(value: Array[ParamMap]): this.type

[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-14 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/5510#discussion_r28339279 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamGridBuilder.scala --- @@ -100,10 +100,11 @@ class ParamGridBuilder { * Builds

[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-14 Thread petro-rudenko
GitHub user petro-rudenko opened a pull request: https://github.com/apache/spark/pull/5510 [SPARK-6901][Ml]ParamGridBuilder.build with no grids should return an empty array ParamGridBuilder.build with no grids returns an array with an empty param map. ```scala assert((new
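
One way to see why a builder with no grids can end up returning one empty param map is to model the grid as a Cartesian product that starts from a single empty map. The sketch below is a simplified stand-in written for illustration only: `GridSketch`, its `build` method, and the `ParamMap` type alias are hypothetical and are not Spark's `ParamGridBuilder` API.

```scala
// Simplified stand-in for a parameter grid builder; NOT Spark's ParamGridBuilder.
// All names here (GridSketch, build, ParamMap alias) are assumptions for illustration.
object GridSketch {
  type ParamMap = Map[String, Any]

  // Cartesian product of all grids, seeded with a single empty map.
  // With zero grids the fold never runs, so the seed is returned unchanged.
  def build(grids: Seq[(String, Seq[Any])]): Seq[ParamMap] =
    grids.foldLeft(Seq[ParamMap](Map.empty)) { case (maps, (name, values)) =>
      for (m <- maps; v <- values) yield m + (name -> v)
    }
}
```

Under this model, returning an empty array instead would require a special case for the no-grid input, which is the change the pull request title asks for.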

[GitHub] spark pull request: [SPARK-2991] Implement RDD lazy transforms for...

2015-04-06 Thread petro-rudenko
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/1909#issuecomment-90063723 +1 for this. Useful feature to calculate distributed cumulative sum.

[GitHub] spark pull request: [SPARK-5885][MLLIB] Add VectorAssembler as a f...

2015-04-03 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/5196#discussion_r27739585 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-5885][MLLIB] Add VectorAssembler as a f...

2015-04-02 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/5196#discussion_r27645880 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-5886][ML] Add label indexer

2015-03-31 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/4735#discussion_r27510186 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LabelIndexer.scala --- @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-5886][ML] Add label indexer

2015-03-31 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/4735#discussion_r27486767 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LabelIndexer.scala --- @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-5886][ML] Add label indexer

2015-03-30 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/4735#discussion_r27399968 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LabelIndexer.scala --- @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-6608] [SQL] Makes DataFrame.rdd a lazy ...

2015-03-30 Thread petro-rudenko
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/5265#issuecomment-87670835 +1 for this, since for example [the caching logic from ml package](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml

[GitHub] spark pull request: [ML][docs][minor] Define LabeledDocument/Docum...

2015-03-24 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/5135#discussion_r27043852 --- Diff: docs/ml-guide.md --- @@ -655,6 +660,36 @@ import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; import

[GitHub] spark pull request: [ML][docs][minor] Define LabeledDocument/Docum...

2015-03-23 Thread petro-rudenko
GitHub user petro-rudenko opened a pull request: https://github.com/apache/spark/pull/5135 [ML][docs][minor] Define LabeledDocument/Document classes in CV example To make the Cross-Validation example code snippet easier to copy/paste, LabeledDocument/Document need to be defined in it, since they

[GitHub] spark pull request: SPARK-4682 [CORE] Consolidate various 'Clock' ...

2015-02-25 Thread petro-rudenko
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/4514#issuecomment-75994711 Thanks, works now.

[GitHub] spark pull request: SPARK-4682 [CORE] Consolidate various 'Clock' ...

2015-02-25 Thread petro-rudenko
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/4514#issuecomment-75989874 Having a problem compiling Spark with sbt due to the following error: ``` $ build/sbt -Phadoop-2.4 compile [error] /home/peter/soft/spark_src/core/src/main/scala

[GitHub] spark pull request: [SPARK-5802][MLLIB] cache transformed data in ...

2015-02-23 Thread petro-rudenko
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/4593#issuecomment-75550855 @dbtsai, @joshdevins here's an issue I have. I'm using the new ml pipeline with hyperparameter grid search. Because folds don't depend on hyper

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-02-16 Thread petro-rudenko
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-74563955 @jkbradley I can setValidateData in GLM, but not in the LogisticRegression class from the new API. For my case I found a trick to customize anything I want (add

[GitHub] spark pull request: [Ml] SPARK-5804 Explicitly manage cache in Cro...

2015-02-13 Thread petro-rudenko
GitHub user petro-rudenko opened a pull request: https://github.com/apache/spark/pull/4595 [Ml] SPARK-5804 Explicitly manage cache in Crossvalidator k-fold loop On a big dataset, explicitly unpersisting the train and validation folds allows more data to be loaded into memory in the next loop

[GitHub] spark pull request: [Ml] SPARK-5796 Don't transform data on a last...

2015-02-13 Thread petro-rudenko
GitHub user petro-rudenko opened a pull request: https://github.com/apache/spark/pull/4590 [Ml] SPARK-5796 Don't transform data on a last estimator in Pipeline If it's the last estimator in a Pipeline, there's no need to transform the data, since there's no next stag
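
The idea in this pull request can be sketched with a toy pipeline in plain Scala. This is not Spark's `Pipeline` implementation: `PipelineSketch`, `Center`, `Data`, and the `transformCalls` counter are hypothetical names introduced only to show that the output of the final stage is never needed during fitting.

```scala
// Toy pipeline, NOT Spark's Pipeline API; every name below is an assumption.
object PipelineSketch {
  type Data = Vector[Double]

  trait Transformer { def transform(data: Data): Data }
  trait Estimator { def fit(data: Data): Transformer }

  // Counts transform calls made while fitting, to make the saving visible.
  var transformCalls = 0

  // Estimator that learns the mean and yields a centering transformer.
  object Center extends Estimator {
    def fit(data: Data): Transformer = {
      val mean = if (data.isEmpty) 0.0 else data.sum / data.size
      new Transformer {
        def transform(d: Data): Data = { transformCalls += 1; d.map(_ - mean) }
      }
    }
  }

  // Fit each stage in order; transform only when a later stage needs the output.
  def fit(stages: Seq[Estimator], data: Data): Seq[Transformer] = {
    var current = data
    stages.zipWithIndex.map { case (stage, i) =>
      val model = stage.fit(current)
      if (i < stages.size - 1) current = model.transform(current)
      model
    }
  }
}
```

With two stages, fitting this toy pipeline calls `transform` once rather than twice, because the last fitted model's output is not consumed by any further stage.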

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-02-09 Thread petro-rudenko
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-73509087 One more issue. In LogisticRegressionWithLBFGS class there's a line: ```scala this.setFeatureScaling(true) ``` I have feature scaling as a

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-27 Thread petro-rudenko
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-71636977 Also would be nice to be able to get/set model state: ```scala // Run cross-validation, and choose the best set of parameters. val cvModel = crossval.fit