[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62299039 @shivaram > IMHO it would be good to have the developer API updates as well and test a couple of more pipelines before we push this out. I'll try to get a branch based on this PR ready next week for feedback. Not sure if we want to do a mega-PR though; hopefully it can be kept as a separate follow-up. > Also I am not sure I fully understand the difference between the User API and the Developer API These are loose terms; part of the "Developer" API will actually be public. E.g., Classifier will be public since it will be needed for the boosting API. But most users won't have to worry about these abstract classes, and the classes will include some private[ml] methods to make developers' lives easier. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62242562 I will make one under `org.apache.spark.examples` (though it is still under `spark`).
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62241597 This is a good point; maybe we should have an example or a test class that is not in the org.apache.spark.ml package and tries to create a pipeline stage.
Github user tomerk commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62234569 Is there anything that can be done about these two lines, which appear in every transform and fit function?

```scala
import dataset.sqlContext._
val map = this.paramMap ++ paramMap
```

Beyond being boilerplate, the second line is an issue because, at least last I checked, `++` in ParamMap is marked as `private[ml]`, so it can't be used by custom pipeline stages.
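One way this merge could be hoisted out of user code (a hypothetical sketch with toy stand-ins for `Param`/`ParamMap`, not the actual `org.apache.spark.ml` classes) is a template method on the base class that merges the embedded and passed-in maps exactly once, so concrete stages implement only a hook that receives the already-merged map:

```scala
// Toy stand-ins for illustration; the real classes live in org.apache.spark.ml.param.
case class Param[T](name: String, default: Option[T] = None)

class ParamMap(val entries: Map[Param[_], Any] = Map.empty) {
  def put[T](param: Param[T], value: T): ParamMap =
    new ParamMap(entries + (param -> value))
  def get[T](param: Param[T]): Option[T] =
    entries.get(param).map(_.asInstanceOf[T]).orElse(param.default)
  // Merge; entries in `other` win over entries in `this`.
  def ++(other: ParamMap): ParamMap = new ParamMap(entries ++ other.entries)
}

abstract class Stage {
  protected def paramMap: ParamMap // params embedded in the stage instance
  // Template method: the merge happens exactly once, here in the framework.
  final def fit(extra: ParamMap): String = fitImpl(paramMap ++ extra)
  // Custom stages implement only this; `merged` already reflects `extra`.
  protected def fitImpl(merged: ParamMap): String
}
```

With this shape a custom stage never touches `++` at all, so its visibility would no longer matter to third-party pipeline stages.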
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62230172 @shivaram @tomerk I updated this PR and removed `modelParams`. The param traits are marked as `private[ml]`. As I mentioned, I have no particular preference here. I also added a simple text classification pipeline with a (simple) tokenizer and hashingTF. For the codegen of getters/setters, I asked @heathermiller and @marmbrus. The answer is no at this time: the `@BeanProperty` logic is hard-coded in the Scala compiler.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62227735 [Test build #23077 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23077/consoleFull) for PR 3099 at commit [`7772430`](https://github.com/apache/spark/commit/77724301621df221517cd0211c06ba6cafc9f661).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `abstract class Estimator[M <: Model] extends PipelineStage with Params`
  * `abstract class Evaluator extends Identifiable`
  * `trait Identifiable extends Serializable`
  * `abstract class Model extends Transformer`
  * `class Pipeline extends Estimator[PipelineModel]`
  * `class PipelineModel(`
  * `abstract class Transformer extends PipelineStage with Params`
  * `trait CrossValidatorParams extends Params`
  * `class CrossValidator extends Estimator[CrossValidatorModel] with CrossValidatorParams with Logging`
  * `class HashingTF extends Transformer with HasInputCol with HasOutputCol`
  * `class LogisticRegression extends Estimator[LogisticRegressionModel] with LogisticRegressionParams`
  * `class StandardScaler extends Estimator[StandardScalerModel] with StandardScalerParams`
  * `class Tokenizer extends Transformer with HasInputCol with HasOutputCol`
  * `class Param[T] (`
  * `class DoubleParam(parent: Params, name: String, doc: String, default: Option[Double] = None)`
  * `class IntParam(parent: Params, name: String, doc: String, default: Option[Int] = None)`
  * `class FloatParam(parent: Params, name: String, doc: String, default: Option[Float] = None)`
  * `class LongParam(parent: Params, name: String, doc: String, default: Option[Long] = None)`
  * `class BooleanParam(parent: Params, name: String, doc: String, default: Option[Boolean] = None)`
  * `case class ParamPair[T](param: Param[T], value: T)`
  * `trait Params extends Identifiable`
  * `class ParamGridBuilder`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62227719 [Test build #23077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23077/consoleFull) for PR 3099 at commit [`7772430`](https://github.com/apache/spark/commit/77724301621df221517cd0211c06ba6cafc9f661).
* This patch merges cleanly.
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62229656 Thanks for the replies. I'm summarizing the discussion so far and including some replies. I think we are getting close to a good API! IMHO it would be good to have the developer API updates as well and test a couple of more pipelines before we push this out.

Dataset
1. We all seem to agree that there is a need for a higher-level Dataset API or more flexible functions in SchemaRDD for new operations. `appendCol` seems like it'll be very useful, for one.
2. We also seem to have agreement that we will have an additional, simpler Transformer API from RDD to RDD. The one using Datasets will still be there for cases where the simpler API isn't enough.

Also I am not sure I fully understand the difference between the User API and the Developer API (developers are the main users for Pipelines?).

Pipelines
1. Batching, loops etc.: @mengxr -- I'll try out the parallel pipeline + feature assembler and let you know how it goes. One other requirement that we have for large pipelines is that we lazily evaluate each batch to fit in memory and use boosting. I'll try to see if we can do something similar here and get back to you.
2. Constructors, params: @mateiz -- Completely agree on the binary compatibility problems and the trouble we had with default constructor args. However I think some members are required as a part of the class vs. being optional parameters. For example, the regularization value is definitely a parameter with a default value of 0.0 and we want to tune its value etc. On the other hand, in say [HadamardProduct](https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/RandomSignNode.scala#L19) or even [LinearModel](https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/MultiClassLinearRegressionEstimator.scala#L81) the signs and weights are a part of the object. You almost never want to replace these values, as doing so would result in creating a completely different pipeline. So I think there are roles for both -- and we need to be careful, especially when we are saving and loading pipelines etc.
3. Evaluators: I can see that these are useful for model selection inside estimators, but as @jkbradley said we need to figure out a better way to chain them to a pipeline. FWIW my example was very simple, just trying to compute test error for a single model and not doing any model selection.
4. Parameter setters, passing, maps etc. -- We seem to have reached a nice design point on this! I agree that the implicit mapping was a bit tedious and `map(param)` is fine.
5. Parameter traits like HasInputCol -- This is the one issue where we don't have great ideas so far, I guess. On the one hand, having too many traits seems wasteful. On the other hand, the amount of cruft code without them is also tedious. One idea I had was to try out annotations (like `@Param featureCol: String`) and auto-generating code for setters and getters. More knowledgeable Java/Scala people may know more. (@JoshRosen?)
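For reference, the trait idea in point 5 can be sketched as follows (a simplified, hypothetical model of `Params`/`HasInputCol`, not the actual classes in this PR): each trait contributes one param plus its getter/setter, and a concrete stage mixes in only the traits it needs.

```scala
// Simplified stand-ins for illustration only; the real traits carry Param
// objects with documentation strings rather than a string-keyed map.
trait Params {
  private var values = Map.empty[String, Any]
  protected def set(name: String, value: Any): this.type = { values += name -> value; this }
  protected def get[T](name: String, default: T): T =
    values.getOrElse(name, default).asInstanceOf[T]
}

trait HasInputCol extends Params {
  def setInputCol(value: String): this.type = set("inputCol", value)
  def getInputCol: String = get("inputCol", "input")
}

trait HasOutputCol extends Params {
  def setOutputCol(value: String): this.type = set("outputCol", value)
  def getOutputCol: String = get("outputCol", "output")
}

// A stage picks up both params; chained setters keep the concrete type.
class TokenizerSketch extends HasInputCol with HasOutputCol
```

The trade-off discussed above is visible here: the setters come for free, but any doc comment on `setInputCol` is shared by every stage that mixes the trait in.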
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62227737 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23077/ Test FAILed.
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62223666 On the parameters vs constructors, the problem with default args to constructors is that they make it very difficult to add settings to an algorithm. Such changes will break binary compatibility in Scala and may also break calls to the algorithm if you change the order. In the current MLlib API I believe all the constructors are package-private because of this. The other thing with setters is that the API looks more similar in Java to the other two languages. This makes it easier for people to learn Spark by grabbing snippets of code and moving them across languages. Of course it has to be weighed against utility though, if this stuff gets out of hand.
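The evolvability point can be illustrated with a small sketch (a hypothetical class, not the real MLlib API): because each setting is a chained setter rather than a constructor argument, a later release can add `setTol` below without changing any existing signature, so code compiled against the old version keeps linking.

```scala
// Hypothetical sketch of the setter pattern under discussion.
class LogisticRegressionSketch {
  private var maxIter: Int = 100
  private var regParam: Double = 0.0
  private var tol: Double = 0.001 // added later without breaking old callers

  def setMaxIter(value: Int): this.type = { maxIter = value; this }
  def setRegParam(value: Double): this.type = { regParam = value; this }
  def setTol(value: Double): this.type = { tol = value; this }

  // For illustration: show the effective configuration.
  def describe: String = s"maxIter=$maxIter regParam=$regParam tol=$tol"
}
```

As noted above, `new LogisticRegressionSketch().setMaxIter(50).setRegParam(0.1)` also reads nearly identically in Java and Scala, unlike default constructor arguments.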
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62207734 I'd like to second the thanks to you both for trying out the new API! Some thoughts:

@shivaram About your Dataset API comments: As @mengxr said, I am planning several abstractions which should help with the boilerplate. I agree with your proposal for sticking with the familiar RDD[MyType] API where possible and letting abstractions handle the boilerplate of working with SchemaRDD. When that is not possible, I still hope to provide some helper functions to reduce boilerplate. I have this set of classes partly sketched out and will send a WIP PR once this PR gets merged.

@shivaram About your Pipelines API comments:
* Loops in a pipeline: What @mengxr suggested might work for the FFT thing, but general pipelines with cycles, etc. are definitely future work.
* Parameters vs. constructors: Instinctively, I agree about having at least some parameters specified in a constructor, especially when they are required parameters (e.g., the Estimator for CrossValidation). However, @mengxr convinced me that it makes things difficult. E.g., for CrossValidation, you really don't want a CV instance to be tied to a particular estimator, since you may want to run CV to choose between several Estimators.
* Chaining evaluators to a pipeline: Initially, the two ways to get evaluations will be to look at Transformers created by fitting Estimators (to see training evaluation metrics) and to compute metrics on your own using the new columns in the SchemaRDD produced by transform (to get test metrics). Later on, it would be great to allow users to insert Evaluators into Pipelines, to compute custom metrics more easily.

@tomerk About a few comments:
* "There are a lot of parameter traits": I too am ambivalent here. It may save a little code duplication, but may also discourage people from writing customized documentation.
Github user tomerk commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r20034471

--- Diff: mllib/src/main/scala/org/apache/spark/ml/example/StandardScaler.scala ---

```diff
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.example
+
+import org.apache.spark.ml._
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.feature
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.sql.SchemaRDD
+import org.apache.spark.sql.catalyst.analysis.Star
+import org.apache.spark.sql.catalyst.dsl._
+import org.apache.spark.sql.catalyst.expressions.Row
+
+class StandardScaler extends Estimator[StandardScalerModel] with HasInputCol {
+
+  def setInputCol(value: String): this.type = { set(inputCol, value); this }
+
+  override val modelParams: StandardScalerModelParams = new StandardScalerModelParams {}
+
+  override def fit(dataset: SchemaRDD, paramMap: ParamMap): StandardScalerModel = {
+    import dataset.sqlContext._
+    val map = this.paramMap ++ paramMap
```

--- End diff --

This bit of boilerplate that appears in every transform and fit implementation should probably be moved into the interface somehow.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62199983 @tomerk

> At @shivaram's suggestion, I started porting over a simple text classifier pipeline that was already using an Estimator/Transformer abstraction of RDD[U] to RDD[V] transforms to this interface. The almost-complete port (the imports got messed up when moving files around) can be found at shivaram/spark-ml@522aec7.

Thanks for testing on the newsgroup20 pipeline!

> The trickiest bit by far was all of the implicit conversions. I ended up needing to use several types of implicit conversion imports (case class -> schema RDD, spark sql dsl, parameter map, etc.) They also got mysteriously deleted by the IDE as I moved files between projects. I ended up having to copy and paste these whenever appropriate because I couldn't keep track of them.

I removed implicitMapping because I found that `map(inputCols)` is actually shorter than `(inputCols: String)`. As for Scala IDEs, I don't think we can trust them to translate code.

> Like Shivaram, I'm also not familiar with the Spark SQL dsl, so here I also had to copy and paste code. It's unclear what syntax is valid and what isn't. For example, is saying "as outputCol" enough, or is "as Symbol(outputCol)" required?

`as(String)` was recently added to Spark master. If you are not on the latest version, you need `as Symbol(outputCol)`.

> There is a lot of boilerplate code. It was easier to write the Transformers in the form RDD[U] to RDD[V] instead of SchemaRDD to SchemaRDD, so I fully agree with Shivaram on that front. Potentially, certain interfaces along those lines (iterator to iterator transformers that can be applied to RDDs using mappartitions) could make it easier to have transformers not depend on local Spark Contexts to execute.

We will keep that option. Note that the strongly-typed approach will face trouble when we need to deal with extra columns, e.g., event ids, or support more features later, e.g., weighted instances.

> I found the parameter mapping in estimators fairly verbose, I like Shivaram's idea of having the estimators pass everything to the transformers no matter what.

Yes, I like that idea too.

> Estimators requiring the transformers they output to extend Model didn't make sense to me. Certain estimators, such as to choose only the most frequent tokens in a collection to keep for each document, don't seem like they should output models. On that front, should it be required for estimators to specify the type of transformer they output? It can be convenient sometimes to just inline an anonymous Transformer to output without making it a top-level class.

The set of most frequent tokens could be viewed as a descriptive model in your case. I did try estimators without generic model types, but I don't remember what went wrong that made me switch to generics.

> There are a lot of parameter traits: HasRegParam, HasMaxIter, HasScoreCol, HasFeatureCol Does it make sense to have this many specific parameter traits if we still have to maintain boilerplate setters code for Java anyway?

I have no preference on this one. I put those in before I realized the problem with Java. I marked those as `private[ml]` in the current version and I'm okay with removing them completely.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62196531

> @mengxr @jkbradley I tried to port one of our simple image processing pipelines to the new interface today and the code for this is at https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/MnistRandomFFTPipeline.scala .

Thanks for testing with a real pipeline!

> Note that I tried to write this as an application layer on top of Spark as I think our primary goal is to enable custom pipelines, not just those using our provided components.

Exactly.

> But as the SQL component itself is pretty new, I think my skill set maybe representative of the average Spark application developer.

We are at the stage of learning Spark SQL as well. Note that this component will be marked as alpha.

> Because we are using SchemaRDDs directly I found that there is 5-10 lines of boiler plate code in every transform or fit function. This is usually selecting one or more input columns and getting out an RDD which can be processed and then adding an output column at the end. It would be good if we can have wrappers which do this automatically (I have one proposal below)

The Scala DSL is new. @liancheng helped add the implicit `.call` method to UDF. I think when we fully understand the operations we need, we can work with @marmbrus to add those wrappers.

> The UDF interface or Row => Row feels pretty restrictive. For numerical computations I often want to do things like mapPartitions or use broadcast variables etc. In those cases I am not sure how to directly use UDFs.

We are not restricted by UDFs. As you commented, we can create a new RDD and zip it back.

> And the main reason I used UDFs was to handle the semantics of appending an output column. Is there any other API other than using select Star(None) ... ? It'd be great if we had something like dataset.zip() which is easier / more flexible ?

I think the only boilerplate code here is `Star(None)`. The input columns and output column are required. I'm sure it is easy to add operations like

~~~
dataset.appendCol(predict('features) as 'prediction)
~~~

> One proposal I had for simplifying things is to define another transformer API which is less general but easier to use (we can keep the existing ones as base classes).

I like this idea. We will also keep the `predict` and `predictValues` methods that operate on normal RDDs. Note that the main focus of this PR is user-facing APIs: how users create pipelines and specify parameters. @jkbradley is working on the developer-side APIs.

Pipelines API

> Multiple batches of transformers -- One of the main problems I ran into was that I couldn't see an easy way to have loops inside a pipeline.

We can create a transformer like the following:

~~~
val featurePipeline = ... // the pipeline you created
val par = new ParallelPipeline()
  .setPipeline(featurePipeline)
  .setParamMaps(Array(
    ParamMap(softMax.outputCol -> "f0"),
    ParamMap(softMax.outputCol -> "f1"),
    ParamMap(softMax.outputCol -> "f2")))
val fvAssembler = new FeatureVectorAssembler()
  .setInputCols(Array("f0", "f1", "f2"))
val pipeline = new Pipeline()
  .setStages(Array(par, fvAssembler, linearSolver))
~~~

> Passing parameters in Estimators -- I finally understood the problem you were describing originally! However I solved it slightly differently than having 2 sets of parameters (this is in MultiClassLinearRegressionEstimator.scala). I made the parent Estimator class take in as params all the parameters required for the child Transformer class as well and then passed them along. I thought this was cleaner than having modelParams and params as two different members.

Yes, this is cleaner. So we separate Estimator from the output Model but both share the same set of params. I like this idea, and then we don't need `trainingParams` :)

> Parameters vs. Constructors -- While I agree with your comment that constructor arguments make binary compatibility tricky, I ran into a couple of cases where creating a transformer without a particular argument doesn't make sense. Will we have a guideline that things which won't vary should be constructor arguments? I think it'll be good to come up with some distinction that makes it clear for the programmer.

Could you list a few examples? Having many parameters is common for ML components. I feel it is hard to decide what parameters won't vary. My understanding is that `LogisticRegression` is just a placeholder with parameters to be configured, and validation will happen at runtime (not added yet).

> Chaining evaluators to a Pipeline -- I wrote a simple evaluator that computed test error at the end but wasn't sure how to chain this to a pipeline.
Github user tomerk commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62112717 At @shivaram's suggestion, I started porting over a simple text classifier pipeline that was already using an Estimator/Transformer abstraction of RDD[U] to RDD[V] transforms to this interface. The almost-complete port (the imports got messed up when moving files around) can be found at https://github.com/shivaram/spark-ml/commit/522aec73172b28a4bc1b22df030a459fddbd93dd. Beyond what Shivaram already mentioned, here are my thoughts:

1. The trickiest bit by far was all of the implicit conversions. I ended up needing to use several types of implicit conversion imports (case class -> SchemaRDD, Spark SQL DSL, parameter map, etc.). They also got mysteriously deleted by the IDE as I moved files between projects. I ended up having to copy and paste these whenever appropriate because I couldn't keep track of them.
2. Like Shivaram, I'm also not familiar with the Spark SQL DSL, so here I also had to copy and paste code. It's unclear what syntax is valid and what isn't. For example, is saying "as outputCol" enough, or is "as Symbol(outputCol)" required?
3. There is a lot of boilerplate code. It was easier to write the Transformers in the form RDD[U] to RDD[V] instead of SchemaRDD to SchemaRDD, so I fully agree with Shivaram on that front. Potentially, certain interfaces along those lines (iterator-to-iterator transformers that can be applied to RDDs using mapPartitions) could make it easier to have transformers not depend on local SparkContexts to execute.
4. I found the parameter mapping in estimators fairly verbose; I like Shivaram's idea of having the estimators pass everything to the transformers no matter what.
5. Estimators requiring the transformers they output to extend Model didn't make sense to me. Certain estimators, such as one that chooses only the most frequent tokens in a collection to keep for each document, don't seem like they should output models. On that front, should it be required for estimators to specify the type of transformer they output? It can be convenient sometimes to just inline an anonymous Transformer to output without making it a top-level class.
6. There are a lot of parameter traits: HasRegParam, HasMaxIter, HasScoreCol, HasFeatureCol. Does it make sense to have this many specific parameter traits if we still have to maintain boilerplate setter code for Java anyway?
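The anonymous-Transformer idea in point 5 can be made concrete with a sketch (toy stand-ins for Estimator/Transformer over plain collections, not the PR's SchemaRDD-based API; `FrequentTokenFilter` is a hypothetical name):

```scala
// Toy abstractions to illustrate returning an anonymous transformer from fit
// without declaring a top-level model class.
abstract class Transformer { def transform(tokens: Seq[String]): Seq[String] }
abstract class Estimator { def fit(docs: Seq[Seq[String]]): Transformer }

// An estimator that keeps only tokens appearing at least `minCount` times.
class FrequentTokenFilter(minCount: Int) extends Estimator {
  def fit(docs: Seq[Seq[String]]): Transformer = {
    val counts = docs.flatten.groupBy(identity).map { case (t, xs) => (t, xs.size) }
    val keep = counts.filter(_._2 >= minCount).keySet
    // The fitted "model" is just an anonymous Transformer closing over `keep`.
    new Transformer {
      def transform(tokens: Seq[String]): Seq[String] = tokens.filter(keep)
    }
  }
}
```

Whether the closed-over token set counts as a "model" is exactly the question raised above; the sketch only shows that the inline form is syntactically cheap.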
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62105053 @mengxr @jkbradley I tried to port one of our simple image processing pipelines to the new interface today and the code for this is at https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/MnistRandomFFTPipeline.scala . Note that I tried to write this as an application layer on top of Spark as I think our primary goal is to enable custom pipelines, not just those using our provided components. I had some thoughts about the Pipelines and Dataset APIs while writing this code and I've detailed them below. It'll be great if you could also look over the code to see if I am missing something. Dataset API I should preface this section by saying that my knowledge of SparkSQL and SchemaRDD is very limited. I haven't written much code using this and I referred to your code + the programming guide for most part. But as the SQL component itself is pretty new, I think my skill set maybe representative of the average Spark application developer. That said please let me know if I missed something obvious. 1. Because we are using SchemaRDDs directly I found that there is 5-10 lines of boiler plate code in every transform or fit function. This is usually selecting one or more input columns and getting out an RDD which can be processed and then adding an output column at the end. It would be good if we can have wrappers which do this automatically (I have one proposal below) 2. The UDF interface or Row => Row feels pretty restrictive. For numerical computations I often want to do things like mapPartitions or use broadcast variables etc. In those cases I am not sure how to directly use UDFs. 3. And the main reason I used UDFs was to handle the semantics of appending an output column. Is there any other API other than using `select Star(None) ...` ? It'd be great if we had something like dataset.zip() which is easier / more flexible ? 
One proposal I had for simplifying things is to define another transformer API which is less general but easier to use (we can keep the existing ones as base classes). This API might look something like this:
```
abstract class SimpleTransformer[U, V] extends Transformer with HasInputCol with HasOutputCol {
  def transform(input: RDD[U]): RDD[V]
}
```
We would then have wrapper code that selects the input column, casts it to `U`, and similarly takes the `V` output and adds it as the outputCol of the dataset. There are a couple of advantages with this: first, the RDD API is flexible, well documented, and well understood by programmers. Second, this also provides a good boundary for unit testing and gives the programmer some type safety within their transformer.

Pipelines API

1. Multiple batches of transformers -- One of the main problems I ran into was that I couldn't see an easy way to have loops inside a pipeline. By loops I mean repeating some part of the pipeline many times. In this example I wanted to run RandomSignNode followed by FFT a number of times and then concatenate their output as features for Regression. There is a similar pattern in the pipeline described in [Evan's blog post](https://amplab.cs.berkeley.edu/2014/09/24/4643/).
2. Passing parameters in Estimators -- I finally understood the problem you were describing originally! However, I solved it slightly differently than having 2 sets of parameters (this is in MultiClassLinearRegressionEstimator.scala). I made the parent Estimator class take in as params all the parameters required for the child Transformer class as well, and then passed them along. I thought this was cleaner than having modelParams and params as two different members.
3. Parameters vs. Constructors -- While I agree with your comment that constructor arguments make binary compatibility tricky, I ran into a couple of cases where creating a transformer without a particular argument doesn't make sense.
Will we have a guideline that things which won't vary should be constructor arguments? I think it'll be good to come up with some distinction that makes it clear for the programmer.
4. Chaining evaluators to a Pipeline -- I wrote a simple evaluator that computed test error at the end but wasn't sure how to chain this to a pipeline. Is this something we don't intend to support?
5. Strings, Vectors etc. -- There were a couple of minor things that weren't ideal, but I think we will have to live with them. With every node having an input and output column, there are lots of strings floating around which the user needs to get right. Also, going from Spark's Vector / Matrix classes to Breeze and back to Spark classes gets tedious. We could add commonly used operators like add, multiply etc. to our public APIs and that might help a bit.
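shivaram's proposed wrapper could look roughly like the following Spark-free Python sketch, where a plain list of dict rows stands in for a SchemaRDD and all names (`SimpleTransformer`, `transform_values`, `Doubler`) are hypothetical illustrations, not the proposed API:

```python
class SimpleTransformer:
    """Hypothetical wrapper: handles input-column selection and
    output-column appending, so a concrete transformer only
    implements a typed function over the input column's values."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform_values(self, values):
        raise NotImplementedError  # the user's typed logic goes here

    def transform(self, dataset):
        # dataset: list of dict rows, standing in for a SchemaRDD
        inputs = [row[self.input_col] for row in dataset]   # select input col
        outputs = self.transform_values(inputs)             # user logic
        return [dict(row, **{self.output_col: out})         # append output col
                for row, out in zip(dataset, outputs)]

class Doubler(SimpleTransformer):
    def transform_values(self, values):
        return [2 * v for v in values]

rows = [{"x": 1}, {"x": 2}]
out = Doubler("x", "y").transform(rows)
# out == [{"x": 1, "y": 2}, {"x": 2, "y": 4}]
```

The unit-testing boundary shivaram mentions falls naturally on `transform_values`, which can be tested without any dataset machinery.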
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62034301 [Test build #23011 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23011/consoleFull) for PR 3099 at commit [`2d040b3`](https://github.com/apache/spark/commit/2d040b3b99be4192b6301452f04c1a758b0e6a84).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `abstract class Estimator[M <: Model] extends PipelineStage with Params`
  * `abstract class Evaluator extends Identifiable`
  * `trait Identifiable extends Serializable`
  * `abstract class Model extends Transformer`
  * `class Pipeline extends Estimator[PipelineModel]`
  * `class PipelineModel(val transformers: Array[Transformer]) extends Model`
  * `abstract class Transformer extends PipelineStage with Params`
  * `class CrossValidator extends Estimator[CrossValidatorModel] with Params`
  * `class CrossValidatorModel(bestModel: Model, metric: Double) extends Model`
  * `class LogisticRegressionModel(`
  * `class StandardScaler extends Estimator[StandardScalerModel] with HasInputCol`
  * `trait StandardScalerModelParams extends Params with HasInputCol with HasOutputCol`
  * `class DoubleParam(parent: Params, name: String, doc: String, default: Option[Double] = None)`
  * `class IntParam(parent: Params, name: String, doc: String, default: Option[Int] = None)`
  * `case class ParamPair[T](param: Param[T], value: T)`
  * `trait Params extends Identifiable`
  * `class ParamGridBuilder`
  * `trait HasRegParam extends Params`
  * `trait HasMaxIter extends Params`
  * `trait HasFeaturesCol extends Params`
  * `trait HasLabelCol extends Params`
  * `trait HasScoreCol extends Params`
  * `trait HasThreshold extends Params`
  * `trait HasMetricName extends Params`
  * `trait HasInputCol extends Params`
  * `trait HasOutputCol extends Params`
--- If your project is set up for it, you can reply to this email and have your reply
appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62034255 [Test build #23011 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23011/consoleFull) for PR 3099 at commit [`2d040b3`](https://github.com/apache/spark/commit/2d040b3b99be4192b6301452f04c1a758b0e6a84). * This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62034309 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23011/ Test FAILed.
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61978539 Oh, and I'm not saying "let's not support PMML" - I'm saying let's have a sensible way of handing off models to other JVM programs that doesn't involve writing to XML in the middle.
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61977531 I've got several other comments on this PR - mostly good, and will leave some more detailed comments in a bit. TL;DR - What's wrong with Java Serialization? But - PMML support seems a little like overkill here. I admit I'm fairly ignorant to the details of PMML, but my understanding is that it is designed to facilitate transfer of models between languages - e.g. SAS to Java to R, etc. While I'm sure it's general purpose enough to capture complex pipelines, I'd be surprised if it can do so efficiently. @jegonzal and @dcrankshaw are talking about a java-based runtime environment for serving complex pipelines trained via spark and MLlib. Spark is already pretty good about shipping JVM code around to remote sites and executing it - it does so by serializing and shipping standard Java objects, and I don't see why we should follow a very different pattern here. This design has been part of MLlib since day 1. I don't want to speak for these two, but I don't think they have a problem having a runtime dependency on MLlib or some other JVM-based machine learning library, their main issue is that they don't want to have to call into a batch system (aka spark) to execute their models. @jkbradley - I agree that some models or transformers are going to require a lot of state, and potentially distributed computation, but this should be the exception, not the rule. In general, an Estimator should compute a fairly small object (Transformer) which is small enough to be passed around and doesn't need cluster resources to run. In the case of outlier removal, for example, I'd imagine that the Estimator would take several passes over the data and compute some sufficient statistics to remove new points. For cases like ALS, where the model is really big, this is exactly where @jegonzal and @dcrankshaw's research comes in. 
To make an analogy - just as I'd happily use Spark to compute a distributed inverted index, I certainly wouldn't use it to do point queries on that index. So some interface for transmitting this large distributed state to a system more prepared to answer point queries based on that state is required. At any rate - this is great stuff, I just want to make sure we don't get lost in the weeds of supporting the general case at the expense of streamlined support for the obvious case.
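etrain's point - that an Estimator usually makes passes over the data but returns a small, self-contained Transformer that can be shipped and run without cluster resources - can be sketched Spark-free in Python, using his outlier-removal example. All names are hypothetical, and `pickle` stands in for the plain JVM serialization he advocates:

```python
import pickle

class OutlierRemoverModel:
    """Tiny transformer: holds only sufficient statistics, so it is
    cheap to serialize and needs no cluster to apply."""
    def __init__(self, mean, std, k=3.0):
        self.mean, self.std, self.k = mean, std, k

    def transform(self, values):
        # Keep points within k standard deviations of the training mean.
        return [v for v in values if abs(v - self.mean) <= self.k * self.std]

class OutlierRemoverEstimator:
    def fit(self, values):
        # In Spark this would be a couple of passes over an RDD;
        # plain Python stands in for the distributed computation.
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        return OutlierRemoverModel(mean, var ** 0.5)

model = OutlierRemoverEstimator().fit([1.0, 1.1, 0.9, 1.0])
payload = pickle.dumps(model)      # small object, cheap to hand off
served = pickle.loads(payload)     # e.g. in a separate serving process
```

For the genuinely large models (ALS factors, inverted indexes), this hand-off pattern breaks down, which is exactly the serving-system research he points to.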
Github user vruusmann commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61949888 @jegonzal PMML is essentially a domain-specific language (DSL) for the domain of predictive analytics applications. It is commonly used only for the representation of statistical and data mining models (i.e. simple "model export" functionality). But it can do much more, including chaining multiple models together into a so-called model ensemble, implementing complex data pre- and post-processing logic, adding annotations and machine-readable documentation to the solution, and so on. I have a separate project in the works that deals specifically with translating memory- and CPU-intensive parts of PMML documents into Java source code for more efficient execution.
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61935382 Both of those solutions seem reasonable. However, I wonder if PMML is sufficient to express complex model pipelines? The latter solution seems much more interesting and would perhaps be something @dcrankshaw might want to investigate.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61934791 @jegonzal @jkbradley (This may be slightly off-topic.) To serve models, which are usually long pipelines in practice, I'm thinking of the following:
1. Serialize to a portable format, like PMML. For example, Google Prediction API accepts PMML with transformations. @selvinsource is working on #3062 and @srowen is helping.
2. Code generation. This is less portable than PMML, but faster and maybe easier to implement. For example, we can do the following:
~~~
val model = pipeline.fit(dataset)
val code = model.generateCode(schema)
~~~
Then the compiled code can make predictions on normal Java input, without dependencies on Spark.
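To make option 2 concrete, here is a Spark-free Python sketch of the code-generation idea. `FittedLinearModel` and `generate_code` are hypothetical stand-ins, not the proposed MLlib API: a fitted model emits standalone scoring source with its coefficients baked in, and the compiled result has no dependency on the training library.

```python
class FittedLinearModel:
    """Hypothetical fitted linear model used to illustrate codegen."""
    def __init__(self, weights, intercept):
        self.weights, self.intercept = weights, intercept

    def generate_code(self, feature_names):
        # Emit a self-contained scoring function as source text;
        # the learned coefficients are inlined as literals.
        terms = " + ".join(
            f"{w!r} * row[{name!r}]"
            for w, name in zip(self.weights, feature_names))
        return f"def predict(row):\n    return {self.intercept!r} + {terms}\n"

model = FittedLinearModel(weights=[2.0, -1.0], intercept=0.5)
src = model.generate_code(["age", "height"])
ns = {}
exec(compile(src, "<generated>", "exec"), ns)  # the "compile" step
score = ns["predict"]({"age": 3.0, "height": 2.0})
# score == 0.5 + 2.0*3.0 + (-1.0)*2.0 == 4.5
```

In the real proposal the emitted source would be Java and compiled with a JVM compiler, but the shape is the same: generation happens once at export time, and scoring needs only the generated artifact.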
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61934417 @jkbradley Right now we are planning to serve linear combinations of models derived from MLlib (currently latent factor models, naive bayes, and decision trees). Though I agree that in some cases (e.g., latent factor models) serving is less trivial (that's the research). Still, it would be good to think of models as output to be consumed by systems beyond Spark, should people want to do things beyond computing test error; admittedly that's all I ever do :-).
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61932628 @jegonzal Do you feel like most model classes should be serve-able by design? Or would you recommend having lighter model classes which can be built from regular model classes, serialized, and shipped for serving? When you have examples of serve-able models available, it would be great to see them to plan ahead for Spark. Thanks!
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61932391 @mengxr @shivaram Adding to the conversation...

> In addition to the SchemaRDD based transformer API, it would be good to have a single Row transformation API too.

I agree, but per-Row transformations may need to be kept to subclasses of Transformer. E.g., a Transformer which filters out outlier instances might only be able to transform a full RDD, whereas most Classifier/Regressor concepts can operate per-Row. I plan to do a PR for some of these abstractions after Pipelines are merged in. I think @mengxr covered the other answers so far.
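The distinction jkbradley draws - per-Row transforms as a subclass, with whole-dataset transforms at the base - can be sketched in plain Python; all class names here are hypothetical illustrations, not the proposed API:

```python
class Transformer:
    def transform(self, rows):
        """Whole-dataset transform: may need global information."""
        raise NotImplementedError

class RowTransformer(Transformer):
    """Subclass for transforms that only need one row at a time."""
    def transform_row(self, row):
        raise NotImplementedError

    def transform(self, rows):
        # A per-row transform trivially lifts to the dataset level.
        return [self.transform_row(r) for r in rows]

class Thresholder(RowTransformer):
    """Classifier-style: each row is labeled independently."""
    def transform_row(self, row):
        return dict(row, label=1 if row["score"] > 0.5 else 0)

class TopHalfFilter(Transformer):
    """Outlier-filter-style: needs the full dataset to pick a cutoff,
    so it cannot implement a per-row interface."""
    def transform(self, rows):
        cutoff = sorted(r["score"] for r in rows)[len(rows) // 2]
        return [r for r in rows if r["score"] >= cutoff]
```

The point of the hierarchy is that boosting or classifier abstractions can demand the stricter `RowTransformer` contract, while the pipeline itself only relies on the dataset-level `transform`.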
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61931724 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22991/ Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61931722 [Test build #22991 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22991/consoleFull) for PR 3099 at commit [`fd751fc`](https://github.com/apache/spark/commit/fd751fc038bdab9367d9fd07635e9ec30526f7a2).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `abstract class Estimator[M <: Model] extends PipelineStage with Params`
  * `abstract class Evaluator extends Identifiable`
  * `trait Identifiable extends Serializable`
  * `abstract class Model extends Transformer`
  * `class Pipeline extends Estimator[PipelineModel]`
  * `class PipelineModel(val transformers: Array[Transformer]) extends Model`
  * `abstract class Transformer extends PipelineStage with Params`
  * `trait HasEstimator extends Params`
  * `trait HasEvaluator extends Params`
  * `trait HasEstimatorParamMaps extends Params`
  * `class CrossValidatorModel(bestModel: Model, metric: Double) extends Model`
  * `class LogisticRegressionModel(`
  * `class StandardScaler extends Transformer with Params with HasInputCol with HasOutputCol`
  * `class DoubleParam(parent: Params, name: String, doc: String, default: Option[Double] = None)`
  * `class IntParam(parent: Params, name: String, doc: String, default: Option[Int] = None)`
  * `case class ParamPair[T](param: Param[T], value: T)`
  * `trait Params extends Identifiable`
  * `class ParamGridBuilder`
  * `trait HasRegParam extends Params`
  * `trait HasMaxIter extends Params`
  * `trait HasFeaturesCol extends Params`
  * `trait HasLabelCol extends Params`
  * `trait HasScoreCol extends Params`
  * `trait HasThreshold extends Params`
  * `trait HasMetricName extends Params`
  * `trait HasInputCol extends Params`
  * `trait HasOutputCol extends Params`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61931707 [Test build #22991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22991/consoleFull) for PR 3099 at commit [`fd751fc`](https://github.com/apache/spark/commit/fd751fc038bdab9367d9fd07635e9ec30526f7a2). * This patch merges cleanly.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19927481
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala (Apache license header omitted) ---
```
package org.apache.spark.ml

import scala.annotation.varargs

import org.apache.spark.ml.param.{ParamMap, ParamPair, Params}
import org.apache.spark.sql.SchemaRDD

/**
 * Abstract class for estimators that fits models to data.
 */
abstract class Estimator[M <: Model] extends PipelineStage with Params {

  /**
   * Fits a single model to the input data with optional parameters.
   *
   * @param dataset input dataset
   * @param paramPairs optional list of param pairs, overwrite embedded params
   * @return fitted model
   */
  @varargs
  def fit(dataset: SchemaRDD, paramPairs: ParamPair[_]*): M = {
    val map = new ParamMap()
    paramPairs.foreach(map.put(_))
    fit(dataset, map)
  }

  /**
   * Fits a single model to the input data with provided parameter map.
   *
   * @param dataset input dataset
   * @param paramMap parameters
   * @return fitted model
   */
  def fit(dataset: SchemaRDD, paramMap: ParamMap): M

  /**
   * Fits multiple models to the input data with multiple sets of parameters.
   * The default implementation uses a for loop on each parameter map.
   * Subclasses could overwrite this to optimize multi-model training.
   *
   * @param dataset input dataset
   * @param paramMaps an array of parameter maps
   * @return fitted models, matching the input parameter maps
   */
  def fit(dataset: SchemaRDD, paramMaps: Array[ParamMap]): Seq[M] = { // how to return an array?
    paramMaps.map(fit(dataset, _))
```
--- End diff --
.toArray ?
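The three `fit` overloads quoted above (varargs param pairs, one ParamMap, an array of ParamMaps with a default one-model-per-map loop) can be sketched Spark-free in Python; the names here are hypothetical, and `fit_multi` stands in for the array overload since Python has no overloading:

```python
class Estimator:
    def fit(self, dataset, param_map=None):
        """Fit one model; param_map (a dict) overrides embedded params."""
        raise NotImplementedError

    def fit_multi(self, dataset, param_maps):
        # Default implementation: one independent fit per parameter map.
        # A subclass could override this to share work across models
        # (the "multi-model training" optimization in the javadoc).
        return [self.fit(dataset, m) for m in param_maps]

class MeanShift(Estimator):
    """Toy estimator: the 'model' is just a shifted mean."""
    def fit(self, dataset, param_map=None):
        shift = (param_map or {}).get("shift", 0.0)
        return sum(dataset) / len(dataset) + shift

est = MeanShift()
models = est.fit_multi([1.0, 3.0], [{"shift": 0.0}, {"shift": 1.0}])
# models == [2.0, 3.0]
```

The point of keeping the array overload on the base class is exactly what the comment thread raises: a CrossValidator can hand all its candidate maps to one call and let a smart estimator amortize the shared work.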
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61930130 [Test build #22987 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22987/consoleFull) for PR 3099 at commit [`3f810cd`](https://github.com/apache/spark/commit/3f810cd31f451a9b0019cd8f69e9fbe400dd4269).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `abstract class Estimator[M <: Model] extends PipelineStage with Params`
  * `abstract class Evaluator extends Identifiable`
  * `trait Identifiable extends Serializable`
  * `abstract class Model extends Transformer`
  * `class Pipeline extends Estimator[PipelineModel]`
  * `class PipelineModel(val transformers: Array[Transformer]) extends Model`
  * `abstract class Transformer extends PipelineStage with Params`
  * `trait HasEstimator extends Params`
  * `trait HasEvaluator extends Params`
  * `trait HasEstimatorParamMaps extends Params`
  * `class CrossValidatorModel(bestModel: Model, metric: Double) extends Model`
  * `class LogisticRegressionModel(`
  * `class StandardScaler extends Transformer with Params with HasInputCol with HasOutputCol`
  * `class DoubleParam(parent: Params, name: String, doc: String, default: Option[Double] = None)`
  * `class IntParam(parent: Params, name: String, doc: String, default: Option[Int] = None)`
  * `case class ParamPair[T](param: Param[T], value: T)`
  * `trait Params extends Identifiable`
  * `class ParamGridBuilder`
  * `trait HasRegParam extends Params`
  * `trait HasMaxIter extends Params`
  * `trait HasFeaturesCol extends Params`
  * `trait HasLabelCol extends Params`
  * `trait HasScoreCol extends Params`
  * `trait HasThreshold extends Params`
  * `trait HasMetricName extends Params`
  * `trait HasInputCol extends Params`
  * `trait HasOutputCol extends Params`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61930125 [Test build #22987 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22987/consoleFull) for PR 3099 at commit [`3f810cd`](https://github.com/apache/spark/commit/3f810cd31f451a9b0019cd8f69e9fbe400dd4269). * This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61930133 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22987/ Test FAILed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61929839 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22986/ Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61929837 [Test build #22986 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22986/consoleFull) for PR 3099 at commit [`9f408ed`](https://github.com/apache/spark/commit/9f408edef1c7f23323bf8a9d86ae69cbbda7002e).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `abstract class Estimator[M <: Model] extends PipelineStage with Params`
  * `abstract class Evaluator extends Identifiable`
  * `trait Identifiable extends Serializable`
  * `abstract class Model extends Transformer`
  * `class Pipeline extends Estimator[PipelineModel]`
  * `class PipelineModel(val transformers: Array[Transformer]) extends Model`
  * `abstract class Transformer extends PipelineStage with Params`
  * `trait HasEstimator extends Params`
  * `trait HasEvaluator extends Params`
  * `trait HasEstimatorParamMaps extends Params`
  * `class CrossValidatorModel(bestModel: Model, metric: Double) extends Model`
  * `class LogisticRegressionModel(`
  * `class StandardScaler extends Transformer with Params with HasInputCol with HasOutputCol`
  * `class DoubleParam(parent: Params, name: String, doc: String, default: Option[Double] = None)`
  * `class IntParam(parent: Params, name: String, doc: String, default: Option[Int] = None)`
  * `case class ParamPair[T](param: Param[T], value: T)`
  * `trait Params extends Identifiable`
  * `class ParamGridBuilder`
  * `trait HasRegParam extends Params`
  * `trait HasMaxIter extends Params`
  * `trait HasFeaturesCol extends Params`
  * `trait HasLabelCol extends Params`
  * `trait HasScoreCol extends Params`
  * `trait HasThreshold extends Params`
  * `trait HasMetricName extends Params`
  * `trait HasInputCol extends Params`
  * `trait HasOutputCol extends Params`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61929832 [Test build #22986 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22986/consoleFull) for PR 3099 at commit [`9f408ed`](https://github.com/apache/spark/commit/9f408edef1c7f23323bf8a9d86ae69cbbda7002e). * This patch merges cleanly.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61923441 @manishamde Thanks for your feedback!

> Like @jkbradley, I prefer lr.setMaxIter(50) over lr.set(lr.maxIter, 50).

Me too, and this is implemented in the current version.

> Also, prefer to avoid passing parameters to fit like lr.fit(dataset, lr.maxIter -> 50).

1. It is useful when you want to try some parameters interactively. For example, suppose I'm using the default settings and I want to see how each parameter affects the result. I can try:

~~~
val lr = new LogisticRegression()
  .setMaxIter(20)
  .setRegParam(0.1)
lr.fit(dataset, lr.maxIter -> 50)
lr.fit(dataset, lr.regParam -> 1.0) // note that maxIter goes back to the default setting
~~~

If we only allow setters, the code becomes

~~~
val previousMaxIter = lr.getMaxIter
lr.setMaxIter(50)
lr.fit(dataset)
lr.setMaxIter(previousMaxIter)

val previousRegParam = lr.getRegParam
lr.setRegParam(1.0)
lr.fit(dataset)
lr.setRegParam(previousRegParam)
~~~

2. Another reason I want to have parameters specified in `fit` is multi-model training, as described in the design doc.

> Constructors with getters and setters as @shivaram pointed out will be great. The LOC reduction is important and should not be discounted.

Besides the binary compatibility issue and the Java issue I mentioned, it doesn't save you many characters:

~~~
val lr = new LogisticRegression(maxIter = 50, regParam = 0.1)

val lr = new LogisticRegression()
  .setMaxIter(50)
  .setRegParam(0.1)
~~~

> Do we plan to provide syntactic sugar such as a predict method when we use model to transform a dataset? For me, transform fits well with the feature engineering stage and predict after the model training has been performed.

I think we should keep methods that operate on normal RDDs and individual instances.

> It will be great to see the corresponding examples in Python. The getters/setters would map well to Python properties. Also, it will be nice to do an apples-to-apples comparison with the scikit-learn pipeline.

We need to deal with the serialization of objects and parameters. @davies is the expert. I expect the Python API to be very similar to the Scala/Java API.

> Finally, how do we plan to programmatically answer (developer/user) queries about algorithm properties such as multiclass classification support, internal storage format, etc.

This is beyond this PR. SPARK-3702 is relevant to your question, which @jkbradley is working on.
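mengxr's argument above — that a fit-time parameter overrides, but does not mutate, the values attached to the estimator — can be sketched standalone. This is a hedged illustration with made-up names (`TinyEstimator`, `effectiveParams`), not the PR's actual code:

```scala
// Minimal sketch (hypothetical names, not the PR's classes) of why
// fit-time overrides compose nicely: the override map wins for one
// call, while the values attached to the estimator stay untouched.
case class Param[T](name: String)

class TinyEstimator {
  // params attached to the estimator (the "setter" state)
  var embedded: Map[String, Any] = Map("maxIter" -> 20, "regParam" -> 0.1)

  val maxIter: Param[Int] = Param("maxIter")
  val regParam: Param[Double] = Param("regParam")

  def set[T](param: Param[T], value: T): this.type = {
    embedded += (param.name -> value); this
  }

  // fit-time overrides take precedence but do not mutate `embedded`
  def effectiveParams(overrides: (Param[_], Any)*): Map[String, Any] =
    embedded ++ overrides.map { case (p, v) => p.name -> v }
}

object Demo extends App {
  val lr = new TinyEstimator
  println(lr.effectiveParams(lr.maxIter -> 50)("maxIter")) // 50 for this call only
  println(lr.embedded("maxIter"))                          // still 20 afterwards
}
```

With only setters, the caller has to save and restore the old value by hand, which is exactly the boilerplate shown in the comment above.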
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61922586 The model serving work would really benefit from being able to evaluate models without requiring a Spark context, especially since we are shooting for latencies in the tens of milliseconds. More generally, we should think about how one might want to use the artifact of the pipeline. I suspect there are uses that may exist outside of Spark, and the extent to which the models themselves are "portable functions" could enable greater adoption.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61922261 @shivaram Thanks for your feedback!

> Could we have constructors also with getters and setters?

It would be hard to maintain binary compatibility in the long run. After we remove the @Experimental tag, each time we add a new parameter we would need a new auxiliary constructor. Setters also make it easier to deprecate parameters. Constructors with default arguments would also make Scala code different from Java's, which we try to avoid. Using getters and setters, the Java code `JavaLogisticRegressionSuite` looks very similar to the Scala code `LogisticRegressionSuite`.

> Also, FWIW I am not very sure about having setters and getters in traits like HasRegParam etc. Doesn't this create a lot of classes as we add more, each having 2-3 params?

I'm okay with either approach. I feel this is a nice feature of Scala. Some parameters are very common, e.g., `regParam`, `maxIter`, and `featuresCol`, and it is tedious to copy the same code around. We could also group parameter traits into something like `IterativeParams` or `RegressionModelParams` later. However, when I coded this up, I found Java doesn't interpret the return type correctly, so I have to override the setters in each class, which is bad.

> Passing params across modules.

A model parameter like `featuresCol` is set before training, so we still need to deal with the logic after training: 1) if `lr.model.featuresCol` is set, use it; 2) if `lr.model.featuresCol` is not set, use `lr.featuresCol`, or keep the default if `lr.featuresCol` is not set either.

> In addition to the SchemaRDD based transformer API, it would be good to have a single Row transformation API too.

I agree. We need to know the schema in order to transform individual instances, and the row object doesn't have a reference to its schema. We can add `schema` as a parameter, which is required to transform a row. Btw, we will keep the `predict` method that works on a single Vector.

> Other than the API, the biggest worry I have is in terms of memory management.

We can call `unpersist` inside `StandardScaler` after computing the mean and the standard deviation, so it is not an issue here, but I got your point. We can add pipeline components that persist/checkpoint input datasets. When to `unpersist` is always the problem. I'll think about it.

> Also if you don't mind I'll try to port some of the image processing pipelines we have to this class structure by forking your branch. I feel that'll help me figure out what features are easy to use etc.

That's great! I feel that making the code compile is really helpful for seeing the trade-offs. But please understand that this is a WIP and I may update the code.
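The `this.type` trait-setter pattern mengxr describes — and why chaining works in Scala but degrades for Java callers — can be sketched as follows. The trait and class names are illustrative, not the PR's:

```scala
// Sketch (assumed names) of the shared-param-trait pattern under discussion.
// In Scala, returning this.type lets calls chain on the concrete class.
// From Java, however, the erased return type of setMaxIter is the trait
// itself, so a Java caller loses the concrete type unless each class
// overrides the setter — the problem mengxr mentions above.
trait HasMaxIter {
  protected var maxIterValue: Int = 100
  def setMaxIter(value: Int): this.type = { maxIterValue = value; this }
  def getMaxIter: Int = maxIterValue
}

trait HasRegParam {
  protected var regParamValue: Double = 0.0
  def setRegParam(value: Double): this.type = { regParamValue = value; this }
  def getRegParam: Double = regParamValue
}

class LogisticRegressionSketch extends HasMaxIter with HasRegParam {
  def describe: String = s"maxIter=$getMaxIter, regParam=$getRegParam"
}

object TraitDemo extends App {
  // Chaining works in Scala because this.type resolves to the concrete class.
  val lr = new LogisticRegressionSketch().setMaxIter(50).setRegParam(0.1)
  println(lr.describe) // maxIter=50, regParam=0.1
}
```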
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19923238 --- Diff: mllib/src/main/scala/org/apache/spark/ml/parameters.scala ---

```
@@ -0,0 +1,267 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import java.lang.reflect.Modifier
+
+import scala.language.implicitConversions
+import scala.collection.mutable
+
+/**
+ * A param with self-contained documentation and optionally a default value.
+ *
+ * @param parent parent object
+ * @param name param name
+ * @param doc documentation
+ * @param default optional default value
+ * @tparam T param value type
+ */
+class Param[T] private (
+    val parent: Identifiable,
+    val name: String,
+    val doc: String,
+    val default: Option[T]) extends Serializable {
+
+  /**
+   * Creates a param without a default value.
+   *
+   * @param parent parent object
+   * @param name param name
+   * @param doc documentation
+   */
+  def this(parent: Identifiable, name: String, doc: String) = this(parent, name, doc, None)
+
+  /**
+   * Creates a param with a default value.
+   *
+   * @param parent parent object
+   * @param name param name
+   * @param doc documentation
+   * @param default default value
+   */
+  def this(parent: Identifiable, name: String, doc: String, default: T) =
+    this(parent, name, doc, Some(default))
+
+  /**
+   * Creates a param pair with the given value (for Java).
+   */
+  def w(value: T): ParamPair[T] = ParamPair(this, value)
+
+  /**
+   * Creates a param pair with the given value (for Scala).
+   */
+  def ->(value: T): ParamPair[T] = ParamPair(this, value)
+
+  override def toString: String = {
+    if (default.isDefined) {
+      s"$name: $doc (default: ${default.get})"
+    } else {
+      s"$name: $doc"
+    }
+  }
+}
+
+/**
+ * A param and its value.
+ */
+case class ParamPair[T](param: Param[T], value: T)
+
+/**
+ * Trait for components that take parameters.
+ */
+trait Params {
+
+  /** Returns all params. */
+  def params: Array[Param[_]] = {
+    val methods = this.getClass.getMethods
+    methods.filter { m =>
+      Modifier.isPublic(m.getModifiers) &&
+        classOf[Param[_]].isAssignableFrom(m.getReturnType) &&
+        m.getParameterTypes.isEmpty
+    }.map(m => m.invoke(this).asInstanceOf[Param[_]])
+  }
+
+  /** Gets a param by its name. */
+  def getParam(paramName: String): Param[Any] = {
+    val m = this.getClass.getMethod(paramName)
+    assert(Modifier.isPublic(m.getModifiers) &&
+      classOf[Param[_]].isAssignableFrom(m.getReturnType))
+    m.invoke(this).asInstanceOf[Param[Any]]
+  }
+
+  /**
+   * Validates parameters specified by the input parameter map.
+   * Raises an exception if any parameter belonging to this object is invalid.
+   */
+  def validateParams(paramMap: ParamMap): Unit = {}
+
+  /**
+   * Returns the documentation of all params.
+   */
+  def explainParams(): String = params.mkString("\n")
+}
+
+/**
+ * Trait for instances that hold their own param maps.
+ */
+trait OwnParamMap {
+
+  /**
+   * Internal param map.
+   */
+  val paramMap: ParamMap = ParamMap.empty
+
+  /**
+   * Sets a parameter in the own parameter map.
+   */
+  def set[T](param: Param[T], value: T): this.type = {
+    paramMap.put(param.asInstanceOf[Param[Any]], value)
+    this
+  }
+}
+
+private[ml] object Params {
+
+  /**
+   * Returns a Params implementation without any params.
+   */
+  val empty: Params = new Params {
+    override def params: Array[Param[_]] = Array.empty
+  }
+}
+
+/**
+ * A param to value map.
+ */
+class ParamMap private[ml] (
+    priv
```
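The reflection-based `params` discovery in the quoted diff — any public no-arg method returning a `Param[_]` counts as a parameter — can be demonstrated in isolation. This is a hedged, self-contained sketch using a stand-in class `P` for `Param[_]`:

```scala
// Self-contained sketch of the reflection trick used by Params.params above:
// every public no-arg method whose return type is assignable to P is
// discovered as a parameter. P is a stand-in, not the real Param class.
import java.lang.reflect.Modifier

class P(val name: String) // stand-in for Param[_]

trait HasParams {
  def params: Array[P] = {
    this.getClass.getMethods.filter { m =>
      Modifier.isPublic(m.getModifiers) &&
        classOf[P].isAssignableFrom(m.getReturnType) &&
        m.getParameterTypes.isEmpty
    }.map(m => m.invoke(this).asInstanceOf[P])
  }
}

class Example extends HasParams {
  // Scala vals compile to public accessor methods, so both are discovered.
  val maxIter = new P("maxIter")
  val regParam = new P("regParam")
}

object ReflectDemo extends App {
  println(new Example().params.map(_.name).sorted.mkString(",")) // maxIter,regParam
}
```

Note that `params` itself is not picked up, since its return type is `Array[P]`, not `P`.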
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61916011 I have a few comments on the API:

1. Like @jkbradley, I prefer ```lr.setMaxIter(50)``` over ```lr.set(lr.maxIter, 50)```. Also, I prefer to avoid passing parameters to fit like ```lr.fit(dataset, lr.maxIter -> 50)```.
2. Constructors with getters and setters as @shivaram pointed out will be great. The LOC reduction is important and should not be discounted.
3. Do we plan to provide syntactic sugar such as a ```predict``` method when we use ```model``` to transform a dataset? For me, ```transform``` fits well with the feature engineering stage and ```predict``` after the model training has been performed.
4. It will be great to see the corresponding examples in Python. The getters/setters would map well to Python properties. Also, it will be nice to do an apples-to-apples comparison with the scikit-learn pipeline.
5. Finally, how do we plan to programmatically answer (developer/user) queries about algorithm properties such as multiclass classification support, internal storage format, etc.?
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61913282 @mengxr Thanks for putting this together! I have some high-level comments from looking at the code.

1. Could we have constructors also with getters and setters? This is the same style we use in MLlib and it will make users' code much cleaner.

```
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
val lr = new LogisticRegression()
  .setFeaturesCol("scaledFeatures")
```

will become one-liners like

```
val scaler = new StandardScaler(inputCol = "features", outputCol = "scaledFeatures")
val lr = new LogisticRegression(featuresCol = "scaledFeatures")
```

Also, we would still have an empty constructor that just sets the default values that exist for Params right now. FWIW I am not very sure about having setters and getters in traits like HasRegParam etc. Doesn't this create a lot of classes as we add more, each having 2-3 params?

2. Passing params across modules: The code to pass params from Estimator to Transformer is a little clunky right now. I think this can also be solved with the constructor interface above. For example, in LogisticRegression.scala,

```
val lrm = new LogisticRegressionModel(lr.run(instances).weights)
if (!lrm.paramMap.contains(lrm.featuresCol) && map.contains(lrm.featuresCol)) {
  lrm.setFeaturesCol(featuresCol)
}
```

can become

```
val lrm = new LogisticRegressionModel(lr.run(instances).weights, featuresCol = featuresCol)
```

3. In addition to the SchemaRDD-based transformer API, it would be good to have a single-Row transformation API too. If you want to reuse the same transformers for training and for online prediction, you would want to use the same Pipeline, but with one element at a time or something like that. There are some folks in the AMPLab (cc @jegonzal @dcrankshaw) who are working on model serving, and this would be useful for integration with such systems. Also, I can see the single-element API being very useful for unit testing.

4. Other than the API, the biggest worry I have is in terms of memory management. Because most of the transformations can be lazy operations, it is very tricky to figure out when the programmer should call persist/unpersist. For example, in StandardScaler.scala we now have `input.cache()`, but AFAIK there is no simple way to unpersist this RDD if the Scaler is being run lazily.

Also, if you don't mind, I'll try to port some of the image processing pipelines we have to this class structure by forking your branch. I feel that'll help me figure out what features are easy to use etc.
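Shivaram's point 3 — that a per-row transform needs the schema passed alongside the row, since a row carries no schema reference of its own — could look roughly like this. All names here are illustrative, not a proposed API:

```scala
// Hedged sketch of the "single Row" idea: the row itself is just a
// sequence of values, so the transform takes an explicit schema to
// locate the input column. Hypothetical names throughout.
case class Schema(fieldNames: Seq[String]) {
  def indexOf(name: String): Int = fieldNames.indexOf(name)
}

class RowScaler(inputCol: String, factor: Double) {
  def transformRow(schema: Schema, row: Seq[Any]): Seq[Any] = {
    val i = schema.indexOf(inputCol)
    row.updated(i, row(i).asInstanceOf[Double] * factor)
  }
}

object RowDemo extends App {
  val schema = Schema(Seq("id", "feature"))
  val scaler = new RowScaler("feature", 2.0)
  println(scaler.transformRow(schema, Seq(1, 3.0))) // List(1, 6.0)
}
```

A serving system could hold the schema once and push rows through one at a time, without a Spark context.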
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61905231 [Test build #22962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22962/consoleFull) for PR 3099 at commit [`1ef26e0`](https://github.com/apache/spark/commit/1ef26e07af574dee8432a1509d0efb46c6074b1d). * This patch **fails RAT tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61905232 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22962/ Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61905223 [Test build #22962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22962/consoleFull) for PR 3099 at commit [`1ef26e0`](https://github.com/apache/spark/commit/1ef26e07af574dee8432a1509d0efb46c6074b1d). * This patch merges cleanly.
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61864795 I vote for the lr.setRegParam(0.1) setup. I also vote for setting parameters beforehand, and not allowing parameters in fit().
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19897343 --- Diff: mllib/src/main/scala/org/apache/spark/ml/parameters.scala ---
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61858097 Both the reference and the class internals are immutable, no? Typical Java conventions would put such a variable in all caps, though maybe in Scala it's different?
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61854896 @sryza `maxIter` is not a constant. It is a parameter key in the current design.
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61853890 If maxIter is a constant, would it be clearer to use MAX_ITER?
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61785223 @mengxr Yes I agree with your conclusion. With a generic get/set by key method, I suppose you don't strictly need all the particular getters and setters. I suppose that's simpler, and if there's just one way to do it, there's no confusion about key vs getter/setter. I don't know if you want to remove the getters/setters entirely, but it wouldn't be crazy.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61775481 @srowen Glad to hear that you like it :) Your feedback will be greatly appreciated, but I don't want to waste your time on minor details. Let's run the discussion in the main thread, so we can still see it when I update the code. One thing I have been struggling with is the way we handle parameters here. Please check the last section of the PR description. Basically, I think

~~~
val lr = new LogisticRegression()
  .setMaxIter(50)
  .setRegParam(0.1)
~~~

looks better than

~~~
val lr = new LogisticRegression()
lr.set(lr.maxIter, 50)
  .set(lr.regParam, 0.1)
~~~

But if we use setters/getters, it is a little weird that `lr.maxIter` is a parameter key instead of its value. Or we can let `lr.maxIter` store the value (so the setters/getters can be generated using @BeanProperty), and then use something like `lr.maxIter_` (underscore) as the parameter key. The code would become

~~~
val lr = new LogisticRegression()
  .setRegParam(0.1) // attach regParam = 0.1 to the object "lr"
val lrParamMaps = new ParamMapBuilder()
  .addMulti(lr.maxIter_, Array(10, 20)) // specify parameters (with underscore) outside the object "lr"
  .build()
// multi-model training
val models = lr.fit(dataset, lrParamMaps)
~~~

Does it look better than the current version?
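The "param field is a key, not a value" convention under discussion can be sketched minimally. The real Param in the PR also carries a parent, doc string, and optional default, which this illustration omits:

```scala
// Minimal sketch: a Param field acts as a typed key, and pairing a key
// with a value produces a ParamPair. Illustrative, stripped-down names.
class Param[T](val name: String) {
  def ->(value: T): ParamPair[T] = ParamPair(this, value) // Scala sugar
  def w(value: T): ParamPair[T] = ParamPair(this, value)  // Java-friendly alias
}

case class ParamPair[T](param: Param[T], value: T)

object KeyDemo extends App {
  val maxIter = new Param[Int]("maxIter")
  val pair = maxIter -> 50 // the member `->` wins over Predef's tuple arrow
  println(pair.param.name + "=" + pair.value) // maxIter=50
}
```

This is why `lr.maxIter -> 50` in the `fit` examples above type-checks: the field is a `Param[Int]` key, and `-> 50` builds the key/value pair.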
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19859297 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Model.scala --- @@ -0,0 +1,6 @@ +package org.apache.spark.ml --- End diff -- Oh Ok, is it worth me reviewing this yet? I'm liking it already.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19859152 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Model.scala --- @@ -0,0 +1,6 @@ +package org.apache.spark.ml --- End diff -- Hey @srowen, please ignore these. I'm not trying to let it pass Jenkins now.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19859074 --- Diff: mllib/src/main/scala/org/apache/spark/ml/example/CrossValidator.scala --- @@ -0,0 +1,64 @@ +package org.apache.spark.ml.example --- End diff -- Same, copyright header
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19859052 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Model.scala --- @@ -0,0 +1,6 @@ +package org.apache.spark.ml --- End diff -- (Missing copyright header)
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61748064 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22907/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61748036 [Test build #22907 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22907/consoleFull) for PR 3099 at commit [`376db0a`](https://github.com/apache/spark/commit/376db0add8ec05252a1e8ffc2f622f3dfab1b0cd).
* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61748060 [Test build #22907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22907/consoleFull) for PR 3099 at commit [`376db0a`](https://github.com/apache/spark/commit/376db0add8ec05252a1e8ffc2f622f3dfab1b0cd).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds no public classes.
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/3099 [WIP][SPARK-3530][MLLIB] pipeline and parameters

### This is a WIP, so please do not spend time on individual lines.

This PR adds the package `org.apache.spark.ml` with pipeline and parameter support, as discussed on the JIRA. This is joint work of @jkbradley @etrain @shivaram and many others who helped on the design, also with help from @marmbrus and @liancheng on the Spark SQL side. The design doc can be found at: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

Again, though it compiles, this is a WIP, so please do not spend time on individual lines.

The new set of APIs, inspired by the AMPLab MLI project and scikit-learn, leverages Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline:

1. Transformer, which transforms one dataset into another
2. Estimator, which fits models to data, where models are themselves transformers
3. Evaluator, which evaluates model output and returns a scalar metric
4. Pipeline, a simple pipeline that consists of transformers and estimators

Parameters can be supplied at fit/transform time or embedded in the component itself:

1. Param: a strongly typed parameter key with self-contained doc
2. ParamMap: a param -> value map
3. Params: trait for components with parameters
4. OwnParamMap: trait for components with an embedded parameter map

For any component that implements `Params`, users can easily check the docs by calling `explainParams`:

~~~
> val lr = new LogisticRegression
> lr.explainParams
maxIter: max number of iterations (default: 100)
regParam: regularization constant (default: 0.1)
labelCol: label column name (default: label)
featuresCol: features column name (default: features)
~~~

or check an individual param:

~~~
> lr.maxIter
maxIter: max number of iterations (default: 100)
~~~

**Please start with the example code in `LogisticRegressionSuite.scala`, where I put three examples:**

1. Run a simple logistic regression job:

~~~
val lr = new LogisticRegression
lr.set(lr.maxIter, 10)
  .set(lr.regParam, 1.0)
val model = lr.fit(dataset)
model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
  .select('label, 'score, 'prediction).collect()
  .foreach(println)
~~~

2. Run logistic regression with cross-validation and grid search, using areaUnderROC as the metric:

~~~
val lr = new LogisticRegression
val cv = new CrossValidator
val eval = new BinaryClassificationEvaluator
val lrParamMaps = new ParamGridBuilder()
  .addMulti(lr.regParam, Array(0.1, 100.0))
  .addMulti(lr.maxIter, Array(0, 5))
  .build()
cv.set(cv.estimator, lr.asInstanceOf[Estimator[_]])
  .set(cv.estimatorParamMaps, lrParamMaps)
  .set(cv.evaluator, eval)
  .set(cv.numFolds, 3)
val bestModel = cv.fit(dataset)
~~~

3. Run a pipeline consisting of a standard scaler and a logistic regression component:

~~~
val scaler = new StandardScaler
scaler.set(scaler.inputCol, "features")
  .set(scaler.outputCol, "scaledFeatures")
val lr = new LogisticRegression
lr.set(lr.featuresCol, "scaledFeatures")
val pipeline = new Pipeline
val model = pipeline.fit(dataset, pipeline.stages -> Array(scaler, lr))
model.transform(dataset)
  .select('label, 'score, 'prediction)
  .collect()
  .foreach(println)
~~~

**What is missing now and will be added soon:**

1. Runtime checks of schemas, so that before we touch the data, we go through the schema and make sure column names and types match the input parameters.
2. Java examples.
3. Serialization.
4. Storing training parameters in trained models.

**What I'm not very confident about and definitely need feedback on:**

The usage of parameters is a little messy. The interface I do want to keep is the one that supports multi-model training:

~~~
def fit(dataset: SchemaRDD, paramMaps: Array[ParamMap]): Array[Model]
~~~

which leaves room for the algorithm to optimize the training phase. This is different from scikit-learn's design, where `fit` returns the estimator itself; for large-scale datasets, we do want to train in parallel and return multiple models. Now there are two places where you can specify parameters:

1. using the embedded paramMap:

~~~
lr.set(lr.maxIter, 50)
~~~

2. at fit/transform time:

~~~
lr.fit(dataset, lr.maxIter -> 50)
~~~

The la
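To make the Param / ParamMap / Params abstractions described above concrete, here is a hypothetical, much-simplified Scala sketch. It is not the actual `org.apache.spark.ml` code from this PR; the class and method names (`explainParams`, the `->` sugar, default fallback in `ParamMap`) merely mirror the behavior the description shows:

```scala
// Simplified sketch of the Param / ParamMap / Params idea -- NOT the real
// org.apache.spark.ml implementation, just an illustration of the design.

// A strongly typed parameter key with self-contained documentation.
case class Param[T](name: String, doc: String, default: Option[T]) {
  // Sugar for building (param -> value) pairs, as in `lr.maxIter -> 50`.
  def ->(value: T): (Param[T], T) = (this, value)
  override def toString: String =
    default.map(d => s"$name: $doc (default: $d)").getOrElse(s"$name: $doc")
}

// A param -> value map; lookups fall back to the param's declared default.
class ParamMap(private val map: Map[Param[_], Any] = Map.empty) {
  def put[T](pair: (Param[T], T)): ParamMap = new ParamMap(map + pair)
  def apply[T](param: Param[T]): T =
    map.get(param).orElse(param.default)
      .map(_.asInstanceOf[T])
      .getOrElse(throw new NoSuchElementException(param.name))
}

// Trait for components that carry parameters and can explain them.
trait Params {
  def params: Seq[Param[_]]
  def explainParams(): String = params.mkString("\n")
}

// A toy component exposing two of the params from the example output above.
class LogisticRegression extends Params {
  val maxIter = Param[Int]("maxIter", "max number of iterations", Some(100))
  val regParam = Param[Double]("regParam", "regularization constant", Some(0.1))
  def params: Seq[Param[_]] = Seq(maxIter, regParam)
}

object ParamSketch extends App {
  val lr = new LogisticRegression
  println(lr.explainParams())
  // An override supplied at fit/transform time, per the second style above:
  val paramMap = new ParamMap().put(lr.maxIter -> 50)
  println(paramMap(lr.maxIter))  // 50, overriding the default of 100
  println(paramMap(lr.regParam)) // falls back to the default 0.1
}
```

The key design point this illustrates is that a `Param[T]` is a typed key rather than a bare string, so overrides passed at fit/transform time type-check at compile time while defaults still live next to their documentation.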