[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-31 Thread tomerk
Github user tomerk commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r23891560
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/Regressor.scala ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import org.apache.spark.annotation.{DeveloperApi, AlphaComponent}
+import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor, PredictorParams}
+
+/**
+ * :: DeveloperApi ::
+ * Params for regression.
+ * Currently empty, but may add functionality later.
+ */
+@DeveloperApi
+trait RegressorParams extends PredictorParams
+
+/**
+ * :: AlphaComponent ::
+ *
+ * Single-label regression
+ *
+ * @tparam FeaturesType  Type of input features.  E.g., [[org.apache.spark.mllib.linalg.Vector]]
+ * @tparam Learner  Concrete Estimator type
+ * @tparam M  Concrete Model type
+ */
+@AlphaComponent
+abstract class Regressor[
+FeaturesType,
+Learner <: Regressor[FeaturesType, Learner, M],
--- End diff --

Maybe it would also make sense to have FeaturesType be a single letter for 
consistency?





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-26 Thread tomerk
Github user tomerk commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-71546339
  
Well, from my perspective, an ideal Scala-only interface for the developer API example would look something like the following:

```scala
/**
 * Example of defining a type of [[Classifier]].
 *
 * NOTE: This is private since it is an example.  In practice, you may not want it to be private.
 */
private class MyLogisticRegression
  extends Classifier[Vector]
  with MaxIterParam(100) { // Initialize default value of MaxIter

  // This method is used by fit()
  override protected def train(
      dataset: SchemaRDD,
      params: ParamMap): MyLogisticRegressionModel = {
    // Extract columns from data using helper method.
    val oldDataset = extractLabeledPoints(dataset, params)

    // Do learning to estimate the weight vector.
    val numFeatures = oldDataset.take(1)(0).features.size
    val weights = Vectors.zeros(numFeatures) // Learning would happen here.

    // Create a model, and return it.
    new MyLogisticRegressionModel(weights)
  }
}

/**
 * Example of defining a type of [[ClassificationModel]].
 *
 * NOTE: This is private since it is an example.  In practice, you may not want it to be private.
 */
private class MyLogisticRegressionModel(val weights: Vector)
  extends ClassificationModel[Vector] {

  // This uses the default implementation of transform(), which reads column "features" and outputs
  // columns "prediction" and "rawPrediction."

  // This uses the default implementation of predict(), which chooses the label corresponding to
  // the maximum value returned by [[predictRaw()]].

  /**
   * Raw prediction for each possible label.
   * The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives
   * a measure of confidence in each possible label (where larger = more confident).
   * This internal method is used to implement [[transform()]] and output [[rawPredictionCol]].
   *
   * @return  vector where element i is the raw prediction for label i.
   *          This raw prediction may be any real number, where a larger value indicates greater
   *          confidence for that label.
   */
  override protected def predictRaw(features: Vector): Vector = {
    val margin = BLAS.dot(features, weights)
    // There are 2 classes (binary classification), so we return a length-2 vector,
    // where index i corresponds to class i (i = 0, 1).
    Vectors.dense(-margin, margin)
  }

  /** Number of classes the label can take.  2 indicates binary classification. */
  override val numClasses: Int = 2
}
```

I guess some things of note here are:
- Less parameter trickiness (already discussed)
- Fewer generics needed everywhere (thanks to Scala type inference and this.type)
- No need for developers to specify their own copy method (which would require developers to remember that the parameter map needs a deep copy); it would just happen in the background somehow (see the sketch after this list)
- No need to specify the "fittingParamMap" in the transformer's definition; the background stuff should automatically pass everything along to the transformer
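
As an illustration of the copy-method point, here is a minimal sketch of a fit() that copies the parameter map into the produced model behind the scenes; SimpleEstimator, SimpleModel, and fittingParams are hypothetical names, not the actual spark.ml API:

```scala
// Hypothetical sketch: the framework base class copies the fitting parameters
// into the model, so concrete estimators and models never write copy() themselves.
abstract class SimpleModel {
  // Populated by the framework when the model is produced, not by developers.
  var fittingParams: Map[String, Any] = Map.empty
}

abstract class SimpleEstimator[M <: SimpleModel] {
  // Simplified parameter storage; spark.ml's ParamMap is richer than this.
  protected var params: Map[String, Any] = Map.empty

  // Developers implement only this.
  protected def train(dataset: Seq[(Double, Array[Double])]): M

  final def fit(dataset: Seq[(Double, Array[Double])]): M = {
    val model = train(dataset)
    // Copy into a fresh immutable map so later changes to this estimator's
    // parameters cannot leak into the already-fitted model.
    model.fittingParams = Map.empty ++ params
    model
  }
}
```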

Like I said, I have doubts about how much of this can be done because of 
the need to support Java.





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-26 Thread tomerk
Github user tomerk commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-71510427
  
While this does look better, it still seems like a much trickier interface 
than it theoretically should be (at least if it only needed to support Scala). 
Given the need to support Java interop, though, I'm not sure what could be done 
to improve it.





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-09 Thread tomerk
Github user tomerk commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r22752722
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala
 ---
@@ -0,0 +1,143 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import org.apache.spark.annotation.{AlphaComponent, DeveloperApi}
+import org.apache.spark.ml.param.{HasProbabilityCol, ParamMap, Params}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.analysis.Star
+
+/**
+ * Params for probabilistic classification.
+ */
+private[classification] trait ProbabilisticClassifierParams
+  extends ClassifierParams with HasProbabilityCol {
+
+  override protected def validateAndTransformSchema(
+  schema: StructType,
+  paramMap: ParamMap,
+  fitting: Boolean,
+  featuresDataType: DataType): StructType = {
+val parentSchema = super.validateAndTransformSchema(schema, paramMap, fitting, featuresDataType)
+val map = this.paramMap ++ paramMap
+addOutputColumn(parentSchema, map(probabilityCol), new VectorUDT)
+  }
+}
+
+
+/**
+ * :: AlphaComponent ::
+ *
+ * Single-label binary or multiclass classifier which can output class conditional probabilities.
+ *
+ * @tparam FeaturesType  Type of input features.  E.g., [[Vector]]
+ * @tparam Learner  Concrete Estimator type
+ * @tparam M  Concrete Model type
+ */
+@AlphaComponent
+abstract class ProbabilisticClassifier[
+FeaturesType,
+Learner <: ProbabilisticClassifier[FeaturesType, Learner, M],
+M <: ProbabilisticClassificationModel[FeaturesType, M]]
+  extends Classifier[FeaturesType, Learner, M] with ProbabilisticClassifierParams {
+
+  def setProbabilityCol(value: String): Learner = set(probabilityCol, value).asInstanceOf[Learner]
+}
+
+
+/**
+ * :: AlphaComponent ::
+ *
+ * Model produced by a [[ProbabilisticClassifier]].
+ * Classes are indexed {0, 1, ..., numClasses - 1}.
+ *
+ * @tparam FeaturesType  Type of input features.  E.g., [[Vector]]
+ * @tparam M  Concrete Model type
+ */
+@AlphaComponent
+abstract class ProbabilisticClassificationModel[
+FeaturesType,
+M <: ProbabilisticClassificationModel[FeaturesType, M]]
+  extends ClassificationModel[FeaturesType, M] with ProbabilisticClassifierParams {
+
+  def setProbabilityCol(value: String): M = set(probabilityCol, value).asInstanceOf[M]
+
+  /**
+   * Transforms dataset by reading from [[featuresCol]], and appending new columns as specified by
+   * parameters:
+   *  - predicted labels as [[predictionCol]] of type [[Double]]
+   *  - raw predictions (confidences) as [[rawPredictionCol]] of type [[Vector]]
+   *  - probability of each class as [[probabilityCol]] of type [[Vector]].
+   *
+   * @param dataset input dataset
+   * @param paramMap additional parameters, overwrite embedded params
+   * @return transformed dataset
+   */
+  override def transform(dataset: SchemaRDD, paramMap: ParamMap): SchemaRDD = {
+// This default implementation should be overridden as needed.
+import dataset.sqlContext._
+import org.apache.spark.sql.catalyst.dsl._
+
+// Check schema
+transformSchema(dataset.schema, paramMap, logging = true)
+val map = this.paramMap ++ paramMap
+
+// Prepare model
+val tmpModel = if (paramMap.size != 0) {
+  val tmpModel = this.copy()
+  Params.inheritValues(paramMap, parent, tmpModel)
+  tmpModel
+} else {
+  this
+}
+
+val (numColsOutput, outputData) =
+  ClassificationModel.transformColumnsImpl[FeaturesType](dataset, tmpModel, map)
+
+// Output selected columns only.
  

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-09 Thread tomerk
Github user tomerk commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r22752339
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -80,69 +50,157 @@ class LogisticRegression extends Estimator[LogisticRegressionModel] with Logisti
 
   def setRegParam(value: Double): this.type = set(regParam, value)
   def setMaxIter(value: Int): this.type = set(maxIter, value)
-  def setLabelCol(value: String): this.type = set(labelCol, value)
   def setThreshold(value: Double): this.type = set(threshold, value)
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
-  def setScoreCol(value: String): this.type = set(scoreCol, value)
-  def setPredictionCol(value: String): this.type = set(predictionCol, value)
 
   override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
+// Check schema
 transformSchema(dataset.schema, paramMap, logging = true)
-import dataset.sqlContext._
+
+// Extract columns from data.  If dataset is persisted, do not persist oldDataset.
+val oldDataset = extractLabeledPoints(dataset, paramMap)
 val map = this.paramMap ++ paramMap
-val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
-  .map { case Row(label: Double, features: Vector) =>
-LabeledPoint(label, features)
-  }.persist(StorageLevel.MEMORY_AND_DISK)
+val handlePersistence = dataset.getStorageLevel == StorageLevel.NONE
+if (handlePersistence) {
+  oldDataset.persist(StorageLevel.MEMORY_AND_DISK)
+}
+
+// Train model
 val lr = new LogisticRegressionWithLBFGS
 lr.optimizer
   .setRegParam(map(regParam))
   .setNumIterations(map(maxIter))
-val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
-instances.unpersist()
+val oldModel = lr.run(oldDataset)
+val lrm = new LogisticRegressionModel(this, map, oldModel.weights, oldModel.intercept)
+
+if (handlePersistence) {
+  oldDataset.unpersist()
+}
+
 // copy model params
 Params.inheritValues(map, this, lrm)
 lrm
   }
 
-  private[ml] override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
-validateAndTransformSchema(schema, paramMap, fitting = true)
-  }
+  override protected def featuresDataType: DataType = new VectorUDT
 }
 
+
 /**
  * :: AlphaComponent ::
+ *
  * Model produced by [[LogisticRegression]].
  */
 @AlphaComponent
 class LogisticRegressionModel private[ml] (
 override val parent: LogisticRegression,
--- End diff --

Why do models need to have a reference to the Estimator that produced them?





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-09 Thread tomerk
Github user tomerk commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r22752140
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/DeveloperApiExample.scala 
---
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.SparkContext._
+import org.apache.spark.ml.classification.{Classifier, ClassifierParams, ClassificationModel}
+import org.apache.spark.ml.param.{Params, IntParam, ParamMap}
+import org.apache.spark.mllib.linalg.{BLAS, Vector, Vectors, VectorUDT}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.sql.{DataType, SchemaRDD, Row, SQLContext}
+
+/**
+ * A simple example demonstrating how to write your own learning algorithm using Estimator,
+ * Transformer, and other abstractions.
+ * This mimics [[org.apache.spark.ml.classification.LogisticRegression]].
+ * Run with
+ * {{{
+ * bin/run-example ml.DeveloperApiExample
+ * }}}
+ */
+object DeveloperApiExample {
+
+  def main(args: Array[String]) {
+val conf = new SparkConf().setAppName("DeveloperApiExample")
+val sc = new SparkContext(conf)
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Prepare training data.
+val training = sparkContext.parallelize(Seq(
+  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
+  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
+  LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
+  LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))
+
+// Create a LogisticRegression instance.  This instance is an Estimator.
+val lr = new MyLogisticRegression()
+// Print out the parameters, documentation, and any default values.
+println("MyLogisticRegression parameters:\n" + lr.explainParams() + "\n")
+
+// We may set parameters using setter methods.
+lr.setMaxIter(10)
+
+// Learn a LogisticRegression model.  This uses the parameters stored in lr.
+val model = lr.fit(training)
+
+// Prepare test data.
+val test = sparkContext.parallelize(Seq(
+  LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
+  LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
+  LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5))))
+
+// Make predictions on test data.
+val sumPredictions: Double = model.transform(test)
+  .select('features, 'label, 'prediction)
+  .collect()
+  .map { case Row(features: Vector, label: Double, prediction: Double) =>
+prediction
+  }.sum
+assert(sumPredictions == 0.0,
+  "MyLogisticRegression predicted something other than 0, even though all weights are 0!")
+  }
+}
+
+/**
+ * Example of defining a parameter trait for a user-defined type of [[Classifier]].
+ *
+ * NOTE: This is private since it is an example.  In practice, you may not want it to be private.
+ */
+private trait MyLogisticRegressionParams extends ClassifierParams {
+
+  /** param for max number of iterations */
+  val maxIter: IntParam = new IntParam(this, "maxIter", "max number of iterations")
+  def getMaxIter: Int = get(maxIter)
+}
+
+/**
+ * Example of defining a type of [[Classifier]].
+ *
+ * NOTE: This is private since it is an example.  In practice, you may not want it to be private.
+ */
+private class MyLogisticRegression
+  extends Classifier[Vector, MyLogisticRegression, MyLogisticRegressionModel]
+  with MyLogisticRegressionParams {
+
+  setMaxIter(100) // Initialize
+
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+
+  override def fit(dataset: SchemaRDD, paramMap: ParamMap): 
MyLogistic

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-09 Thread tomerk
Github user tomerk commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-69393149
  
After taking a look at this pull request I have a few thoughts on what 
could make it simpler for developers to create new Transformers and Estimators: 
- I can't quite tell from the conversation tab whether the typed vs. untyped 
interface for Estimators has been resolved yet. My vote would be that, regardless 
of whether a strongly typed interface is exposed to the public, a strongly typed 
interface would be simpler for developers to extend. I personally think it would 
be reasonable to have a strongly typed interface that only some Estimators extend 
if it's not exposed to the public.

- At the start of every transform and fit function, developers have to 
remember to call:
```scala
transformSchema(dataset.schema, paramMap, logging = true)
val map = this.paramMap ++ paramMap
```  
It doesn't seem like it should be that hard to define protected methods for 
developers to override that don't have to do this, and then have fit and 
transform execute these two lines before delegating to the appropriate protected 
method (a sketch follows below).
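
A minimal sketch of that template-method shape, using hypothetical names (AbstractTransformer, transformImpl) and simplified stand-ins for SchemaRDD/ParamMap rather than the real spark.ml types:

```scala
// Hypothetical sketch, not the actual spark.ml API: the public transform()
// does the schema check and the param-map merge once, then delegates to a
// protected method that developers override.
abstract class AbstractTransformer {
  type Dataset = Seq[Map[String, Any]]   // simplified stand-in for SchemaRDD
  type ParamMap = Map[String, Any]       // simplified stand-in for ml.param.ParamMap

  /** Parameters embedded in this transformer instance. */
  protected def embeddedParams: ParamMap = Map.empty

  /** Developers override only this; `map` already merges embedded and call-site params. */
  protected def transformImpl(dataset: Dataset, map: ParamMap): Dataset

  /** Schema check would live here (was: transformSchema(dataset.schema, paramMap, logging = true)). */
  protected def validateSchema(dataset: Dataset, map: ParamMap): Unit = ()

  /** The boilerplate runs exactly once, here, before delegating. */
  final def transform(dataset: Dataset, paramMap: ParamMap): Dataset = {
    validateSchema(dataset, paramMap)
    val map = embeddedParams ++ paramMap   // was: this.paramMap ++ paramMap
    transformImpl(dataset, map)
  }
}
```

An Estimator could be split the same way, with fit() doing the check and the merge before calling a protected train().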

- If we parameterize all the sharedParams mixins, it may be possible to avoid 
having to implement setters.
E.g., if we defined HasRawPredictionCol along the lines of:
```scala
private[ml] trait HasRawPredictionCol[+T] extends Params {
  /** param for raw prediction column name */
  val rawPredictionCol: Param[String] =
    new Param(this, "rawPredictionCol", "raw prediction (a.k.a. confidence) column name",
      Some("rawPrediction"))
  def getRawPredictionCol: String = get(rawPredictionCol)
  def setRawPredictionCol(value: String): T = set(rawPredictionCol, value).asInstanceOf[T]
}
```
If we then passed in the f-bounded type (which we're capturing anyway) when 
specifying ```with HasRawPredictionCol[M]```, classes that mix it in wouldn't 
have to define the setters themselves (a usage sketch follows below).
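
For illustration, a self-contained sketch of that pattern with simplified stand-ins (Param, Params, set, get, and MyModel here are all hypothetical, not the real spark.ml classes); the concrete class only mixes the trait in and inherits a setter that already returns its own type:

```scala
// Simplified stand-ins for the spark.ml param machinery, just to show the typing.
class Param[T](val parent: AnyRef, val name: String, val doc: String, val default: Option[T] = None)

trait Params {
  private var values: Map[String, Any] = Map.empty
  protected def set[V](param: Param[V], value: V): this.type = {
    values += (param.name -> value)
    this
  }
  protected def get[V](param: Param[V]): V =
    values.getOrElse(param.name, param.default.get).asInstanceOf[V]
}

// The parameterized mixin: the f-bounded type T is the concrete class, so the
// setter already returns the right type.
trait HasRawPredictionCol[T] extends Params {
  val rawPredictionCol: Param[String] =
    new Param(this, "rawPredictionCol", "raw prediction column name", Some("rawPrediction"))
  def getRawPredictionCol: String = get(rawPredictionCol)
  def setRawPredictionCol(value: String): T = set(rawPredictionCol, value).asInstanceOf[T]
}

// A model mixes the trait in with its own type and gets a correctly typed setter for free.
class MyModel extends HasRawPredictionCol[MyModel]

object MixinDemo extends App {
  val m: MyModel = new MyModel().setRawPredictionCol("confidence") // no setter written in MyModel
  println(m.getRawPredictionCol) // prints "confidence"
}
```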

- Would it be reasonable to make it so UnaryTransformers and other strongly 
typed interfaces use TypeTags and ```ScalaReflection.schemaFor[T].dataType``` 
to figure out the appropriate input and output datatypes? Then specifying a new 
UnaryTransformer could be as easy as just implementing a new function, with no 
need to figure out how the input and output types translate into Catalyst 
datatypes. This may have plenty of issues since I'm not familiar with Spark SQL, 
but it could allow syntactic sugar such as (sketched below):
```def tokenizer(sep: String) = Transformer[String, Seq[String]](_.split(sep))```
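
A rough sketch of the shape this could take; SimpleUnaryTransformer, Transformer.apply, and transformValue are hypothetical names, and the Catalyst conversion is stubbed out (in real code it would be something like the ScalaReflection.schemaFor call mentioned above):

```scala
import scala.reflect.runtime.universe._

// Hypothetical sketch: capture the input/output types with TypeTags so the
// transformer itself can derive its column datatypes, instead of the developer
// spelling them out. SimpleUnaryTransformer is not an existing spark.ml class.
abstract class SimpleUnaryTransformer[IN: TypeTag, OUT: TypeTag] extends Serializable {
  def transformValue(in: IN): OUT

  // In real code these would be mapped to Catalyst DataTypes, e.g. via
  // ScalaReflection.schemaFor; here we just expose the reflected types.
  def inputType: Type  = typeOf[IN]
  def outputType: Type = typeOf[OUT]
}

// With that in place, defining a new transformer is just supplying a function.
object Transformer {
  def apply[IN: TypeTag, OUT: TypeTag](f: IN => OUT): SimpleUnaryTransformer[IN, OUT] =
    new SimpleUnaryTransformer[IN, OUT] {
      def transformValue(in: IN): OUT = f(in)
    }
}

object TokenizerDemo extends App {
  def tokenizer(sep: String) = Transformer[String, Seq[String]](_.split(sep).toSeq)
  println(tokenizer(" ").transformValue("a b c")) // List(a, b, c)
  println(tokenizer(" ").outputType)              // Seq[String]
}
```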





[GitHub] spark pull request: SPARK-3996: Shade Jetty in Spark deliverables.

2014-11-19 Thread tomerk
Github user tomerk commented on the pull request:

https://github.com/apache/spark/pull/3130#issuecomment-63707707
  
I'm also seeing an error in Spark Streaming when using "mvn package 
-DskipTests":

``` sh
[INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
[INFO] Compiling 80 Scala sources and 2 Java sources to /Users/tomerk11/Development/spark/streaming/target/scala-2.10/classes...
[ERROR]
     while compiling: /Users/tomerk11/Development/spark/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
        during phase: erasure
     library version: version 2.10.4
    compiler version: version 2.10.4
```





[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread tomerk
Github user tomerk commented on the pull request:

https://github.com/apache/spark/pull/3099#issuecomment-62234569
  
Is there anything that can be done about these two lines that appear in 
every transform and fit function:
```scala
import dataset.sqlContext._
val map = this.paramMap ++ paramMap
```
Beyond being boilerplate, the second line is an issue because at least last 
I checked, ++ in ParamMap is marked as private[ml], so it can't be used by 
custom pipeline stages.






[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread tomerk
Github user tomerk commented on a diff in the pull request:

https://github.com/apache/spark/pull/3099#discussion_r20034471
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/example/StandardScaler.scala ---
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.example
+
+import org.apache.spark.ml._
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.feature
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.sql.SchemaRDD
+import org.apache.spark.sql.catalyst.analysis.Star
+import org.apache.spark.sql.catalyst.dsl._
+import org.apache.spark.sql.catalyst.expressions.Row
+
+class StandardScaler extends Estimator[StandardScalerModel] with HasInputCol {
+
+  def setInputCol(value: String): this.type = { set(inputCol, value); this }
+
+  override val modelParams: StandardScalerModelParams = new StandardScalerModelParams {}
+
+  override def fit(dataset: SchemaRDD, paramMap: ParamMap): StandardScalerModel = {
+import dataset.sqlContext._
+val map = this.paramMap ++ paramMap
--- End diff --

This bit of boilerplate that appears in every transform and fit 
implementation should probably be moved into the interface somehow





[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread tomerk
Github user tomerk commented on the pull request:

https://github.com/apache/spark/pull/3099#issuecomment-62112717
  
At @shivaram's suggestion, I started porting over a simple text classifier 
pipeline that was already using an Estimator/Transformer abstraction of RDD[U] 
to RDD[V] transforms to this interface. The almost-complete port (the imports 
got messed up when moving files around) can be found at 
https://github.com/shivaram/spark-ml/commit/522aec73172b28a4bc1b22df030a459fddbd93dd.
 

Beyond what Shivaram already mentioned, here are my thoughts:

1. The trickiest bit by far was all of the implicit conversions. I ended up 
needing several types of implicit conversion imports (case class -> SchemaRDD, 
Spark SQL DSL, parameter map, etc.). They also got mysteriously deleted by the 
IDE as I moved files between projects, so I ended up having to copy and paste 
them whenever appropriate because I couldn't keep track of them.

2. Like Shivaram, I'm also not familiar with the Spark SQL DSL, so here I 
also had to copy and paste code. It's unclear what syntax is valid and what 
isn't. For example, is saying "as outputCol" enough, or is "as 
Symbol(outputCol)" required?

3. There is a lot of boilerplate code. It was easier to write the 
Transformers in the form RDD[U] to RDD[V] instead of SchemaRDD to SchemaRDD, so 
I fully agree with Shivaram on that front. Potentially, certain interfaces 
along those lines (iterator-to-iterator transformers that can be applied to 
RDDs using mapPartitions; see the sketch after this list) could make it easier 
to have transformers not depend on local SparkContexts to execute.

4. I found the parameter mapping in estimators fairly verbose; I like 
Shivaram's idea of having the estimators pass everything to the transformers no 
matter what.

5. Estimators requiring the transformers they output to extend Model didn't 
make sense to me. Certain estimators, such as one that chooses only the most 
frequent tokens in a collection to keep for each document, don't seem like they 
should output models. On that front, should it be required for estimators to 
specify the type of transformer they output? It can sometimes be convenient to 
just inline an anonymous Transformer in the output without making it a top-level class.

6. There are a lot of parameter traits: HasRegParam, HasMaxIter, 
HasScoreCol, HasFeatureCol, etc. Does it make sense to have this many specific 
parameter traits if we still have to maintain boilerplate setter code for Java 
anyway?
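
To illustrate the iterator-to-iterator idea in point 3, here is a minimal sketch; IterTransformer and Tokenizer are hypothetical names, not part of spark.ml:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical sketch: a transformer defined as an Iterator => Iterator
// function, applied to an RDD with mapPartitions. The logic can be unit-tested
// on plain iterators without any SparkContext.
trait IterTransformer[U, V] extends Serializable {
  def transform(it: Iterator[U]): Iterator[V]

  def apply(rdd: RDD[U])(implicit ct: ClassTag[V]): RDD[V] =
    rdd.mapPartitions(transform)
}

// Example: a tokenizer expressed this way.
class Tokenizer(sep: String) extends IterTransformer[String, Seq[String]] {
  def transform(it: Iterator[String]): Iterator[Seq[String]] = it.map(_.split(sep).toSeq)
}
```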

