GitHub user tomerk commented on the pull request:

    https://github.com/apache/spark/pull/3637#issuecomment-71546339
  
    Well, from my perspective, an ideal interface for Scala-only support for the developer API example would look something like the following:
    
    ```scala
    /**
     * Example of defining a type of [[Classifier]].
     *
     * NOTE: This is private since it is an example. In practice, you may not want it to be private.
     */
    private class MyLogisticRegression
      extends Classifier[Vector]
      with MaxIterParam(100) { // Hypothetical syntax: initialize the default value of MaxIter
    
      // This method is used by fit()
      override protected def train(
          dataset: SchemaRDD,
          params: ParamMap): MyLogisticRegressionModel = {
        // Extract columns from data using helper method.
        val oldDataset = extractLabeledPoints(dataset, params)
    
        // Do learning to estimate the weight vector.
        val numFeatures = oldDataset.take(1)(0).features.size
        val weights = Vectors.zeros(numFeatures) // Learning would happen here.
    
        // Create a model, and return it.
        new MyLogisticRegressionModel(weights)
      }
    }
    
    /**
     * Example of defining a type of [[ClassificationModel]].
     *
     * NOTE: This is private since it is an example. In practice, you may not want it to be private.
     */
    private class MyLogisticRegressionModel(val weights: Vector)
      extends ClassificationModel[Vector] {
    
      // This uses the default implementation of transform(), which reads column
      // "features" and outputs columns "prediction" and "rawPrediction."
    
      // This uses the default implementation of predict(), which chooses the label
      // corresponding to the maximum value returned by [[predictRaw()]].
    
      /**
       * Raw prediction for each possible label.
       * The meaning of a "raw" prediction may vary between algorithms, but it
       * intuitively gives a measure of confidence in each possible label
       * (where larger = more confident).
       * This internal method is used to implement [[transform()]] and output
       * [[rawPredictionCol]].
       *
       * @return  vector where element i is the raw prediction for label i.
       *          This raw prediction may be any real number, where a larger value
       *          indicates greater confidence for that label.
       */
      override protected def predictRaw(features: Vector): Vector = {
        val margin = BLAS.dot(features, weights)
        // There are 2 classes (binary classification), so we return a length-2 vector,
        // where index i corresponds to class i (i = 0, 1).
        Vectors.dense(-margin, margin)
      }
    
      /** Number of classes the label can take. 2 indicates binary classification. */
      override val numClasses: Int = 2
    }
    ```
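
    For concreteness, a rough sketch of how this would be used (hypothetical; it assumes fit() keeps its current (SchemaRDD, ParamMap) shape and that ParamMap.empty exists, as in the current API):

    ```scala
    // Hypothetical usage of the sketch above; names are illustrative, not final.
    val lr = new MyLogisticRegression()              // MaxIter defaults to 100 via the mixin
    val model = lr.fit(trainingData, ParamMap.empty) // trainingData: SchemaRDD with "label" and "features"
    val predictions = model.transform(testData)      // adds "prediction" and "rawPrediction" columns
    ```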
    
    I guess some things of note here are:
    - Less parameter trickiness (already discussed)
    - Fewer generics needed everywhere, thanks to Scala type inference and this.type (see the sketch after this list)
    - No need for developers to specify their own copy method (which would require them to remember that the parameter map needs a deep copy); it would just happen in the background somehow
    - No need to specify the "fittingParamMap" in the transformer's definition; the background machinery should automatically pass everything along to the transformer
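
    To illustrate the this.type point, here is a minimal sketch (the trait body is made up for illustration; it is not the real MaxIterParam):

    ```scala
    // Sketch only: a setter returning this.type keeps the concrete subclass type,
    // so callers can chain calls without extra generic type parameters.
    trait MaxIterParam {
      protected var maxIter: Int = 100 // default value
      def setMaxIter(value: Int): this.type = { maxIter = value; this }
    }

    // Chaining preserves the concrete type with no generics at the call site:
    // new MyLogisticRegression().setMaxIter(50).fit(trainingData, ParamMap.empty)
    ```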
    
    Like I said, I have doubts about how much of this can be done because of the need to support Java.

