[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...

feynmanliang Sat, 15 Aug 2015 16:07:52 -0700

Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8197#discussion_r37140382
  
    --- Diff: docs/ml-guide.md ---
    @@ -801,6 +801,141 @@ jsc.stop();
     
     </div>
     
    +## Examples: Summaries for LogisticRegression.
    +
    +Once LogisticRegression is run on data, it is useful to extract statistics 
such as the
    +loss per iteration which will provide an intuition on overfitting and 
metrics to understand
    +how well the model has performed on training and test data.
    +
    +LogisticRegressionTrainingSummary provides an interface to access such 
relevant information
    +i.e the objectiveHistory and metrics to evaluate the performance on the 
training data
    +directly with very less code to be rewritten by the user.
    +In the future, a method would be made available in the fitted 
LogisticRegressionModel to obtain
    +a LogisticRegressionSummary of the test data as well.
    +
    +This examples illustrates the use of LogisticRegressionTrainingSummary on 
some toyData.
    +
    +<div class="codetabs">
    +<div data-lang="scala">
    +{% highlight scala %}
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.ml.classification.{LogisticRegression, 
BinaryLogisticRegressionSummary}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.sql.{Row, SQLContext}
    +
    +val conf = new SparkConf().setAppName("LogisticRegressionSummary")
    +val sc = new SparkContext(conf)
    +val sqlContext = new SQLContext(sc)
    +import sqlContext.implicits._
    +
    +// Use some random data for demonstration.
    +// Note that the RDD of LabeledPoints can be converted to a dataframe 
directly.
    +val data = sc.parallelize(Array(
    +  LabeledPoint(0.0, Vectors.dense(0.2, 4.5, 1.6)),
    +  LabeledPoint(1.0, Vectors.dense(3.1, 6.8, 3.6)),
    +  LabeledPoint(0.0, Vectors.dense(2.4, 0.9, 1.9)),
    +  LabeledPoint(1.0, Vectors.dense(9.1, 3.1, 3.6)),
    +  LabeledPoint(0.0, Vectors.dense(2.5, 1.9, 9.1)))
    +)
    +val logRegDataFrame = data.toDF()
    +
    +// Run Logistic Regression on your toy data.
    +// Since LogisticRegression is an estimator, it returns an instance of 
LogisticRegressionModel
    +// which is a transformer.
    +val logReg = new LogisticRegression()
    +logReg.setMaxIter(5)
    +logReg.setRegParam(0.01)
    +val logRegModel = logReg.fit(logRegDataFrame)
    +
    +// Extract the summary directly from the returned LogisticRegressionModel 
instance.
    +val trainingSummary = logRegModel.summary
    +
    +// Obtain the loss per iteration. This should decrease upto a certain 
point and
    +// then increase or show negligible change after this.
    +val objectiveHistory = trainingSummary.objectiveHistory
    +objectiveHistory.foreach(loss => println(loss))
    +
    +// Obtain the metrics useful to judge performance on test data.
    +val binarySummary = 
trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary]
    +
    +// Obtain the receiver-operating characteristic as a dataframe and 
areaUnderROC.
    +val roc = binarySummary.roc
    +val truePositiveRate = roc.select("FPR")
    +val area = binarySummary.areaUnderROC
    +
    +// Obtain the threshold with the highest fMeasure.
    +val fMeasure = binarySummary.fMeasureByThreshold
    +val fScoreRDD = fMeasure.map { case Row(thresh: Double, fscore: Double) => 
(thresh, fscore) }
    +val (highThresh, highFScore) = fScoreRDD.fold((0.0, 0.0))((threshFScore1, 
threshFScore2) => {
    +  if (threshFScore1._2 > threshFScore2._2) threshFScore1 else threshFScore2
    +})
    --- End diff --
    
    Nice!
    
    I think what you described is useful, but is outside the scope of 
`LogisticRegressionSummary`. L869-L872 don't demonstrate any of the 
functionality these docs are intended to describe, which is why I propose we 
remove it. What do you think?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9906] [ML] User guide for LogisticRegre...

Reply via email to