[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-13 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345934019
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ##
 @@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
     confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)
 
 Review comment:
   TYSM @srowen. Looking forward to more learning opportunities.



[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-13 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345933074
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ##
 @@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
     confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)
 
 Review comment:
   > are you sure it makes a difference meaningful enough to overcome the overhead?
   
   I think not. Persisting `counts` doesn't make sense here; it would just be overhead. Now I am getting a clearer picture of where to use persist.
   Key learnings from this PR about persist (a minimal sketch follows below):
   
   - Persist introduces memory and CPU overhead.
   
   - So only important inputs (such as intermediate results, or user data that is already cached) should be persisted, or an RDD on which more than one action is performed.
   
   - Avoid calling persist in a loop.
   
   - The benefit of persisting should be meaningful enough to outweigh the overhead.
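A minimal, hypothetical Scala sketch of the points above, assuming a local `SparkSession`; the names (`PersistSketch`, `input`, `binned`) and the data are invented for illustration and are not taken from the PR:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PersistSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in input: (score, weight) pairs.
    val input = sc.parallelize(Seq((0.9, 1.0), (0.4, 2.0), (0.9, 0.5), (0.1, 1.0)))

    // Intermediate result produced by a shuffle and reused by two actions below.
    val binned = input.reduceByKey(_ + _)

    // Persist only because more than one action follows; a single action
    // would not justify the memory/CPU overhead.
    binned.persist(StorageLevel.MEMORY_ONLY)

    val numBins = binned.count()    // action 1
    val allBins = binned.collect()  // action 2; without persist(), the shuffle would rerun here

    // Release cached blocks once the reused RDD is no longer needed.
    binned.unpersist()

    println(s"bins = $numBins, first = ${allBins.headOption}")
    spark.stop()
  }
}
```

Without the `persist()`, `count()` and `collect()` would each rerun the shuffle behind `reduceByKey`; with it, the second action reads the cached blocks, and `unpersist()` frees them once they are no longer needed.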





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-13 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345922235
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ##
 @@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
     confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)
 
 Review comment:
   Oh, okay. One question here: will it be worth persisting `counts`, since the actions `count` and `collect` are applied directly on it?





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-13 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345908753
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ##
 @@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
     confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)
 
 Review comment:
   I think `binnedWeights` is required to be persisted because more than one action is applied here (a sketch follows below):
   
   `binnedWeights`
      |
      | `sortByKey` (action)
      V
   `counts`
      |
      | `count` (action)
      V
   `binnedCounts` (on which the action `collect` is applied to compute `agg`)
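For reference, a simplified, hypothetical stand-in for the lineage sketched above; it uses `reduceByKey` and plain `Double` sums instead of `combineByKey` with `BinaryLabelCounter`, and invented data, so it mirrors the guard from the diff rather than being the actual `BinaryClassificationMetrics` code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

    // Stand-in for scoreLabelsWeight: (score, (label, weight)).
    val scoreLabelsWeight: RDD[(Double, (Double, Double))] =
      sc.parallelize(Seq(0.9 -> (1.0, 1.0), 0.4 -> (0.0, 2.0), 0.9 -> (1.0, 0.5)))

    // Shuffle that produces the per-score bins (combineByKey in the real code).
    val binnedWeights = scoreLabelsWeight
      .mapValues { case (label, weight) => label * weight }
      .reduceByKey(_ + _)

    // Same guard as in the diff: persist only if the caller cached its input.
    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
      binnedWeights.persist()
    }

    val counts = binnedWeights.sortByKey(ascending = false)
    val numBins = counts.count()  // first job over this lineage
    val agg = counts.collect()    // second job; reuses the cached bins if persisted

    binnedWeights.unpersist()
    println(s"$numBins bins, ${agg.length} collected")
    sc.stop()
  }
}
```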
   





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-13 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345910113
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ##
 @@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
     confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)
 
 Review comment:
   I might be wrong here. Kindly correct me, @srowen.





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-13 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345890498
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ##
 @@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
     confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)
 
 Review comment:
   `binnedCounts` is a child RDD of `binnedWeights`, and here one action, `sortByKey`, is performed on `binnedWeights`.





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-13 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345880731
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ##
 @@ -215,6 +220,7 @@ class BinaryClassificationMetrics @Since("3.0.0") (
     val partitionwiseCumulativeCounts =
       agg.scanLeft(new BinaryLabelCounter())((agg, c) => agg.clone() += c)
     val totalCount = partitionwiseCumulativeCounts.last
+    binnedWeights.unpersist()
 
 Review comment:
   cc @srowen 





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-13 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345879858
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ##
 @@ -215,6 +220,7 @@ class BinaryClassificationMetrics @Since("3.0.0") (
     val partitionwiseCumulativeCounts =
       agg.scanLeft(new BinaryLabelCounter())((agg, c) => agg.clone() += c)
     val totalCount = partitionwiseCumulativeCounts.last
+    binnedWeights.unpersist()
 
 Review comment:
   cc @srowen 





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-13 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345863805
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ##
 @@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
     confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
 
 Review comment:
   Do you mean it should be unpersisted after use?





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-12 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345577403
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala
 ##
 @@ -141,8 +141,10 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
       Some(Array.fill($(numFolds))(Array.fill[Model[_]](epm.length)(null)))
     } else None
 
+    val inputRDD = dataset.toDF.rdd
+    inputRDD.persist()
 
 Review comment:
   Yes, persist is not required; the training and testing datasets are already persisted:
   
   https://github.com/apache/spark/blob/80fbc382a60973db21367a922a0fb797e5ab382d/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L147-L148
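For context, a rough sketch of the per-fold pattern the linked lines implement (paraphrased, not a verbatim excerpt of `CrossValidator`; the `runFold`, `fitAll`, and `evaluate` names are invented): each fold's training and validation DataFrames are persisted before the candidate models are fit and unpersisted afterwards, so persisting `inputRDD` on top of that would be redundant.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

object FoldPersistSketch {
  // Sketch only: fitAll and evaluate stand in for fitting the candidate models
  // and scoring them; every name here is invented for illustration.
  def runFold(
      spark: SparkSession,
      schema: StructType,
      training: RDD[Row],
      validation: RDD[Row])(
      fitAll: DataFrame => Unit,
      evaluate: DataFrame => Double): Double = {
    val trainingDataset = spark.createDataFrame(training, schema).persist()
    val validationDataset = spark.createDataFrame(validation, schema).persist()
    fitAll(trainingDataset)                  // all candidate models reuse the cached fold
    val metric = evaluate(validationDataset)
    trainingDataset.unpersist()
    validationDataset.unpersist()
    metric
  }
}
```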





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-12 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345576591
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ##
 @@ -165,13 +165,15 @@ class BinaryClassificationMetrics @Since("3.0.0") (
     confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    binnedWeights.persist()
 
 Review comment:
   The `binnedWeights` RDD needs to be persisted: `sortByKey()` is applied at line 176 and `collect()` at line 216. Also, it is not already persisted.





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-12 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345575129
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
 ##
 @@ -309,6 +309,7 @@ class Word2Vec extends Serializable with Logging {
   @Since("1.1.0")
   def fit[S <: Iterable[String]](dataset: RDD[S]): Word2VecModel = {
 
+    dataset.persist()
 
 Review comment:
   Again, this is user input; no need to persist.





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-12 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345573321
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
 ##
 @@ -217,6 +217,7 @@ object PrefixSpan extends Logging {
       data: RDD[Array[Array[Item]]],
       minCount: Long): Array[Item] = {
 
+    data.persist()
 
 Review comment:
   Just a user input, therefore persist is not required. Its only caller:
   
   https://github.com/apache/spark/blob/80fbc382a60973db21367a922a0fb797e5ab382d/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala#L146





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-12 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345572178
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala
 ##
 @@ -41,6 +41,7 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
    */
   @Since("1.4.0")
   def fit(sources: RDD[Vector]): PCAModel = {
+    sources.persist()
 
 Review comment:
   `sources` is not an intermediate result; it is just user input. Hence removing persist from here.
   
   https://github.com/apache/spark/blob/80fbc382a60973db21367a922a0fb797e5ab382d/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L95
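As an illustration of the convention being applied here (this is not the actual `PCA.fit` code): MLlib generally leaves caching of user-supplied input to the caller and, at most, checks the storage level and warns. A sketch with an invented object name and message:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object InputCacheCheck {
  // Sketch: warn (here via stderr; real MLlib code uses its internal logger)
  // instead of calling persist() on an RDD the user handed in.
  def warnIfUncached(sources: RDD[Vector]): Unit = {
    if (sources.getStorageLevel == StorageLevel.NONE) {
      System.err.println(
        "Input RDD is not cached; repeated passes over it may recompute its lineage.")
    }
  }
}
```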





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-12 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345571669
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
 ##
 @@ -250,6 +250,7 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
    */
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+    data.persist()
 
 Review comment:
   Again, `data` is not an intermediate result; it is just user input. Hence removing persist from here.
   
   https://github.com/apache/spark/blob/80fbc382a60973db21367a922a0fb797e5ab382d/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala#L214





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-12 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345570774
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala
 ##
 @@ -138,6 +138,7 @@ final class EMLDAOptimizer extends LDAOptimizer {
     this.docConcentration = if (docConcentration == -1) (50.0 / k) + 1.0 else docConcentration
     this.topicConcentration = if (topicConcentration == -1) 1.1 else topicConcentration
     val randomSeed = lda.getSeed
+    docs.persist()
 
 Review comment:
   `docs` is user input, hence persist is not required.
   
   https://github.com/apache/spark/blob/80fbc382a60973db21367a922a0fb797e5ab382d/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala#L331





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-12 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345569151
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
 ##
 @@ -122,6 +122,7 @@ private[spark] object RandomForest extends Logging with Serializable {
     timer.start("init")
 
     val retaggedInput = input.retag(classOf[Instance])
+    retaggedInput.persist()
 
 Review comment:
   > We add these internal persist calls for important partial results though, not user inputs, in general, and only if the user input was cached.
   
   `retaggedInput` is just transformed user data (`input`), and further, its partial result (`baggedInput`) is persisted at line 158.
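A hypothetical sketch of that point: when a derived RDD (the analogue of `baggedInput`) is what the iterative training work reuses, persisting it alone is enough, and caching the lightly transformed input as well would mostly add cache pressure. All names and the loop below are invented:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object DerivedPersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

    val input = sc.parallelize(1 to 1000)
    val retagged = input.map(_.toDouble)                     // cheap per-record step: no persist
    val bagged = retagged.map(x => (x, Array.fill(3)(1.0)))  // representation reused every iteration

    bagged.persist(StorageLevel.MEMORY_AND_DISK)
    (1 to 3).foreach { _ =>                                  // stands in for per-iteration training jobs
      bagged.map { case (x, weights) => x * weights.sum }.sum()
    }
    bagged.unpersist()
    sc.stop()
  }
}
```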





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-12 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345268698
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
 ##
 @@ -122,6 +122,7 @@ private[spark] object RandomForest extends Logging with Serializable {
     timer.start("init")
 
     val retaggedInput = input.retag(classOf[Instance])
+    retaggedInput.persist()
 
 Review comment:
   Here `retaggedInput` is not required to be persisted, since the only action applied to it is in `buildMetadata()`, and `baggedInput` is already persisted.





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-12 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345262341
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
 ##
 @@ -927,6 +927,7 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
     require(intermediateRDDStorageLevel != StorageLevel.NONE,
       "ALS is not designed to run without persisting intermediate RDDs.")
 
+    ratings.persist()
 
 Review comment:
   Here also persist is not required, and `blockRatings`, which is a child RDD, is already persisted.





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-11 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r344718067
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala
 ##
 @@ -141,8 +141,10 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
       Some(Array.fill($(numFolds))(Array.fill[Model[_]](epm.length)(null)))
     } else None
 
+    val inputRDD = dataset.toDF.rdd
+    inputRDD.persist()
 
 Review comment:
   I think persist is required here? The training and testing datasets depend on it, and in the subsequent code those datasets are used repeatedly.





[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD

2019-11-10 Thread GitBox
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] 
Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r344481010
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala
 ##
 @@ -141,8 +141,10 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
       Some(Array.fill($(numFolds))(Array.fill[Model[_]](epm.length)(null)))
     } else None
 
+    val inputRDD = dataset.toDF.rdd
+    inputRDD.persist()
 
 Review comment:
   > Persisting intermediate results is not always good
   
   Kind request: can you explain a case for this? And can it be solved by changing the storage level (see the sketch below)?
   One case where persisting an intermediate result would be inefficient is when the dataset is larger than memory.
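A hypothetical sketch of the storage-level angle: with the default `MEMORY_ONLY`, partitions that do not fit in memory are simply not cached and are recomputed when needed again, whereas `MEMORY_AND_DISK` spills them to local disk. The data and names below are invented:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().master("local[*]").appName("StorageLevelSketch")
      .getOrCreate().sparkContext

    // Stand-in for an intermediate result that may be larger than executor memory.
    val intermediate = sc.parallelize(1 to 1000000).map(i => (i % 10, i.toLong))

    // Disk-backed level: blocks that don't fit in memory are spilled, not recomputed.
    intermediate.persist(StorageLevel.MEMORY_AND_DISK)

    val total = intermediate.count()                        // materializes the cache
    val perKey = intermediate.reduceByKey(_ + _).collect()  // reuses cached/spilled blocks

    intermediate.unpersist()
    println(s"$total rows, ${perKey.length} keys")
    sc.stop()
  }
}
```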

