[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345934019

File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

@@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
       confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)

Review comment:
   TYSM @srowen. Looking forward to more learning opportunities.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
With regards,
Apache Git Services
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
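The merged change above follows a simple convention: the library caches its own intermediate shuffle result only when the caller has already cached the input, and it releases that cache once the dependent jobs are done. A minimal, self-contained sketch of the convention (hypothetical data and names, not the PR's code; assumes Spark on the classpath and a local master):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ConditionalPersistSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    val sc = spark.sparkContext

    val input = sc.parallelize(Seq((0.9, 1.0), (0.1, 0.0), (0.9, 0.0)))
    input.persist() // the caller opted in to caching

    // Intermediate shuffle result that later feeds several jobs.
    val binned = input.reduceByKey(_ + _)

    // The pattern from the PR: cache the intermediate only when the user
    // cached the input, so the library never spends memory uninvited.
    if (input.getStorageLevel != StorageLevel.NONE) {
      binned.persist()
    }

    val total = binned.count()                               // first job materializes (and caches) `binned`
    val top = binned.sortByKey(ascending = false).collect()  // second job reuses the cache

    binned.unpersist() // release executor memory once both jobs are done
    spark.stop()
  }
}
```

With an uncached input, the `if` simply skips the `persist()` and both jobs recompute `binned` from lineage, which is the cheaper default when memory is scarce.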
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345933074

File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

@@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
       confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)

Review comment:
   > are you sure it makes a difference meaningful enough to overcome the overhead?

   I think not. Persisting `counts` doesn't make sense here; it would just be overhead. I am now getting a clearer picture of where to use persist. Key learnings from this PR about persist:
   - persist introduces memory and CPU overhead;
   - so only important inputs (such as intermediate results, or user data that is already cached) should be persisted, or an RDD on which more than one action is performed;
   - avoid calling persist in a loop;
   - the benefit of persisting should be meaningful enough to overcome the overhead.
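The second learning above — persist an RDD only when more than one action runs over it — can be made concrete with a hypothetical sketch (assumes Spark on the classpath and a local master): an accumulator counts how many times a `map` function actually executes, with and without caching.

```scala
import org.apache.spark.sql.SparkSession

object RecomputationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    val sc = spark.sparkContext

    // Without persist: each action replays the whole lineage.
    val evalsUncached = sc.longAccumulator("uncached map invocations")
    val uncached = sc.parallelize(1 to 1000, 2).map { x => evalsUncached.add(1); x * 2 }
    uncached.count()
    uncached.sum()
    println(evalsUncached.value) // 2000: the map ran once per element, per action

    // With persist: the first action computes and caches, the second reads the cache.
    val evalsCached = sc.longAccumulator("cached map invocations")
    val cached = sc.parallelize(1 to 1000, 2).map { x => evalsCached.add(1); x * 2 }.persist()
    cached.count()
    cached.sum()
    println(evalsCached.value) // 1000: the second action did not re-run the map
    cached.unpersist()

    spark.stop()
  }
}
```

This is also why persisting `counts` above buys nothing: the actions applied to it each consume it once, so there is no recomputation for a cache to avoid.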
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345922235

File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

@@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
       confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)

Review comment:
   Oh, okay. One question here: will it be worth persisting `counts`, since the actions `count` and `collect` are applied directly on it?
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345908753

File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

@@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
       confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)

Review comment:
   I think `binnedWeights` is required to be persisted because more than one action is getting applied here:

   binnedWeights
     |  `sortByKey` (action)
     v
   counts
     |  `count` (action)
     v
   binnedCounts (on which the action `collect` is applied to compute `agg`)
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345910113

File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

@@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
       confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)

Review comment:
   I might be wrong here. Kindly correct me @srowen
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345890498

File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

@@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
       confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)

Review comment:
   `binnedCounts` is a child RDD of `binnedWeights`, and here one action, `sortByKey`, is performed on `binnedWeights`.
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345880731

File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

@@ -215,6 +220,7 @@ class BinaryClassificationMetrics @Since("3.0.0") (
     val partitionwiseCumulativeCounts =
       agg.scanLeft(new BinaryLabelCounter())((agg, c) => agg.clone() += c)
     val totalCount = partitionwiseCumulativeCounts.last
+    binnedWeights.unpersist()

Review comment:
   cc @srowen
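The `unpersist()` placement above releases executor memory once the cached RDD has served its last dependent job. A defensive variant of the same lifecycle, sketched with hypothetical data (assumes Spark on the classpath and a local master), bounds the cached region with try/finally so the cache is dropped even if a job fails:

```scala
import org.apache.spark.sql.SparkSession

object UnpersistLifecycleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    val sc = spark.sparkContext

    val binned = sc.parallelize(Seq((0.3, 1L), (0.7, 2L), (0.3, 3L))).reduceByKey(_ + _)
    binned.persist()
    try {
      // Every job that reuses the cached RDD runs inside the guarded region.
      val n = binned.count()
      val sorted = binned.sortByKey(ascending = false).collect()
      println(s"$n bins, top score ${sorted.head._1}")
    } finally {
      // blocking = false: don't make the caller wait for blocks to be dropped.
      binned.unpersist(blocking = false)
    }
    spark.stop()
  }
}
```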
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345879858

File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

@@ -215,6 +220,7 @@ class BinaryClassificationMetrics @Since("3.0.0") (
     val partitionwiseCumulativeCounts =
       agg.scanLeft(new BinaryLabelCounter())((agg, c) => agg.clone() += c)
     val totalCount = partitionwiseCumulativeCounts.last
+    binnedWeights.unpersist()

Review comment:
   cc @srowen
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345863805

File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

@@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") (
       confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()

Review comment:
   Do you mean it should be unpersisted after use?
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345577403

File path: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

@@ -141,8 +141,10 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
       Some(Array.fill($(numFolds))(Array.fill[Model[_]](epm.length)(null)))
     } else None
+    val inputRDD = dataset.toDF.rdd
+    inputRDD.persist()

Review comment:
   Yes, persist is not required here: the training and testing datasets are already persisted.
   https://github.com/apache/spark/blob/80fbc382a60973db21367a922a0fb797e5ab382d/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L147-L148
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345576591

File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

@@ -165,13 +165,15 @@ class BinaryClassificationMetrics @Since("3.0.0") (
       confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
     // Create a bin for each distinct score value, count weighted positives and
     // negatives within each bin, and then sort by score values in descending order.
-    val counts = scoreLabelsWeight.combineByKey(
+    val binnedWeights = scoreLabelsWeight.combineByKey(
       createCombiner = (labelAndWeight: (Double, Double)) =>
         new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
       mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
         c += (labelAndWeight._1, labelAndWeight._2),
       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    binnedWeights.persist()

Review comment:
   The `binnedWeights` RDD needs to be persisted: `sortByKey()` is applied at line 176 and `collect()` at line 216. Also, it is not already persisted.
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345575129

File path: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

@@ -309,6 +309,7 @@ class Word2Vec extends Serializable with Logging {
   @Since("1.1.0")
   def fit[S <: Iterable[String]](dataset: RDD[S]): Word2VecModel = {
+    dataset.persist()

Review comment:
   Again, this is user input. No need to persist.
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345573321

File path: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala

@@ -217,6 +217,7 @@ object PrefixSpan extends Logging {
       data: RDD[Array[Array[Item]]],
       minCount: Long): Array[Item] = {
+    data.persist()

Review comment:
   Just user input, therefore persist is not required. Its only caller:
   https://github.com/apache/spark/blob/80fbc382a60973db21367a922a0fb797e5ab382d/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala#L146
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345572178

File path: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala

@@ -41,6 +41,7 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
    */
   @Since("1.4.0")
   def fit(sources: RDD[Vector]): PCAModel = {
+    sources.persist()

Review comment:
   `sources` is not an intermediate result, it is just user input. Hence removing persist from here.
   https://github.com/apache/spark/blob/80fbc382a60973db21367a922a0fb797e5ab382d/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L95
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345571669

File path: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala

@@ -250,6 +250,7 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
    */
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+    data.persist()

Review comment:
   Again, `data` is not an intermediate result, it is just user input. Hence removing persist from here.
   https://github.com/apache/spark/blob/80fbc382a60973db21367a922a0fb797e5ab382d/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala#L214
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345570774

File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala

@@ -138,6 +138,7 @@ final class EMLDAOptimizer extends LDAOptimizer {
     this.docConcentration = if (docConcentration == -1) (50.0 / k) + 1.0 else docConcentration
     this.topicConcentration = if (topicConcentration == -1) 1.1 else topicConcentration
     val randomSeed = lda.getSeed
+    docs.persist()

Review comment:
   `docs` is user input, hence persist is not required.
   https://github.com/apache/spark/blob/80fbc382a60973db21367a922a0fb797e5ab382d/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala#L331
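Several of the rejected hunks above added `persist()` on the caller's own RDD inside `fit()`. The convention the review converges on is the opposite: caching user input is the user's decision, since only the caller knows whether memory is available and whether the RDD will be reused elsewhere. A hedged sketch of the library-side alternative (a hypothetical estimator, not MLlib code) that detects an uncached input and warns instead of persisting:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical estimator: reads `data` in several passes but never persists it.
object WarnInsteadOfPersist {
  def fit(data: RDD[Double]): (Double, Long) = {
    if (data.getStorageLevel == StorageLevel.NONE) {
      // Leave the caching decision (and its memory cost) to the caller.
      println("WARN: input RDD is not cached; multiple passes will recompute it")
    }
    (data.sum(), data.count()) // two jobs over the same input
  }
}
```

A caller who cares about the repeated passes simply calls `data.persist()` before `fit(data)` and `data.unpersist()` after, keeping ownership of the cache where the knowledge is.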
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345569151

File path: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala

@@ -122,6 +122,7 @@ private[spark] object RandomForest extends Logging with Serializable {
     timer.start("init")

     val retaggedInput = input.retag(classOf[Instance])
+    retaggedInput.persist()

Review comment:
   > We add these internal persist calls for important partial results though, not user inputs, in general, and only if the user input was cached.

   `retaggedInput` is just transformed user data (`input`), and furthermore its partial result (`baggedInput`) is persisted at line 158.
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345268698

## File path: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala

@@ -122,6 +122,7 @@ private[spark] object RandomForest extends Logging with Serializable {
     timer.start("init")
     val retaggedInput = input.retag(classOf[Instance])
+    retaggedInput.persist()

Review comment: Here `retaggedInput` does not need to be persisted, since the only action applied to it is in `buildMetadata()`, and `baggedInput` is already persisted.
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r345262341

## File path: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

@@ -927,6 +927,7 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
     require(intermediateRDDStorageLevel != StorageLevel.NONE,
       "ALS is not designed to run without persisting intermediate RDDs.")
+    ratings.persist()

Review comment: Persist is not required here either; `blockRatings`, a child RDD of `ratings`, is already persisted.
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r344718067

## File path: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

@@ -141,8 +141,10 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
       Some(Array.fill($(numFolds))(Array.fill[Model[_]](epm.length)(null)))
     } else None
+    val inputRDD = dataset.toDF.rdd
+    inputRDD.persist()

Review comment: I think persist is required here, since the training and testing datasets are derived from `inputRDD` and are used repeatedly in the code that follows.
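For context on that reuse: CrossValidator splits the input via MLUtils.kFold, and each of the numFolds (training, validation) pairs is derived from the same input RDD, so without caching the upstream lineage is recomputed once per fold. A rough sketch, simplified from the actual CrossValidator code:

```scala
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Each (training, validation) pair below is derived from inputRDD, so
// every fold triggers another traversal of it; persisting inputRDD avoids
// recomputing the upstream DataFrame lineage numFolds times.
def foldSplits(inputRDD: RDD[Row], numFolds: Int, seed: Int): Array[(RDD[Row], RDD[Row])] =
  MLUtils.kFold(inputRDD, numFolds, seed)
```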
[GitHub] [spark] amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
amanomer commented on a change in pull request #26454: [SPARK-29818][MLLIB] Missing persist on RDD
URL: https://github.com/apache/spark/pull/26454#discussion_r344481010

## File path: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

@@ -141,8 +141,10 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
       Some(Array.fill($(numFolds))(Array.fill[Model[_]](epm.length)(null)))
     } else None
+    val inputRDD = dataset.toDF.rdd
+    inputRDD.persist()

Review comment:
> Persisting intermediate results is not always good

Could you explain a case for this? And can it be solved by changing the storage level? One case where persisting an intermediate result would be inefficient is when the dataset is larger than available memory.
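On the storage-level question raised above, one option for larger-than-memory data is MEMORY_AND_DISK, which spills overflow partitions to local disk instead of evicting them, so reuse never forces a full recomputation. A sketch (the helper name is illustrative, not from the PR):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Illustrative only: for data that may not fit in executor memory,
// MEMORY_AND_DISK writes overflow partitions to local disk rather than
// dropping them, trading disk I/O for avoided recomputation on reuse.
def persistSpillable[T](rdd: RDD[T]): RDD[T] =
  rdd.persist(StorageLevel.MEMORY_AND_DISK)
```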