[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22894#discussion_r231122195 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -149,7 +150,7 @@ private[spark] class HighlyCompressedMapStatus private ( private[this] var numNonEmptyBlocks: Int, private[this] var emptyBlocks: RoaringBitmap, private[this] var avgSize: Long, -private var hugeBlockSizes: Map[Int, Byte]) +private[this] var hugeBlockSizes: mutable.Map[Int, Byte]) --- End diff -- It's mutated though, and needs to be mutable. If it were exposed outside the class, or there was significant danger of accidentally mutating it elsewhere, I think it might be necessary to wrap the result in an immutable wrapper, but here this seems OK to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22894#discussion_r231121793 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -189,13 +188,12 @@ private[spark] class HighlyCompressedMapStatus private ( emptyBlocks.readExternal(in) avgSize = in.readLong() val count = in.readInt() -val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]() +hugeBlockSizes = mutable.Map.empty[Int, Byte] (0 until count).foreach { _ => val block = in.readInt() val size = in.readByte() - hugeBlockSizesArray += Tuple2(block, size) + hugeBlockSizes.asInstanceOf[mutable.Map[Int, Byte]].update(block, size) --- End diff -- Yes, it is a mutable map and used as a mutable map. Its type must reflect that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22950: [MINOR] Fix typos and misspellings
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22950 Merged to master/2.4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17086: [SPARK-24101][ML][MLLIB] ML Evaluators should use...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/17086#discussion_r230949468 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.scala --- @@ -67,6 +68,10 @@ class MulticlassClassificationEvaluator @Since("1.5.0") (@Since("1.5.0") overrid @Since("1.5.0") def setLabelCol(value: String): this.type = set(labelCol, value) + /** @group setParam */ + @Since("2.4.0") --- End diff -- Will need to be 3.0.0 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17086: [SPARK-24101][ML][MLLIB] ML Evaluators should use...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/17086#discussion_r230949719 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala --- @@ -27,10 +27,17 @@ import org.apache.spark.sql.DataFrame /** * Evaluator for multiclass classification. * - * @param predictionAndLabels an RDD of (prediction, label) pairs. + * @param predAndLabelsWithOptWeight an RDD of (prediction, label, weight) or + * (prediction, label) pairs. */ @Since("1.1.0") -class MulticlassMetrics @Since("1.1.0") (predictionAndLabels: RDD[(Double, Double)]) { +class MulticlassMetrics @Since("2.4.0") (predAndLabelsWithOptWeight: RDD[_]) { --- End diff -- Although I think we're not updating .mllib much at all now, I think this is a simple and backwards-compatible change, so I think it's OK. It is the implementation behind the .ml version anyway. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17086: [SPARK-24101][ML][MLLIB] ML Evaluators should use...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/17086#discussion_r230949430 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala --- @@ -18,10 +18,14 @@ package org.apache.spark.mllib.evaluation import org.apache.spark.SparkFunSuite -import org.apache.spark.mllib.linalg.Matrices +import org.apache.spark.ml.linalg.Matrices +import org.apache.spark.ml.util.TestingUtils._ import org.apache.spark.mllib.util.MLlibTestSparkContext class MulticlassMetricsSuite extends SparkFunSuite with MLlibTestSparkContext { + + val delta = 1e-7 --- End diff -- `private`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22087: [SPARK-25097][ML] Support prediction on single in...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22087#discussion_r230948830 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala --- @@ -117,7 +117,8 @@ class BisectingKMeansModel private[ml] ( validateAndTransformSchema(schema) } - private[clustering] def predict(features: Vector): Int = parentModel.predict(features) + @Since("2.4.0") --- End diff -- This would have to be 3.0.0. I don't see a good reason not to expose this. CCing also maybe @mgaido91 ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22914 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22940: [MINOR][R] Rename SQLUtils name to RSQLUtils in R
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22940 I'm neutral on it... seems like a logical change but is there any issue (like ambiguous imports that are annoying) other than not matching the pattern? I am not super concerned about back-porting conflicts, but still it's a non-trivial concern. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r230784165 --- Diff: R/pkg/R/generics.R --- @@ -748,7 +748,7 @@ setGeneric("add_months", function(y, x) { standardGeneric("add_months") }) #' @rdname column_aggregate_functions #' @name NULL -setGeneric("approxCountDistinct", function(x, ...) { standardGeneric("approxCountDistinct") }) +setGeneric("approx_count_distinct", function(x, ...) { standardGeneric("approx_count_distinct") }) --- End diff -- I think my comment didn't get connected to this one -- @felixcheung what do you think about the argument that this almost surely was meant to be deprecated along with counterparts in Scala/Python? leaving them in would make this inconsistent. As the degrees, radians, and approxCountDistinct are reasonably niche and have a direct replacement that's compatible with older versions, I feel like this is OK for 3.0? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22950: [MINOR] Fix typos and misspellings
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/22950 [MINOR] Fix typos and misspellings ## What changes were proposed in this pull request? Fix typos and misspellings, per https://github.com/apache/spark-website/pull/158#issuecomment-435790366 ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark Typos Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22950.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22950 commit 8ebc3c038b72db726868b005ebb7ee5cca78cdbd Author: Sean Owen Date: 2018-11-05T15:02:34Z Fix typos and misspellings --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22931: [SPARK-25930][K8s] Fix scala string detection in k8s tes...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22931 Merged to master/2.4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22921 Yeah, it's a good point that these weren't deprecated, but I assume they should have been. Same change, same time, same logic. Given that it's a reasonably niche method, I thought it would be best to go ahead and be consistent here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r230568378 --- Diff: R/pkg/R/functions.R --- @@ -319,23 +319,23 @@ setMethod("acos", }) #' @details -#' \code{approxCountDistinct}: Returns the approximate number of distinct items in a group. +#' \code{approx_count_distinct}: Returns the approximate number of distinct items in a group. #' #' @rdname column_aggregate_functions -#' @aliases approxCountDistinct approxCountDistinct,Column-method +#' @aliases approx_count_distinct approx_count_distinct,Column-method #' @examples #' #' \dontrun{ -#' head(select(df, approxCountDistinct(df$gear))) -#' head(select(df, approxCountDistinct(df$gear, 0.02))) +#' head(select(df, approx_count_distinct(df$gear))) +#' head(select(df, approx_count_distinct(df$gear, 0.02))) #' head(select(df, countDistinct(df$gear, df$cyl))) #' head(select(df, n_distinct(df$gear))) #' head(distinct(select(df, "gear")))} -#' @note approxCountDistinct(Column) since 1.4.0 -setMethod("approxCountDistinct", +#' @note approx_count_distinct(Column) since 2.0.0 --- End diff -- Right, will fix that one too if I missed it, per https://github.com/apache/spark/pull/22921#discussion_r230449173 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22933: [SPARK-25933][DOCUMENTATION] Fix pstats.Stats() referenc...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22933 Merged to master/2.4/2.3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22893 OK, the Spark part doesn't seem relevant. The input might be more realistic here, yes. I was commenting that your test code doesn't show what you're testing, though I understand you manually modified it. Because the test is so central here I think it's important to understand exactly what you're measuring and exactly what you're running. This doesn't show an improvement, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22894#discussion_r230556818 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -189,13 +188,12 @@ private[spark] class HighlyCompressedMapStatus private ( emptyBlocks.readExternal(in) avgSize = in.readLong() val count = in.readInt() -val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]() +hugeBlockSizes = mutable.Map.empty[Int, Byte] (0 until count).foreach { _ => val block = in.readInt() val size = in.readByte() - hugeBlockSizesArray += Tuple2(block, size) + hugeBlockSizes.asInstanceOf[mutable.Map[Int, Byte]].update(block, size) --- End diff -- Why cast it? it is used as a mutable map and its type is a mutable map, so the type on line 151 is wrong. Also, just `hugeBlockSizes(block) = size`, no? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
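For reference, a minimal sketch of the pattern being suggested in the two comments above (the field and value names follow the diff; the enclosing class is omitted and the helper name `readHugeBlockSizes` is hypothetical). Once the field is typed as `mutable.Map[Int, Byte]`, the deserialization loop can update it directly, with no `asInstanceOf` cast and no intermediate `ArrayBuffer`:

```scala
import java.io.ObjectInput
import scala.collection.mutable

// Field declared with the mutable type it is actually used as.
var hugeBlockSizes: mutable.Map[Int, Byte] = mutable.Map.empty[Int, Byte]

def readHugeBlockSizes(in: ObjectInput): Unit = {
  val count = in.readInt()
  hugeBlockSizes = mutable.Map.empty[Int, Byte]
  (0 until count).foreach { _ =>
    val block = in.readInt()
    val size = in.readByte()
    hugeBlockSizes(block) = size  // sugar for hugeBlockSizes.update(block, size); no cast needed
  }
}
```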
[GitHub] spark issue #22934: [BUILD] Close stale PRs
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22934 Maybe add https://github.com/apache/spark/pull/22849 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22855: [SPARK-25839] [Core] Implement use of KryoPool in KryoSe...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22855 Oops, something to do with the benchmark class: ``` [error] /home/jenkins/workspace/NewSparkPullRequestBuilder/core/src/test/scala/org/apache/spark/serializer/KryoSerializerBenchmark.scala:41: object creation impossible, since method runBenchmarkSuite in class BenchmarkBase of type (mainArgs: Array[String])Unit is not defined [error] object KryoSerializerBenchmark extends BenchmarkBase { [error]^ [error] /home/jenkins/workspace/NewSparkPullRequestBuilder/core/src/test/scala/org/apache/spark/serializer/KryoSerializerBenchmark.scala:45: method runBenchmarkSuite overrides nothing. [error] Note: the super classes of object KryoSerializerBenchmark contain the following, non final members named runBenchmarkSuite: [error] def runBenchmarkSuite(mainArgs: Array[String]): Unit [error] override def runBenchmarkSuite(): Unit = { [error]^ ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
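The error says the benchmark object still overrides a zero-argument `runBenchmarkSuite()` while `BenchmarkBase` now declares `runBenchmarkSuite(mainArgs: Array[String])`. A sketch of the fix implied by the message (the benchmark body is elided):

```scala
object KryoSerializerBenchmark extends BenchmarkBase {
  // Signature must match the base class; mainArgs can simply be ignored if unused.
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    // ... existing benchmark cases ...
  }
}
```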
[GitHub] spark issue #22892: [SPARK-25884][SQL] Add TBLPROPERTIES and COMMENT, and us...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22892 @cloud-fan @ueshin I'm not sure, but I'm seeing this failure regularly in master builds and I wonder if this could be the cause? in two builds it started failing with this (and another) change: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5561/testReport/ https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/574/testReport/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22864: [SPARK-25861][Minor][WEBUI] Remove unused refreshInterva...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22864 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22931: [SPARK-25930][K8s] Fix scala string detection in k8s tes...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22931 OK. Does this need to go in branch 2.4 too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r230449544 --- Diff: python/pyspark/sql/functions.py --- @@ -275,15 +273,6 @@ def _(): del _name, _doc -@since(1.3) -def approxCountDistinct(col, rsd=None): --- End diff -- Yeah, in some cases the deprecated user-facing method was named the same way as some internal method and I changed the wrong one. I'll investigate. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r230449173 --- Diff: R/pkg/R/functions.R --- @@ -1641,30 +1641,30 @@ setMethod("tanh", }) #' @details -#' \code{toDegrees}: Converts an angle measured in radians to an approximately equivalent angle +#' \code{degrees}: Converts an angle measured in radians to an approximately equivalent angle #' measured in degrees. #' #' @rdname column_math_functions -#' @aliases toDegrees toDegrees,Column-method -#' @note toDegrees since 1.4.0 -setMethod("toDegrees", +#' @aliases degrees degrees,Column-method +#' @note degrees since 2.1.0 --- End diff -- degrees was added in Scala/Python in 2.1.0, which is what I was thinking of, but yeah really this must be since 3.0.0 right? I'll fix it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r230436963 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -279,27 +254,6 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { structDf.select(hash($"a", $"record.*"))) } - test("Star Expansion - explode should fail with a meaningful message if it takes a star") { --- End diff -- Yeah, I'll try to bring back the test case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22893 So the pull request right now doesn't reflect what you tested, but you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and sqdist is faster than a dot product. This doesn't make sense mathematically as it should be more math, but stranger things have happened. Still, I don't follow your test code here. You parallelize one vector, map it, collect it: why use Spark? and it's the same vector over and over, and it's not a big vector. Your sparse vectors aren't very sparse. How about more representative input -- larger vectors (100s of elements, probably), more sparse sparse vectors, and a large set of different inputs. I also don't see where the precision bound is changed here? This may be a good change but I'm just not yet convinced by the test methodology, and the result still doesn't make much intuitive sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
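For what it's worth, a sketch of the kind of measurement being asked for here: purely local (no `parallelize`/`collect`), larger vectors, genuinely sparse sparse vectors, and many distinct inputs. Sizes, sparsity, and the hand-rolled `dot` are illustrative assumptions, not the PR's code; in practice the loops should be warmed up and repeated.

```scala
import scala.util.Random
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val rnd = new Random(42)
val dim = 500
val dense = Array.fill(1000)(Vectors.dense(Array.fill(dim)(rnd.nextGaussian())))
val sparse = Array.fill(1000) {
  val idx = (0 until dim).filter(_ => rnd.nextDouble() < 0.05).toArray  // ~5% non-zeros
  Vectors.sparse(dim, idx, idx.map(_ => rnd.nextGaussian()))
}

// Simple reference dot product, since the internal BLAS helpers are not public.
def dot(a: Vector, b: Vector): Double = {
  val (x, y) = (a.toArray, b.toArray)
  var s = 0.0; var i = 0
  while (i < x.length) { s += x(i) * y(i); i += 1 }
  s
}

def time[T](label: String)(f: => T): T = {
  val t0 = System.nanoTime(); val r = f
  println(s"$label: ${(System.nanoTime() - t0) / 1e6} ms"); r
}

time("sqdist dense-dense")   { dense.foreach(v => Vectors.sqdist(v, dense(0))) }
time("dot dense-dense")      { dense.foreach(v => dot(v, dense(0))) }
time("sqdist sparse-sparse") { sparse.foreach(v => Vectors.sqdist(v, sparse(0))) }
time("sqdist sparse-dense")  { sparse.foreach(v => Vectors.sqdist(v, dense(0))) }
```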
[GitHub] spark issue #22922: [SPARK-25909] fix documentation on cluster managers
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22922 Merged to master/2.4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22855#discussion_r230421073 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -84,6 +85,7 @@ class KryoSerializer(conf: SparkConf) private val avroSchemas = conf.getAvroSchema // whether to use unsafe based IO for serialization private val useUnsafe = conf.getBoolean("spark.kryo.unsafe", false) + private val usePool = conf.getBoolean("spark.kryo.pool", false) --- End diff -- Yeah I think that's a fine position to take, if we can't think of a reason to disable it other than the theoretical unknown unknown bug out there. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22906: [SPARK-25895][Core]Adding testcase to compare Lz4 and Zs...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22906 Benchmarks aren't run with tests right? That's fine. I am still not sure Spark is the best place for this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22919 Tough call. At least it's worth documenting that this behavior changed because so did the Scala shell's behavior, and I'd support that. I'd also support supporting both, if there's no real downside to that option? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22924: [SPARK-25891][PYTHON] Upgrade to Py4J 0.10.8.1
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22924 Yeah I get that. I haven't heard the bugs that are fixed here impact users, but a few sound kind of bad. Maybe this should depend more on: how much do people need Python 3.7 in 2.4.x? that seems like a valid basis to back-port, if it were in demand. A 'bug fix' of a different sort, fixing incompatibilities with 3.7 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22864: [SPARK-25861][Minor][WEBUI] Remove unused refreshInterva...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22864 Jenkins add to whitelist --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22864: [SPARK-25861][Minor][WEBUI] Remove unused refreshInterva...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22864 That's a spurious failure, something's wrong with the build machine --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22922: [SPARK-25909] fix documentation on cluster manage...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22922#discussion_r230159527 --- Diff: docs/cluster-overview.md --- @@ -45,7 +45,7 @@ There are several useful things to note about this architecture: # Cluster Manager Types -The system currently supports three cluster managers: +The system currently supports four cluster managers: --- End diff -- Heh OK. Let's just say 'several' to reduce future maintenance. Anything else that can be fixed in this doc? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22758: [SPARK-25332][SQL] Instead of broadcast hash join ,Sort ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22758 I don't know this code well enough to review. I think there is skepticism from people who know this code whether this is change is correct and beneficial. If there's doubt, I think it should be closed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r230156120 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -62,17 +62,6 @@ class SQLContext private[sql](val sparkSession: SparkSession) sparkSession.sparkContext.assertNotStopped() - // Note: Since Spark 2.0 this class has become a wrapper of SparkSession, where the --- End diff -- Makes sense, will add it back. Yeah will leave this open a short while to make sure there is time to comment. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r230155788 --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala --- @@ -639,20 +639,6 @@ private[spark] object SparkConf extends Logging { */ private val deprecatedConfigs: Map[String, DeprecatedConfig] = { val configs = Seq( - DeprecatedConfig("spark.cache.class", "0.8", --- End diff -- Yeah I can add them back. Wasn't sure whether they are still valuable or just old. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22924: [SPARK-25891][PYTHON] Upgrade to Py4J 0.10.8.1
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22924 I think it's OK to backport, as 3.7 support is actually fairly important, and this also fixes bugs: https://www.py4j.org/changelog.html It drops 2.6 support, but we did so as well a long time ago, in 2.2.0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22909: [SPARK-25897][k8s] Hook up k8s integration tests to sbt ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22909 I don't know SBT much myself. I would believe this works. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22914 Yes if the behavior is consistent across pages that sounds like fine behavior. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22908: [MONOR][SQL] Replace all TreeNode's node name in ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22908#discussion_r230060094 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -56,7 +56,7 @@ case class DataSourceV2Relation( override def pushedFilters: Seq[Expression] = Seq.empty - override def simpleString: String = "RelationV2 " + metadataString + override def simpleString: String = s"$nodeName " + metadataString --- End diff -- This uses interpolation and concatenation - just interpolate --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
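That is, the one-line form being suggested:

```scala
override def simpleString: String = s"$nodeName $metadataString"
```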
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r230057306 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala --- @@ -246,7 +246,7 @@ class MathExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { test("toDegrees") { testUnary(ToDegrees, math.toDegrees) -checkConsistencyBetweenInterpretedAndCodegen(Acos, DoubleType) +checkConsistencyBetweenInterpretedAndCodegen(ToDegrees, DoubleType) --- End diff -- This was just a bug fix I spotted --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r230057481 --- Diff: R/pkg/R/generics.R --- @@ -748,7 +748,7 @@ setGeneric("add_months", function(y, x) { standardGeneric("add_months") }) #' @rdname column_aggregate_functions #' @name NULL -setGeneric("approxCountDistinct", function(x, ...) { standardGeneric("approxCountDistinct") }) +setGeneric("approx_count_distinct", function(x, ...) { standardGeneric("approx_count_distinct") }) --- End diff -- @felixcheung might want to check if I'm handling these R changes correctly --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/22921 [SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3 ## What changes were proposed in this pull request? - Remove some AccumulableInfo .apply() methods - Remove non-label-specific multiclass precision/recall/fScore in favor of accuracy - Remove toDegrees/toRadians in favor of degrees/radians - Remove approxCountDistinct in favor of approx_count_distinct - Remove unused Python StorageLevel constants - Remove Dataset unionAll in favor of union - Remove unused multiclass option in libsvm parsing - Remove references to deprecated spark configs like spark.yarn.am.port - Remove TaskContext.isRunningLocally - Remove ShuffleMetrics.shuffle* methods - Remove BaseReadWrite.context in favor of session - Remove Column.!== in favor of =!= - Remove Dataset.explode - Remove Dataset.registerTempTable - Remove SQLContext.getOrCreate, setActive, clearActive, constructors Not touched yet - everything else in MLLib - HiveContext - Anything deprecated more recently than 2.0.0, generally ## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-25908 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22921.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22921 commit 259e7d15f302a02fbf09dbc63f46cd1d690b1c2a Author: Sean Owen Date: 2018-11-01T14:18:29Z Remove many older deprecated items in Spark 3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22813: [SPARK-25818][CORE] WorkDirCleanup should only remove th...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22813 I just think that if you have engineers randomly writing and reading stuff in this dir, a bunch of other stuff goes wrong. This is not a problem that Spark can reasonably solve. Certainly, you have much bigger production problems if this level of discipline can't be enforced. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22893 Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22876: [SPARK-25869] [YARN] the original diagnostics is ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22876#discussion_r230053125 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -293,6 +293,9 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments) extends } if (!unregistered) { + logInfo("Waiting for " + sparkConf.get("spark.yarn.report.interval", "1000").toInt +"ms to unregister am," + --- End diff -- Use interpolation. No need to call `.toInt`. am -> AM or Application master. msg -> message; these should be complete sentences. You get the config twice here. Is this the default used elsewhere? I don't know this code well and don't think a Thread.sleep is a great way to coordinate. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
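To illustrate the first few points, one possible shape for that log line (the wording is illustrative; the config key and default come from the diff, and `sparkConf`/`logInfo` are assumed to be in scope as they are in ApplicationMaster):

```scala
// Read the config once and interpolate it; no .toInt is needed just to build the message.
val reportIntervalMs = sparkConf.get("spark.yarn.report.interval", "1000")
logInfo(s"Waiting $reportIntervalMs ms before unregistering the ApplicationMaster " +
  "so that the final diagnostics can be reported.")
```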
[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22855#discussion_r230052150 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -84,6 +85,7 @@ class KryoSerializer(conf: SparkConf) private val avroSchemas = conf.getAvroSchema // whether to use unsafe based IO for serialization private val useUnsafe = conf.getBoolean("spark.kryo.unsafe", false) + private val usePool = conf.getBoolean("spark.kryo.pool", false) --- End diff -- I would not document it. This is just a safety valve. In theory, there's no reason to disable this nor would a caller know why to disable it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22914 Returning to the current or last page seems fine too. No error message really needed IMHO. However it sounds like other pages already go to the 1st page on invalid input? I'd also like to stay consistent. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22894#discussion_r229812240 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -189,13 +190,12 @@ private[spark] class HighlyCompressedMapStatus private ( emptyBlocks.readExternal(in) avgSize = in.readLong() val count = in.readInt() -val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]() +hugeBlockSizes = new util.HashMap[Int, Byte](count).asScala --- End diff -- Does calling `HashMap.sizeHint(...)` here actually help? I'd stick to the scala collection for now --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22894#discussion_r229776086 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -189,13 +190,12 @@ private[spark] class HighlyCompressedMapStatus private ( emptyBlocks.readExternal(in) avgSize = in.readLong() val count = in.readInt() -val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]() +hugeBlockSizes = new util.HashMap[Int, Byte](count).asScala --- End diff -- How about just scala's mutable Map? I'd expect it's no slower than Java's, given it might specialize for primitives (not sure about this) and sometimes has smarter implementations internally. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22855#discussion_r229738657 --- Diff: core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala --- @@ -456,9 +458,63 @@ class KryoSerializerSuite extends SparkFunSuite with SharedSparkContext { // Regression test for SPARK-7766, an issue where disabling auto-reset and enabling // reference-tracking would lead to corrupted output when serializer instances are re-used - for (referenceTracking <- Set(true, false); autoReset <- Set(true, false)) { -test(s"instance reuse with autoReset = $autoReset, referenceTracking = $referenceTracking") { - testSerializerInstanceReuse(autoReset = autoReset, referenceTracking = referenceTracking) + for { +referenceTracking <- Set(true, false) +autoReset <- Set(true, false) +usePool <- Set(true, false) + } { +test(s"instance reuse with autoReset = $autoReset, referenceTracking = $referenceTracking" + + s", usePool = $usePool") { + testSerializerInstanceReuse( +autoReset = autoReset, referenceTracking = referenceTracking, usePool = usePool) +} + } + + test("SPARK-25839 KryoPool implementation works correctly in multi-threaded environment") { +import java.util.concurrent.Executors --- End diff -- I'd import at the top of the file. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22855#discussion_r229740163 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -84,6 +85,7 @@ class KryoSerializer(conf: SparkConf) private val avroSchemas = conf.getAvroSchema // whether to use unsafe based IO for serialization private val useUnsafe = conf.getBoolean("spark.kryo.unsafe", false) + private val usePool = conf.getBoolean("spark.kryo.pool", false) --- End diff -- Yep, I'd leave this undocumented --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22855#discussion_r229739685 --- Diff: core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala --- @@ -456,9 +458,63 @@ class KryoSerializerSuite extends SparkFunSuite with SharedSparkContext { // Regression test for SPARK-7766, an issue where disabling auto-reset and enabling // reference-tracking would lead to corrupted output when serializer instances are re-used - for (referenceTracking <- Set(true, false); autoReset <- Set(true, false)) { -test(s"instance reuse with autoReset = $autoReset, referenceTracking = $referenceTracking") { - testSerializerInstanceReuse(autoReset = autoReset, referenceTracking = referenceTracking) + for { +referenceTracking <- Set(true, false) +autoReset <- Set(true, false) +usePool <- Set(true, false) + } { +test(s"instance reuse with autoReset = $autoReset, referenceTracking = $referenceTracking" + + s", usePool = $usePool") { + testSerializerInstanceReuse( +autoReset = autoReset, referenceTracking = referenceTracking, usePool = usePool) --- End diff -- Likewise i see this is just how the code was written before but the `foo = foo` style isn't adding anything IMHO. Feel free to not name args --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22855#discussion_r229738553 --- Diff: core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala --- @@ -456,9 +458,63 @@ class KryoSerializerSuite extends SparkFunSuite with SharedSparkContext { // Regression test for SPARK-7766, an issue where disabling auto-reset and enabling // reference-tracking would lead to corrupted output when serializer instances are re-used - for (referenceTracking <- Set(true, false); autoReset <- Set(true, false)) { -test(s"instance reuse with autoReset = $autoReset, referenceTracking = $referenceTracking") { - testSerializerInstanceReuse(autoReset = autoReset, referenceTracking = referenceTracking) + for { +referenceTracking <- Set(true, false) --- End diff -- Not that it matters, but I think this should have originally been Seq not Set --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22855#discussion_r229737692 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -92,6 +94,16 @@ class KryoSerializer(conf: SparkConf) new KryoOutput(bufferSize, math.max(bufferSize, maxBufferSize)) } + @transient + private lazy val factory: KryoFactory = new KryoFactory() { --- End diff -- I think you're welcome to just write ... ``` private lazy val factory = new KryoFactory() { override def create: Kryo = newKryo() } ``` but it doesn't matter. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
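For context, the usual borrow/release pattern that a `KryoPool` enables (Kryo 4.x API); this is a sketch of typical usage rather than the PR's exact code, and `newKryo()` plus the helper name `withKryo` are assumptions:

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.pool.{KryoFactory, KryoPool}

// Inside KryoSerializer these would be private lazy vals; newKryo() builds a configured Kryo.
lazy val factory: KryoFactory = new KryoFactory {
  override def create(): Kryo = newKryo()
}
lazy val pool: KryoPool = new KryoPool.Builder(factory).softReferences.build()

def withKryo[T](f: Kryo => T): T = {
  val kryo = pool.borrow()
  try f(kryo) finally pool.release(kryo)  // always hand the instance back to the pool
}
```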
[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22855#discussion_r229738794 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -30,6 +30,7 @@ import scala.util.control.NonFatal import com.esotericsoftware.kryo.{Kryo, KryoException, Serializer => KryoClassSerializer} import com.esotericsoftware.kryo.io.{Input => KryoInput, Output => KryoOutput} import com.esotericsoftware.kryo.io.{UnsafeInput => KryoUnsafeInput, UnsafeOutput => KryoUnsafeOutput} +import com.esotericsoftware.kryo.pool._ --- End diff -- I'd spell out the imports for clarity unless it's going to run more than 2 lines or something --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22855#discussion_r229737307 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -92,6 +94,16 @@ class KryoSerializer(conf: SparkConf) new KryoOutput(bufferSize, math.max(bufferSize, maxBufferSize)) } + @transient + private lazy val factory: KryoFactory = new KryoFactory() { +override def create: Kryo = { + newKryo() +} + } + + @transient + lazy val pool = new KryoPool.Builder(factory).softReferences.build --- End diff -- `private`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22723: [SPARK-25729][CORE]It is better to replace `minPa...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22723#discussion_r229720799 --- Diff: core/src/main/scala/org/apache/spark/rdd/WholeTextFileRDD.scala --- @@ -51,7 +51,7 @@ private[spark] class WholeTextFileRDD( case _ => } val jobContext = new JobContextImpl(conf, jobId) -inputFormat.setMinPartitions(jobContext, minPartitions) +inputFormat.setMinPartitions(sc, jobContext, minPartitions) --- End diff -- You don't need to pass the context. You just need to pass the one value you're going to use from it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
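In other words, something like the following on the receiving side (a sketch only; whether `defaultParallelism` should be involved at all is questioned in the next comment):

```scala
// Pass only the value the method uses, not the whole SparkContext.
def setMinPartitions(defaultParallelism: Int, context: JobContext, minPartitions: Int): Unit = {
  val files = listStatus(context).asScala
  val totalLen = files.map(file => if (file.isDirectory) 0L else file.getLen).sum
  val minPartNum = math.max(defaultParallelism, minPartitions)
  val maxSplitSize = math.ceil(totalLen * 1.0 / minPartNum).toLong
  setMaxSplitSize(maxSplitSize)  // applied via setMaxSplitSize, as the scaladoc above describes
}
```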
[GitHub] spark pull request #22723: [SPARK-25729][CORE]It is better to replace `minPa...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22723#discussion_r229721018 --- Diff: core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala --- @@ -48,11 +50,11 @@ private[spark] class WholeTextFileInputFormat * Allow minPartitions set by end-user in order to keep compatibility with old Hadoop API, * which is set through setMaxSplitSize */ - def setMinPartitions(context: JobContext, minPartitions: Int) { + def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) { val files = listStatus(context).asScala val totalLen = files.map(file => if (file.isDirectory) 0L else file.getLen).sum -val maxSplitSize = Math.ceil(totalLen * 1.0 / - (if (minPartitions == 0) 1 else minPartitions)).toLong +val minPartNum = Math.max(sc.defaultParallelism, minPartitions) +val maxSplitSize = Math.ceil(totalLen * 1.0 / minPartNum).toLong --- End diff -- Yes, this no longer matches the title or JIRA. I am also not clear on the argument why this is better? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22893 I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM). Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note. And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work. (If you did remove this check you could remove some other values that get computed to check this bound, like precision1) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22901: [SPARK-25891][PYTHON] Upgrade to Py4J 0.10.8.1
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22901 That sounds like a good idea. I wonder if it's safe to back-port to 2.4/2.3? should be if it's just a maintenance release. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22906: [SPARK-25895][Core]Adding testcase to compare Lz4 and Zs...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22906 I don't think this tests something about the correctness of Spark though. I am not sure this is worth it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22894 I also would not expect updating an immutable data structure to be faster. Building a map once from tuples at the end seems better than rebuilding a map each time. Under the hood the immutable map is going to be a HashTrieMap (a map of smaller optimized immutable maps) and its updated0 method does some clever stuff to avoid recreating the whole map. But, yeah, why immutable here to begin with? it ought to be better still to update a mutable Map. And then I am still not sure why it would be faster to keep the map invariants over this loop rather than build the map with its size known ahead of time at the end. Benchmarks are good evidence but we just need to make sure that the difference is material as used in Spark. It may well be. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
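For reference, the three construction strategies under discussion, side by side (sizes and names are illustrative; only a measurement at realistic sizes would show whether the difference matters in Spark):

```scala
import scala.collection.mutable

val pairs: Seq[(Int, Byte)] = (0 until 100000).map(i => (i, (i % 128).toByte))

// 1) Immutable map grown one entry at a time: each += rebinds the var to an updated map.
var m1 = Map.empty[Int, Byte]
pairs.foreach { case (k, v) => m1 += (k -> v) }

// 2) Mutable map updated in place.
val m2 = mutable.Map.empty[Int, Byte]
pairs.foreach { case (k, v) => m2(k) = v }

// 3) Immutable map built once from the collected tuples.
val m3 = pairs.toMap
```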
[GitHub] spark issue #22852: [SPARK-25023] Clarify Spark security documentation
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22852 I think these are good changes. In a separate PR for the versions-specific docs, we could add a similar note to https://spark.apache.org/docs/latest/spark-standalone.html as much of the security concern is around the standalone master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22784 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22849: [SPARK-25852][Core] we should filter the workOffers with...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22849 Yes, that is the comment I have been referring to. So it seems you can't filter, right? it's not scheduling work here. @jiangxb1987 do you know this part of the code? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22885: [BUILD][MINOR] release script should not interrupt by sv...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22885 Looks fine. Maybe some Maven cache somewhere has to be deleted as it's corrupt? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22856: [SPARK-25856][SQL][MINOR] Remove AverageLike and CountLi...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22856 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22813: [SPARK-25818][CORE] WorkDirCleanup should only remove th...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22813 I don't know what else goes in the work dir. It isn't valid to reuse it for anything else. Can you simply avoid using a work dir that is or has been used by something else? The argument for making this change anyway is just that the code should delete just what it writes. But I am just not sure it can be something you rely on for correct behavior --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22852: [SPARK-25023] Clarify Spark security documentation
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22852 A quick pointer to security issues in other key places sounds good. As long as it is increasing the chance users understand the specific issue and isn't more general text to skip past, it is helping --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22849: [SPARK-25852][Core] we should filter the workOffe...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22849#discussion_r228936286 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -240,7 +240,7 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized { // Filter out executors under killing val activeExecutors = executorDataMap.filterKeys(executorIsAlive) -val workOffers = activeExecutors.map { +val workOffers = activeExecutors.filter(_._2.freeCores > 0).map { --- End diff -- I'm saying there is a comment a few lines above that describes this as a fake offer that explicitly intends to contact all executors. I think we need to figure out if that's still relevant. I don't have git in front of me but check git blame or GitHub to see when this was written? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22784#discussion_r228769714 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala --- @@ -49,7 +50,16 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) { "Try reducing the parameter k for PCA, or reduce the input feature " + "vector dimension to make this tractable.") -val mat = new RowMatrix(sources) +val mat = if (numFeatures > 65535) { + val meanVector = Statistics.colStats(sources).mean --- End diff -- Ah, another small nit: what about computing `meanVector.asBreeze` only once? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22802: [SPARK-25806][SQL]The instance of FileSplit is redundant
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22802 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22784#discussion_r228724594 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala --- @@ -54,4 +55,21 @@ class PCASuite extends SparkFunSuite with MLlibTestSparkContext { // check overflowing assert(PCAUtil.memoryCost(4, 6) > Int.MaxValue) } + + test("number of features more than 65535") { +val rows = 10 +val columns = 10 +val k = 5 +val randomRDD = RandomRDDs.normalVectorRDD(sc, rows, columns, 0, 0) +val pca = new PCA(k).fit(randomRDD) +assert(pca.explainedVariance.size === 5) +assert(pca.pc.numRows === 10 && pca.pc.numCols === 5) +// Eigen values should not be negative +assert(!pca.explainedVariance.values.exists(_ < 0)) + +// Norm of the principle component should be 1.0 --- End diff -- Nit: principle -> principal --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22784#discussion_r228724541 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -384,18 +384,28 @@ class RowMatrix @Since("1.0.0") ( val n = numCols().toInt require(k > 0 && k <= n, s"k = $k out of range (0, n = $n]") -val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] +if (n > 65535) { + val svd = computeSVD(k) + val s = svd.s.toArray.map(eigValue => eigValue * eigValue / (n - 1)) --- End diff -- Right, makes sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22784#discussion_r228724515 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala --- @@ -49,7 +50,16 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) { "Try reducing the parameter k for PCA, or reduce the input feature " + "vector dimension to make this tractable.") -val mat = new RowMatrix(sources) +val mat = if (numFeatures > 65535) { + val meanVector = Statistics.colStats(sources).mean --- End diff -- Rather than call `.toArray` and `.zipped` below, can this not be written as Vector - Vector in the loop below? might be more efficient. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
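Sketch of the two suggestions above combined (this would live inside PCA.fit, where the private[spark] asBreeze/fromBreeze helpers are accessible; the value names are illustrative):

```scala
// Compute the column means once, convert to Breeze once, then center each row
// with a vector subtraction instead of zipping arrays element by element.
val meanBreeze = Statistics.colStats(sources).mean.asBreeze
val meanCentered = sources.map(row => Vectors.fromBreeze(row.asBreeze - meanBreeze))
val mat = new RowMatrix(meanCentered)
```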
[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22784#discussion_r228724667 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala --- @@ -54,4 +55,14 @@ class PCASuite extends SparkFunSuite with MLlibTestSparkContext { // check overflowing assert(PCAUtil.memoryCost(4, 6) > Int.MaxValue) } + + test("number of features more than 65500") { +val rows = 10 +val columns = 10 +val k = 5 +val randomRDD = RandomRDDs.normalVectorRDD(sc, rows, columns, 0, 0) +val pca = new PCA(k).fit(randomRDD) +assert(pca.explainedVariance.size === 5) +assert(pca.pc.numRows === 10 && pca.pc.numCols === 5) --- End diff -- Is there an easy dummy test case we can write where we know what the first PC should be? Like if you generate a bunch of vectors like (a +/- epsilon, a +/- epsilon, ...) for many values of a, the first principal component should be nearly (1,1,1...), right? Is that easy enough to add as a trivial test of the actual analysis? I think that would really prove it, though your manual test suggests it's working.
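One possible shape for such a test, inside `PCASuite` (entirely hypothetical values and tolerances, relying on the suite's existing imports and `sc`): the points lie almost exactly on the line spanned by (1, 1, ..., 1), so the first principal component should be close to (1, ..., 1) / sqrt(d) up to sign.

```scala
test("first principal component of near-collinear data") {
  val d = 5
  val rng = new scala.util.Random(42)
  val data = sc.parallelize((1 to 1000).map { _ =>
    val a = rng.nextDouble() * 10
    // each coordinate is a plus or minus a tiny epsilon
    Vectors.dense(Array.fill(d)(a + (rng.nextDouble() - 0.5) * 1e-3))
  })
  val pca = new PCA(1).fit(data)
  // pc is column-major; with k = 1 its values are exactly the first PC
  val expected = 1.0 / math.sqrt(d)
  pca.pc.values.foreach { v =>
    assert(math.abs(math.abs(v) - expected) < 0.01)
  }
}
```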
[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22784#discussion_r228724555 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala --- @@ -54,4 +55,21 @@ class PCASuite extends SparkFunSuite with MLlibTestSparkContext { // check overflowing assert(PCAUtil.memoryCost(4, 6) > Int.MaxValue) } + + test("number of features more than 65535") { +val rows = 10 +val columns = 10 +val k = 5 +val randomRDD = RandomRDDs.normalVectorRDD(sc, rows, columns, 0, 0) +val pca = new PCA(k).fit(randomRDD) +assert(pca.explainedVariance.size === 5) +assert(pca.pc.numRows === 10 && pca.pc.numCols === 5) +// Eigen values should not be negative +assert(!pca.explainedVariance.values.exists(_ < 0)) --- End diff -- You can write `.forall(_ >= 0)` too, but it doesn't matter.
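Side by side, the two equivalent assertions (the first is what the diff has, the second is the suggested form using Scala's `forall`):

```scala
assert(!pca.explainedVariance.values.exists(_ < 0))
assert(pca.explainedVariance.values.forall(_ >= 0))
```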
[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22784#discussion_r228714200 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -384,18 +384,28 @@ class RowMatrix @Since("1.0.0") ( val n = numCols().toInt require(k > 0 && k <= n, s"k = $k out of range (0, n = $n]") -val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] +if (n > 65535) { + val svd = computeSVD(k) + val s = svd.s.toArray.map(eigValue => eigValue * eigValue / (n - 1)) --- End diff -- My linear algebra is probably rusty, so check me here, but we need the eigendecomposition of the covariance matrix 1/(n-1) * mat' * mat. I get how the singular values here give these eigenvalues. Don't the singular vectors V need to be divided by n-1 too? I'm probably wrong about it, just checking, especially as we don't actually have tests for any of the output here!
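For reference, a sketch of the algebra behind this question, writing A for the mean-centred data matrix with m rows and A = U Σ Vᵀ for its SVD (a general identity, not a claim about the exact normalisation used in the diff):

```latex
\mathrm{Cov} = \frac{1}{m-1} A^{\top} A
             = \frac{1}{m-1} (U \Sigma V^{\top})^{\top} (U \Sigma V^{\top})
             = V \, \frac{\Sigma^{2}}{m-1} \, V^{\top}
```

So the 1/(m-1) factor rescales only the eigenvalues (the squared singular values); the right singular vectors V are already the eigenvectors of the covariance matrix and need no division, since scaling a symmetric matrix scales its eigenvalues but leaves its eigenvectors unchanged.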
[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22784#discussion_r228713925 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala --- @@ -54,4 +55,14 @@ class PCASuite extends SparkFunSuite with MLlibTestSparkContext { // check overflowing assert(PCAUtil.memoryCost(4, 6) > Int.MaxValue) } + + test("number of features more than 65500") { +val rows = 10 +val columns = 10 +val k = 5 +val randomRDD = RandomRDDs.normalVectorRDD(sc, rows, columns, 0, 0) +val pca = new PCA(k).fit(randomRDD) +assert(pca.explainedVariance.size === 5) +assert(pca.pc.numRows === 10 && pca.pc.numCols === 5) --- End diff -- I wonder if there's any reasonable way to check the answer here, like some bounds on what the eigenvalues/eigenvectors should be? Like the eigenvalues would at least be positive?
[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22784#discussion_r228713878 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala --- @@ -54,4 +55,14 @@ class PCASuite extends SparkFunSuite with MLlibTestSparkContext { // check overflowing assert(PCAUtil.memoryCost(4, 6) > Int.MaxValue) } + + test("number of features more than 65500") { --- End diff -- 65535?
[GitHub] spark pull request #22849: [SPARK-25852][Core] we should filter the workOffe...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22849#discussion_r228713414 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -240,7 +240,7 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized { // Filter out executors under killing val activeExecutors = executorDataMap.filterKeys(executorIsAlive) -val workOffers = activeExecutors.map { +val workOffers = activeExecutors.filter(_._2.freeCores > 0).map { --- End diff -- I don't know this code well but the comment above implies that it means to make an offer on all executors? What is the performance impact anyway?
[GitHub] spark issue #22813: [SPARK-25818][CORE] WorkDirCleanup should only remove th...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22813 I don't think that's a reasonable usage scenario. That said, is there any harm to this change? Would it ever miss cleaning something up that it should?
[GitHub] spark issue #22860: Branch 2.4
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22860 @sarojchand close this
[GitHub] spark issue #22859: Branch 2.2
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22859 @sarojchand close this
[GitHub] spark issue #22425: [SPARK-23367][Build] Include python document style check...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22425 Merged to master. Anyone who feels like addressing the style warnings afterwards (some may be easy to fix), go ahead.
[GitHub] spark pull request #22425: [SPARK-23367][Build] Include python document styl...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22425#discussion_r228712794 --- Diff: dev/lint-python --- @@ -99,6 +104,29 @@ else echo "flake8 checks passed." fi +# Check python document style, skip check if pydocstyle is not installed. --- End diff -- Actually, it seems like indentation is inconsistent in this file, so let's leave it for now.
[GitHub] spark issue #22850: [MINOR][DOC] Fix comment error of HiveUtils
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22850 Merged to master
[GitHub] spark issue #22815: [SPARK-25821][SQL] Remove SQLContext methods deprecated ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22815 Merged to master
[GitHub] spark issue #22848: [SPARK-25851][SQL][MINOR] Fix deprecated API warning in ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22848 Merged to master
[GitHub] spark issue #22852: [SPARK-25023] Clarify Spark security documentation
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22852 I don't feel strongly about it; go ahead. If someone lands on this page, do they pretty easily come away with the impression they need to set spark.authenticate and network security if they care about security? If so, great. If it's just adding text they might skip over, maybe revise it. That's how I think about it. I think you can make edits for Mesos and K8S here too.
[GitHub] spark pull request #22854: [SPARK-25854] fix mvn to not always exit 1
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22854#discussion_r228583684 --- Diff: build/mvn --- @@ -163,8 +163,19 @@ export MAVEN_OPTS=${MAVEN_OPTS:-"$_COMPILE_JVM_OPTS"} echo "Using \`mvn\` from path: $MVN_BIN" 1>&2 -# Last, call the `mvn` command as usual +# call the `mvn` command as usual "${MVN_BIN}" -DzincPort=${ZINC_PORT} "$@" -# Try to shut down zinc explicitly -"${ZINC_BIN}" -shutdown -port ${ZINC_PORT} +# check to see if zinc server is still running post-build +"${ZINC_BIN}" -status -port ${ZINC_PORT} &> /dev/null +ZINC_STATUS=$? + +# Try to shut down zinc explicitly if the server is still running +if [ $ZINC_STATUS -eq 0 ]; then + # zinc is still running! + "${ZINC_BIN}" -shutdown -port ${ZINC_PORT} + exit 0 --- End diff -- I suppose you just want to exit 0 outside the if-else, for clarity, but whatever. This is all fine and can be back-ported to 2.2.
[GitHub] spark issue #22802: [SPARK-25806][SQL]The instance of FileSplit is redundant...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22802 OK, that's pretty trivial, but I agree. Any other instances of this type of thing?
[GitHub] spark issue #22843: [SPARK-16693][SPARKR] Remove methods deprecated
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22843 I personally am OK with removing anything deprecated before 2.4. Even things deprecated in 2.4 are technically fair game. If any of these were deprecated pretty recently, it might be debatable, but I am in favor, myself.
[GitHub] spark issue #22826: [SPARK-25760][DOCS][FOLLOWUP] Add note about AddJar retu...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22826 Merged to master
[GitHub] spark pull request #22815: [SPARK-25821][SQL] Remove SQLContext methods depr...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22815#discussion_r228554831 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -54,6 +54,7 @@ import org.apache.spark.sql.util.ExecutionListenerManager * @groupname Ungrouped Support functions for language integrated queries * @since 1.0.0 */ +@deprecated("Use SparkSession instead", "3.0.0") --- End diff -- One more very late question for 2.4 -- @rxin, is it even better to deprecate this right now in 2.4.0? And in Python and R?
[GitHub] spark pull request #22848: [SPARK-25851][SQL][MINOR] Fix deprecated API warn...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22848#discussion_r228553964 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala --- @@ -89,12 +89,12 @@ private class LongLongTupleConverter extends Converter[(Object, Object), (Long, } override def getInputType(typeFactory: TypeFactory): JavaType = { -val objectType = typeFactory.uncheckedSimpleType(classOf[Object]) -typeFactory.constructSimpleType(classOf[(_, _)], classOf[(_, _)], Array(objectType, objectType)) +val objectType = typeFactory.constructType(classOf[Object]) +typeFactory.constructSimpleType(classOf[(_, _)], Array(objectType, objectType)) } override def getOutputType(typeFactory: TypeFactory): JavaType = { -val longType = typeFactory.uncheckedSimpleType(classOf[Long]) -typeFactory.constructSimpleType(classOf[(_, _)], classOf[(_, _)], Array(longType, longType)) +val longType = typeFactory.constructType(classOf[(Long)]) --- End diff -- Are the parentheses around `Long` needed?
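For what it's worth, a tiny standalone check of the question (hypothetical snippet, not from the PR): parentheses around a single type are redundant in Scala, so both spellings denote the same `Class` object.

```scala
// classOf[(Long)] parses exactly as classOf[Long]; the parentheses add nothing
assert(classOf[(Long)] == classOf[Long])
```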
[GitHub] spark issue #22852: [SPARK-25023] Clarify Spark security documentation
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22852 I get it, the "it's your responsibility" stance, and it is. For any risk there's a sentence in this doc we could point to and say, "see, told you". If we're going to make a change here, adding another paragraph saying "below, we told you so" isn't additive, especially if we're trying to use this change to actively mitigate security issues. More useful would be a cheat-sheet or TL;DR simply enumerating the top things you don't want to miss. I think that's more useful than redundant. I can take a crack at that too.
[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22723 Kind of. As you say, it lets a particular `InputFormat` decide this. But it still uses minPartitions as input. See again https://issues.apache.org/jira/browse/SPARK-22357 for some subtlety, and that much I agree with. I think that's the substance of your change too. But then, if that's true, isn't this a no-op for the case you're interested in? It does affect other `InputFormat`s, but I'm not clear that's valid. And why make this change in just one place?