[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

2018-11-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22894#discussion_r231122195
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -149,7 +150,7 @@ private[spark] class HighlyCompressedMapStatus private (
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
 private[this] var avgSize: Long,
-private var hugeBlockSizes: Map[Int, Byte])
+private[this] var hugeBlockSizes: mutable.Map[Int, Byte])
--- End diff --

It's mutated, though, and needs to be mutable. If it were exposed outside
the class, or there were significant danger of accidentally mutating it
elsewhere, I think it might be necessary to wrap the result in an immutable
wrapper, but here this seems OK to me.
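For illustration, a minimal sketch of the immutable-wrapper idea, if the field ever did need to be exposed (the accessor name here is hypothetical, not part of the PR):

```
import scala.collection.mutable

class MapStatusSketch {
  // The field stays mutable so deserialization can populate it in place.
  private[this] var hugeBlockSizes: mutable.Map[Int, Byte] = mutable.Map.empty

  // If it ever needed to be exposed, hand out an immutable snapshot instead.
  def hugeBlockSizesView: Map[Int, Byte] = hugeBlockSizes.toMap
}
```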


---




[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

2018-11-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22894#discussion_r231121793
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -189,13 +188,12 @@ private[spark] class HighlyCompressedMapStatus 
private (
 emptyBlocks.readExternal(in)
 avgSize = in.readLong()
 val count = in.readInt()
-val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
+hugeBlockSizes = mutable.Map.empty[Int, Byte]
 (0 until count).foreach { _ =>
   val block = in.readInt()
   val size = in.readByte()
-  hugeBlockSizesArray += Tuple2(block, size)
+  hugeBlockSizes.asInstanceOf[mutable.Map[Int, Byte]].update(block, size)
--- End diff --

Yes, it is a mutable map and used as a mutable map. Its type must reflect 
that.


---




[GitHub] spark issue #22950: [MINOR] Fix typos and misspellings

2018-11-05 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22950
  
Merged to master/2.4


---




[GitHub] spark pull request #17086: [SPARK-24101][ML][MLLIB] ML Evaluators should use...

2018-11-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17086#discussion_r230949468
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.scala
 ---
@@ -67,6 +68,10 @@ class MulticlassClassificationEvaluator @Since("1.5.0") 
(@Since("1.5.0") overrid
   @Since("1.5.0")
   def setLabelCol(value: String): this.type = set(labelCol, value)
 
+  /** @group setParam */
+  @Since("2.4.0")
--- End diff --

Will need to be 3.0.0


---




[GitHub] spark pull request #17086: [SPARK-24101][ML][MLLIB] ML Evaluators should use...

2018-11-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17086#discussion_r230949719
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala 
---
@@ -27,10 +27,17 @@ import org.apache.spark.sql.DataFrame
 /**
  * Evaluator for multiclass classification.
  *
- * @param predictionAndLabels an RDD of (prediction, label) pairs.
+ * @param predAndLabelsWithOptWeight an RDD of (prediction, label, weight) 
or
+ * (prediction, label) pairs.
  */
 @Since("1.1.0")
-class MulticlassMetrics @Since("1.1.0") (predictionAndLabels: RDD[(Double, 
Double)]) {
+class MulticlassMetrics @Since("2.4.0") (predAndLabelsWithOptWeight: 
RDD[_]) {
--- End diff --

Although we're not updating .mllib much at all now, this is a simple and
backwards-compatible change, so I think it's OK. It is the implementation
behind the .ml version anyway.


---




[GitHub] spark pull request #17086: [SPARK-24101][ML][MLLIB] ML Evaluators should use...

2018-11-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17086#discussion_r230949430
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala
 ---
@@ -18,10 +18,14 @@
 package org.apache.spark.mllib.evaluation
 
 import org.apache.spark.SparkFunSuite
-import org.apache.spark.mllib.linalg.Matrices
+import org.apache.spark.ml.linalg.Matrices
+import org.apache.spark.ml.util.TestingUtils._
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 
 class MulticlassMetricsSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  val delta = 1e-7
--- End diff --

`private`?


---




[GitHub] spark pull request #22087: [SPARK-25097][ML] Support prediction on single in...

2018-11-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22087#discussion_r230948830
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala ---
@@ -117,7 +117,8 @@ class BisectingKMeansModel private[ml] (
 validateAndTransformSchema(schema)
   }
 
-  private[clustering] def predict(features: Vector): Int = parentModel.predict(features)
+  @Since("2.4.0")
--- End diff --

This would have to be 3.0.0.

I don't see a good reason not to expose this. CCing also maybe @mgaido91 ?
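i.e., roughly the existing private method made public with the right version tag (assuming it lands in 3.0.0):

```
@Since("3.0.0")
def predict(features: Vector): Int = parentModel.predict(features)
```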


---




[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...

2018-11-05 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22914
  
Merged to master


---




[GitHub] spark issue #22940: [MINOR][R] Rename SQLUtils name to RSQLUtils in R

2018-11-05 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22940
  
I'm neutral on it... seems like a logical change but is there any issue 
(like ambiguous imports that are annoying) other than not matching the pattern? 
I am not super concerned about back-porting conflicts, but still it's a 
non-trivial concern.


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230784165
  
--- Diff: R/pkg/R/generics.R ---
@@ -748,7 +748,7 @@ setGeneric("add_months", function(y, x) { 
standardGeneric("add_months") })
 
 #' @rdname column_aggregate_functions
 #' @name NULL
-setGeneric("approxCountDistinct", function(x, ...) { 
standardGeneric("approxCountDistinct") })
+setGeneric("approx_count_distinct", function(x, ...) { 
standardGeneric("approx_count_distinct") })
--- End diff --

I think my comment didn't get connected to this one -- @felixcheung, what do
you think about the argument that this almost surely was meant to be deprecated
along with its counterparts in Scala/Python? Leaving them in would make this
inconsistent. As degrees, radians, and approxCountDistinct are reasonably
niche and have a direct replacement that's compatible with older versions, I
feel like this is OK for 3.0?


---




[GitHub] spark pull request #22950: [MINOR] Fix typos and misspellings

2018-11-05 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/22950

[MINOR] Fix typos and misspellings

## What changes were proposed in this pull request?

Fix typos and misspellings, per 
https://github.com/apache/spark-website/pull/158#issuecomment-435790366

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark Typos

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22950.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22950


commit 8ebc3c038b72db726868b005ebb7ee5cca78cdbd
Author: Sean Owen 
Date:   2018-11-05T15:02:34Z

Fix typos and misspellings




---




[GitHub] spark issue #22931: [SPARK-25930][K8s] Fix scala string detection in k8s tes...

2018-11-05 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22931
  
Merged to master/2.4


---




[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...

2018-11-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22921
  
Yeah, it's a good point that these weren't deprecated, but I assume they
should have been. Same change, same time, same logic. Given that it's a
reasonably niche method, I thought it would be best to go ahead and be
consistent here?


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-03 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230568378
  
--- Diff: R/pkg/R/functions.R ---
@@ -319,23 +319,23 @@ setMethod("acos",
   })
 
 #' @details
-#' \code{approxCountDistinct}: Returns the approximate number of distinct 
items in a group.
+#' \code{approx_count_distinct}: Returns the approximate number of 
distinct items in a group.
 #'
 #' @rdname column_aggregate_functions
-#' @aliases approxCountDistinct approxCountDistinct,Column-method
+#' @aliases approx_count_distinct approx_count_distinct,Column-method
 #' @examples
 #'
 #' \dontrun{
-#' head(select(df, approxCountDistinct(df$gear)))
-#' head(select(df, approxCountDistinct(df$gear, 0.02)))
+#' head(select(df, approx_count_distinct(df$gear)))
+#' head(select(df, approx_count_distinct(df$gear, 0.02)))
 #' head(select(df, countDistinct(df$gear, df$cyl)))
 #' head(select(df, n_distinct(df$gear)))
 #' head(distinct(select(df, "gear")))}
-#' @note approxCountDistinct(Column) since 1.4.0
-setMethod("approxCountDistinct",
+#' @note approx_count_distinct(Column) since 2.0.0
--- End diff --

Right, will fix that one too if I missed it, per 
https://github.com/apache/spark/pull/22921#discussion_r230449173


---




[GitHub] spark issue #22933: [SPARK-25933][DOCUMENTATION] Fix pstats.Stats() referenc...

2018-11-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22933
  
Merged to master/2.4/2.3


---




[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22893
  
OK, the Spark part doesn't seem relevant. The input might be more realistic 
here, yes. I was commenting that your test code doesn't show what you're 
testing, though I understand you manually modified it. Because the test is so 
central here I think it's important to understand exactly what you're measuring 
and exactly what you're running.

This doesn't show an improvement, right?


---




[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

2018-11-03 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22894#discussion_r230556818
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -189,13 +188,12 @@ private[spark] class HighlyCompressedMapStatus 
private (
 emptyBlocks.readExternal(in)
 avgSize = in.readLong()
 val count = in.readInt()
-val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
+hugeBlockSizes = mutable.Map.empty[Int, Byte]
 (0 until count).foreach { _ =>
   val block = in.readInt()
   val size = in.readByte()
-  hugeBlockSizesArray += Tuple2(block, size)
+  hugeBlockSizes.asInstanceOf[mutable.Map[Int, Byte]].update(block, size)
--- End diff --

Why cast it? It is used as a mutable map and its type is a mutable map, so the
type on line 151 is wrong. Also, just `hugeBlockSizes(block) = size`, no?
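Roughly, assuming the field is declared as `mutable.Map[Int, Byte]`, the loop could read (a sketch, not the exact patch):

```
hugeBlockSizes = mutable.Map.empty[Int, Byte]
(0 until count).foreach { _ =>
  val block = in.readInt()
  val size = in.readByte()
  hugeBlockSizes(block) = size  // same as .update(block, size), no cast needed
}
```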


---




[GitHub] spark issue #22934: [BUILD] Close stale PRs

2018-11-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22934
  
Maybe add https://github.com/apache/spark/pull/22849



---




[GitHub] spark issue #22855: [SPARK-25839] [Core] Implement use of KryoPool in KryoSe...

2018-11-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22855
  
Oops, something to do with the benchmark class:

```
[error] 
/home/jenkins/workspace/NewSparkPullRequestBuilder/core/src/test/scala/org/apache/spark/serializer/KryoSerializerBenchmark.scala:41:
 object creation impossible, since method runBenchmarkSuite in class 
BenchmarkBase of type (mainArgs: Array[String])Unit is not defined
[error] object KryoSerializerBenchmark extends BenchmarkBase {
[error]^
[error] 
/home/jenkins/workspace/NewSparkPullRequestBuilder/core/src/test/scala/org/apache/spark/serializer/KryoSerializerBenchmark.scala:45:
 method runBenchmarkSuite overrides nothing.
[error] Note: the super classes of object KryoSerializerBenchmark contain 
the following, non final members named runBenchmarkSuite:
[error] def runBenchmarkSuite(mainArgs: Array[String]): Unit
[error]   override def runBenchmarkSuite(): Unit = {
[error]^
```
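From the error, it looks like the base class now takes `mainArgs`, so the override presumably needs to be something like (sketch only):

```
object KryoSerializerBenchmark extends BenchmarkBase {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    // benchmark cases go here
  }
}
```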


---




[GitHub] spark issue #22892: [SPARK-25884][SQL] Add TBLPROPERTIES and COMMENT, and us...

2018-11-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22892
  
@cloud-fan @ueshin I'm not sure, but I'm seeing this failure regularly in
master builds and I wonder if this could be the cause? In two builds it started
failing with this (and another) change:


https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5561/testReport/

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/574/testReport/



---




[GitHub] spark issue #22864: [SPARK-25861][Minor][WEBUI] Remove unused refreshInterva...

2018-11-02 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22864
  
Merged to master


---




[GitHub] spark issue #22931: [SPARK-25930][K8s] Fix scala string detection in k8s tes...

2018-11-02 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22931
  
OK. Does this need to go in branch 2.4 too?


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-02 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230449544
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -275,15 +273,6 @@ def _():
 del _name, _doc
 
 
-@since(1.3)
-def approxCountDistinct(col, rsd=None):
--- End diff --

Yeah, in some cases the deprecated user-facing method was named the same 
way as some internal method and I changed the wrong one. I'll investigate.


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-02 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230449173
  
--- Diff: R/pkg/R/functions.R ---
@@ -1641,30 +1641,30 @@ setMethod("tanh",
   })
 
 #' @details
-#' \code{toDegrees}: Converts an angle measured in radians to an 
approximately equivalent angle
+#' \code{degrees}: Converts an angle measured in radians to an 
approximately equivalent angle
 #' measured in degrees.
 #'
 #' @rdname column_math_functions
-#' @aliases toDegrees toDegrees,Column-method
-#' @note toDegrees since 1.4.0
-setMethod("toDegrees",
+#' @aliases degrees degrees,Column-method
+#' @note degrees since 2.1.0
--- End diff --

`degrees` was added in Scala/Python in 2.1.0, which is what I was thinking
of, but yeah, really this must be since 3.0.0, right? I'll fix it.


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-02 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230436963
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -279,27 +254,6 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
   structDf.select(hash($"a", $"record.*")))
   }
 
-  test("Star Expansion - explode should fail with a meaningful message if 
it takes a star") {
--- End diff --

Yeah, I'll try to bring back the test case.


---




[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-02 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22893
  
So the pull request right now doesn't reflect what you tested, but you 
tested the version pasted above. You're saying that the optimization just never 
helps the dense-dense case, and sqdist is faster than a dot product. This 
doesn't make sense mathematically as it should be more math, but stranger 
things have happened.

Still, I don't follow your test code here. You parallelize one vector, map it,
collect it: why use Spark? And it's the same vector over and over, and it's not
a big vector. Your sparse vectors aren't very sparse.

How about more representative input -- larger vectors (100s of elements,
probably), sparser sparse vectors, and a large set of different inputs? I also
don't see where the precision bound is changed here.

This may be a good change but I'm just not yet convinced by the test 
methodology, and the result still doesn't make much intuitive sense.


---




[GitHub] spark issue #22922: [SPARK-25909] fix documentation on cluster managers

2018-11-02 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22922
  
Merged to master/2.4


---




[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...

2018-11-02 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22855#discussion_r230421073
  
--- Diff: 
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -84,6 +85,7 @@ class KryoSerializer(conf: SparkConf)
   private val avroSchemas = conf.getAvroSchema
   // whether to use unsafe based IO for serialization
   private val useUnsafe = conf.getBoolean("spark.kryo.unsafe", false)
+  private val usePool = conf.getBoolean("spark.kryo.pool", false)
--- End diff --

Yeah I think that's a fine position to take, if we can't think of a reason 
to disable it other than the theoretical unknown unknown bug out there.


---




[GitHub] spark issue #22906: [SPARK-25895][Core]Adding testcase to compare Lz4 and Zs...

2018-11-02 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22906
  
Benchmarks aren't run with tests, right? That's fine. I am still not sure
Spark is the best place for this.


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...

2018-11-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22919
  
Tough call. At least it's worth documenting that this behavior changed because
the Scala shell's behavior did too, and I'd support that. I'd also support
handling both, if there's no real downside to that option?


---




[GitHub] spark issue #22924: [SPARK-25891][PYTHON] Upgrade to Py4J 0.10.8.1

2018-11-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22924
  
Yeah, I get that. I haven't heard that the bugs fixed here impact users, but a
few sound kind of bad. Maybe this should depend more on: how much do people
need Python 3.7 in 2.4.x? That seems like a valid basis to back-port, if it
were in demand. It's a 'bug fix' of a different sort, fixing incompatibilities
with 3.7.


---




[GitHub] spark issue #22864: [SPARK-25861][Minor][WEBUI] Remove unused refreshInterva...

2018-11-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22864
  
Jenkins add to whitelist


---




[GitHub] spark issue #22864: [SPARK-25861][Minor][WEBUI] Remove unused refreshInterva...

2018-11-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22864
  
That's a spurious failure, something's wrong with the build machine


---




[GitHub] spark pull request #22922: [SPARK-25909] fix documentation on cluster manage...

2018-11-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22922#discussion_r230159527
  
--- Diff: docs/cluster-overview.md ---
@@ -45,7 +45,7 @@ There are several useful things to note about this 
architecture:
 
 # Cluster Manager Types
 
-The system currently supports three cluster managers:
+The system currently supports four cluster managers:
--- End diff --

Heh OK. Let's just say 'several' to reduce future maintenance. Anything 
else that can be fixed in this doc?


---




[GitHub] spark issue #22758: [SPARK-25332][SQL] Instead of broadcast hash join ,Sort ...

2018-11-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22758
  
I don't know this code well enough to review. I think there is skepticism from
people who know this code about whether this change is correct and beneficial.
If there's doubt, I think it should be closed.


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230156120
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -62,17 +62,6 @@ class SQLContext private[sql](val sparkSession: 
SparkSession)
 
   sparkSession.sparkContext.assertNotStopped()
 
-  // Note: Since Spark 2.0 this class has become a wrapper of 
SparkSession, where the
--- End diff --

Makes sense, will add it back. Yeah will leave this open a short while to 
make sure there is time to comment.


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230155788
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -639,20 +639,6 @@ private[spark] object SparkConf extends Logging {
*/
   private val deprecatedConfigs: Map[String, DeprecatedConfig] = {
 val configs = Seq(
-  DeprecatedConfig("spark.cache.class", "0.8",
--- End diff --

Yeah I can add them back. Wasn't sure whether they are still valuable or 
just old.


---




[GitHub] spark issue #22924: [SPARK-25891][PYTHON] Upgrade to Py4J 0.10.8.1

2018-11-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22924
  
I think it's OK to backport, as 3.7 support is actually fairly important, and
this also fixes bugs: https://www.py4j.org/changelog.html

It drops Python 2.6 support, but we did so as well a long time ago, in 2.2.0.


---




[GitHub] spark issue #22909: [SPARK-25897][k8s] Hook up k8s integration tests to sbt ...

2018-11-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22909
  
I don't know SBT much myself. I would believe this works.


---




[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...

2018-11-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22914
  
Yes, if the behavior is consistent across pages, that sounds fine.


---




[GitHub] spark pull request #22908: [MONOR][SQL] Replace all TreeNode's node name in ...

2018-11-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22908#discussion_r230060094
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 ---
@@ -56,7 +56,7 @@ case class DataSourceV2Relation(
 
   override def pushedFilters: Seq[Expression] = Seq.empty
 
-  override def simpleString: String = "RelationV2 " + metadataString
+  override def simpleString: String = s"$nodeName " + metadataString
--- End diff --

This uses interpolation and concatenation - just interpolate
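i.e. something like:

```
override def simpleString: String = s"$nodeName $metadataString"
```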


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230057306
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala
 ---
@@ -246,7 +246,7 @@ class MathExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 
   test("toDegrees") {
 testUnary(ToDegrees, math.toDegrees)
-checkConsistencyBetweenInterpretedAndCodegen(Acos, DoubleType)
+checkConsistencyBetweenInterpretedAndCodegen(ToDegrees, DoubleType)
--- End diff --

This was just a bug fix I spotted


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230057481
  
--- Diff: R/pkg/R/generics.R ---
@@ -748,7 +748,7 @@ setGeneric("add_months", function(y, x) { 
standardGeneric("add_months") })
 
 #' @rdname column_aggregate_functions
 #' @name NULL
-setGeneric("approxCountDistinct", function(x, ...) { 
standardGeneric("approxCountDistinct") })
+setGeneric("approx_count_distinct", function(x, ...) { 
standardGeneric("approx_count_distinct") })
--- End diff --

@felixcheung might want to check if I'm handling these R changes correctly


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-01 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/22921

[SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3

## What changes were proposed in this pull request?

- Remove some AccumulableInfo .apply() methods
- Remove non-label-specific multiclass precision/recall/fScore in favor of 
accuracy
- Remove toDegrees/toRadians in favor of degrees/radians
- Remove approxCountDistinct in favor of approx_count_distinct
- Remove unused Python StorageLevel constants
- Remove Dataset unionAll in favor of union
- Remove unused multiclass option in libsvm parsing
- Remove references to deprecated spark configs like spark.yarn.am.port
- Remove TaskContext.isRunningLocally
- Remove ShuffleMetrics.shuffle* methods
- Remove BaseReadWrite.context in favor of session
- Remove Column.!== in favor of =!=
- Remove Dataset.explode
- Remove Dataset.registerTempTable
- Remove SQLContext.getOrCreate, setActive, clearActive, constructors

Not touched yet

- everything else in MLLib
- HiveContext
- Anything deprecated more recently than 2.0.0, generally

## How was this patch tested?

Existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-25908

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22921.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22921


commit 259e7d15f302a02fbf09dbc63f46cd1d690b1c2a
Author: Sean Owen 
Date:   2018-11-01T14:18:29Z

Remove many older deprecated items in Spark 3




---




[GitHub] spark issue #22813: [SPARK-25818][CORE] WorkDirCleanup should only remove th...

2018-11-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22813
  
I just think that if you have engineers randomly writing and reading stuff 
in this dir, a bunch of other stuff goes wrong. This is not a problem that 
Spark can reasonably solve. Certainly, you have much bigger production problems 
if this level of discipline can't be enforced.


---




[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22893
  
Hm, actually that's the best case. You're exercising the case where the 
code path you prefer is fast. And the case where the precision bound applies is 
exactly the case where the branch you deleted helps. As I say, you'd have to 
show this is not impacting other cases significantly, and I think it should. 
Consider the sparse-sparse case.


---




[GitHub] spark pull request #22876: [SPARK-25869] [YARN] the original diagnostics is ...

2018-11-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22876#discussion_r230053125
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
 ---
@@ -293,6 +293,9 @@ private[spark] class ApplicationMaster(args: 
ApplicationMasterArguments) extends
 }
 
 if (!unregistered) {
+  logInfo("Waiting for " + sparkConf.get("spark.yarn.report.interval", "1000").toInt +"ms to unregister am," +
--- End diff --

Use interpolation. No need to call `.toInt`. am -> AM or Application Master.
msg -> message; these should be complete sentences. You get the config twice
here. Is this the default used elsewhere?

I don't know this code well, and I don't think a Thread.sleep is a great way
to coordinate.
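As a rough sketch of the interpolation / single-read points (not addressing the Thread.sleep concern; key and default taken from the diff above):

```
val reportIntervalMs = sparkConf.get("spark.yarn.report.interval", "1000")
logInfo(s"Waiting $reportIntervalMs ms to unregister the ApplicationMaster.")
```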


---




[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...

2018-11-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22855#discussion_r230052150
  
--- Diff: 
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -84,6 +85,7 @@ class KryoSerializer(conf: SparkConf)
   private val avroSchemas = conf.getAvroSchema
   // whether to use unsafe based IO for serialization
   private val useUnsafe = conf.getBoolean("spark.kryo.unsafe", false)
+  private val usePool = conf.getBoolean("spark.kryo.pool", false)
--- End diff --

I would not document it. This is just a safety valve. In theory, there's no 
reason to disable this nor would a caller know why to disable it.


---




[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...

2018-11-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22914
  
Returning to the current or last page seems fine too. No error message really
needed, IMHO. However, it sounds like other pages already go to the 1st page on
invalid input? I'd also like to stay consistent.


---




[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

2018-10-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22894#discussion_r229812240
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -189,13 +190,12 @@ private[spark] class HighlyCompressedMapStatus 
private (
 emptyBlocks.readExternal(in)
 avgSize = in.readLong()
 val count = in.readInt()
-val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
+hugeBlockSizes = new util.HashMap[Int, Byte](count).asScala
--- End diff --

Does calling `HashMap.sizeHint(...)` here actually help?
I'd stick to the Scala collection for now.


---




[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

2018-10-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22894#discussion_r229776086
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -189,13 +190,12 @@ private[spark] class HighlyCompressedMapStatus 
private (
 emptyBlocks.readExternal(in)
 avgSize = in.readLong()
 val count = in.readInt()
-val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
+hugeBlockSizes = new util.HashMap[Int, Byte](count).asScala
--- End diff --

How about just Scala's mutable Map? I'd expect it's no slower than Java's,
given it might specialize for primitives (not sure about this) and sometimes
has smarter implementations internally.


---




[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...

2018-10-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22855#discussion_r229738657
  
--- Diff: 
core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala ---
@@ -456,9 +458,63 @@ class KryoSerializerSuite extends SparkFunSuite with 
SharedSparkContext {
 
   // Regression test for SPARK-7766, an issue where disabling auto-reset 
and enabling
   // reference-tracking would lead to corrupted output when serializer 
instances are re-used
-  for (referenceTracking <- Set(true, false); autoReset <- Set(true, 
false)) {
-test(s"instance reuse with autoReset = $autoReset, referenceTracking = 
$referenceTracking") {
-  testSerializerInstanceReuse(autoReset = autoReset, referenceTracking 
= referenceTracking)
+  for {
+referenceTracking <- Set(true, false)
+autoReset <- Set(true, false)
+usePool <- Set(true, false)
+  } {
+test(s"instance reuse with autoReset = $autoReset, referenceTracking = 
$referenceTracking" +
+  s", usePool = $usePool") {
+  testSerializerInstanceReuse(
+autoReset = autoReset, referenceTracking = referenceTracking, 
usePool = usePool)
+}
+  }
+
+  test("SPARK-25839 KryoPool implementation works correctly in 
multi-threaded environment") {
+import java.util.concurrent.Executors
--- End diff --

I'd import at the top of the file.


---




[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...

2018-10-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22855#discussion_r229740163
  
--- Diff: 
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -84,6 +85,7 @@ class KryoSerializer(conf: SparkConf)
   private val avroSchemas = conf.getAvroSchema
   // whether to use unsafe based IO for serialization
   private val useUnsafe = conf.getBoolean("spark.kryo.unsafe", false)
+  private val usePool = conf.getBoolean("spark.kryo.pool", false)
--- End diff --

Yep, I'd leave this undocumented


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...

2018-10-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22855#discussion_r229739685
  
--- Diff: 
core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala ---
@@ -456,9 +458,63 @@ class KryoSerializerSuite extends SparkFunSuite with 
SharedSparkContext {
 
   // Regression test for SPARK-7766, an issue where disabling auto-reset 
and enabling
   // reference-tracking would lead to corrupted output when serializer 
instances are re-used
-  for (referenceTracking <- Set(true, false); autoReset <- Set(true, 
false)) {
-test(s"instance reuse with autoReset = $autoReset, referenceTracking = 
$referenceTracking") {
-  testSerializerInstanceReuse(autoReset = autoReset, referenceTracking 
= referenceTracking)
+  for {
+referenceTracking <- Set(true, false)
+autoReset <- Set(true, false)
+usePool <- Set(true, false)
+  } {
+test(s"instance reuse with autoReset = $autoReset, referenceTracking = 
$referenceTracking" +
+  s", usePool = $usePool") {
+  testSerializerInstanceReuse(
+autoReset = autoReset, referenceTracking = referenceTracking, 
usePool = usePool)
--- End diff --

Likewise, I see this is just how the code was written before, but the
`foo = foo` style isn't adding anything, IMHO. Feel free to not name the args.


---




[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...

2018-10-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22855#discussion_r229738553
  
--- Diff: 
core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala ---
@@ -456,9 +458,63 @@ class KryoSerializerSuite extends SparkFunSuite with 
SharedSparkContext {
 
   // Regression test for SPARK-7766, an issue where disabling auto-reset 
and enabling
   // reference-tracking would lead to corrupted output when serializer 
instances are re-used
-  for (referenceTracking <- Set(true, false); autoReset <- Set(true, 
false)) {
-test(s"instance reuse with autoReset = $autoReset, referenceTracking = 
$referenceTracking") {
-  testSerializerInstanceReuse(autoReset = autoReset, referenceTracking 
= referenceTracking)
+  for {
+referenceTracking <- Set(true, false)
--- End diff --

Not that it matters, but I think this should have originally been Seq, not Set.


---




[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...

2018-10-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22855#discussion_r229737692
  
--- Diff: 
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -92,6 +94,16 @@ class KryoSerializer(conf: SparkConf)
   new KryoOutput(bufferSize, math.max(bufferSize, maxBufferSize))
 }
 
+  @transient
+  private lazy val factory: KryoFactory = new KryoFactory() {
--- End diff --

I think you're welcome to just write ...
```
private lazy val factory = new KryoFactory() {
  override def create: Kryo = newKryo()
}
```
but it doesn't matter.


---




[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...

2018-10-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22855#discussion_r229738794
  
--- Diff: 
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -30,6 +30,7 @@ import scala.util.control.NonFatal
 import com.esotericsoftware.kryo.{Kryo, KryoException, Serializer => 
KryoClassSerializer}
 import com.esotericsoftware.kryo.io.{Input => KryoInput, Output => 
KryoOutput}
 import com.esotericsoftware.kryo.io.{UnsafeInput => KryoUnsafeInput, 
UnsafeOutput => KryoUnsafeOutput}
+import com.esotericsoftware.kryo.pool._
--- End diff --

I'd spell out the imports for clarity unless it's going to run more than 2 
lines or something
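e.g., given the two classes this PR appears to use from that package:

```
import com.esotericsoftware.kryo.pool.{KryoFactory, KryoPool}
```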


---




[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...

2018-10-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22855#discussion_r229737307
  
--- Diff: 
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -92,6 +94,16 @@ class KryoSerializer(conf: SparkConf)
   new KryoOutput(bufferSize, math.max(bufferSize, maxBufferSize))
 }
 
+  @transient
+  private lazy val factory: KryoFactory = new KryoFactory() {
+override def create: Kryo = {
+  newKryo()
+}
+  }
+
+  @transient
+  lazy val pool = new KryoPool.Builder(factory).softReferences.build
--- End diff --

`private`?


---




[GitHub] spark pull request #22723: [SPARK-25729][CORE]It is better to replace `minPa...

2018-10-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22723#discussion_r229720799
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/WholeTextFileRDD.scala 
---
@@ -51,7 +51,7 @@ private[spark] class WholeTextFileRDD(
   case _ =>
 }
 val jobContext = new JobContextImpl(conf, jobId)
-inputFormat.setMinPartitions(jobContext, minPartitions)
+inputFormat.setMinPartitions(sc, jobContext, minPartitions)
--- End diff --

You don't need to pass the context. You just need to pass the one value 
you're going to use from it.
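A sketch of what I mean (the parameter name is illustrative, and the body just reuses the existing code from the other diff in this PR):

```
def setMinPartitions(defaultParallelism: Int, context: JobContext, minPartitions: Int): Unit = {
  val files = listStatus(context).asScala
  val totalLen = files.map(file => if (file.isDirectory) 0L else file.getLen).sum
  val minPartNum = Math.max(defaultParallelism, minPartitions)
  val maxSplitSize = Math.ceil(totalLen * 1.0 / minPartNum).toLong
  super.setMaxSplitSize(maxSplitSize)  // unchanged tail of the method, elided in the diff above
}

// call site in WholeTextFileRDD would then be:
// inputFormat.setMinPartitions(sc.defaultParallelism, jobContext, minPartitions)
```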


---




[GitHub] spark pull request #22723: [SPARK-25729][CORE]It is better to replace `minPa...

2018-10-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22723#discussion_r229721018
  
--- Diff: 
core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala ---
@@ -48,11 +50,11 @@ private[spark] class WholeTextFileInputFormat
* Allow minPartitions set by end-user in order to keep compatibility 
with old Hadoop API,
* which is set through setMaxSplitSize
*/
-  def setMinPartitions(context: JobContext, minPartitions: Int) {
+  def setMinPartitions(sc: SparkContext, context: JobContext, 
minPartitions: Int) {
 val files = listStatus(context).asScala
 val totalLen = files.map(file => if (file.isDirectory) 0L else 
file.getLen).sum
-val maxSplitSize = Math.ceil(totalLen * 1.0 /
-  (if (minPartitions == 0) 1 else minPartitions)).toLong
+val minPartNum = Math.max(sc.defaultParallelism, minPartitions)
+val maxSplitSize = Math.ceil(totalLen * 1.0 / minPartNum).toLong
--- End diff --

Yes, this no longer matches the title or JIRA. I am also not clear on the
argument for why this is better.


---




[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-10-31 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22893
  
I don't think BLAS matters here as these are all vector-vector operations 
and f2jblas is used directly (i.e. stays in the JVM). 

Are all the vectors dense? I suppose I'm still surprised if sqdist is 
faster than dot here as it ought to be a little more math. The sparse-dense 
case might come out differently, note. 

And I suppose I have a hard time believing that the sparse-sparse case is 
faster after this change (when the precision bound is met) because now it's 
handled in the sparse-sparse if case in this code, which definitely does a dot 
plus more work.

(If you did remove this check you could remove some other values that get 
computed to check this bound, like precision1)
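(For context, my reading of the shortcut being discussed: with the norms
precomputed, it relies on the identity
||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b),
so when the precision bound holds only the dot product needs to be computed per
pair, and otherwise it falls back to an explicit sqdist. Worth double-checking
against `fastSquaredDistance` itself.)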


---




[GitHub] spark issue #22901: [SPARK-25891][PYTHON] Upgrade to Py4J 0.10.8.1

2018-10-31 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22901
  
That sounds like a good idea. I wonder if it's safe to back-port to 2.4/2.3?
It should be, if it's just a maintenance release.


---




[GitHub] spark issue #22906: [SPARK-25895][Core]Adding testcase to compare Lz4 and Zs...

2018-10-31 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22906
  
I don't think this tests something about the correctness of Spark though. I 
am not sure this is worth it.


---




[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

2018-10-31 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22894
  
I also would not expect updating an immutable data structure to be faster. 
Building a map once from tuples at the end seems better than rebuilding a map 
each time. Under the hood the immutable map is going to be a HashTrieMap (a map 
of smaller optimized immutable maps) and its updated0 method does some clever 
stuff to avoid recreating the whole map.

But, yeah, why immutable here to begin with? It ought to be better still to
update a mutable Map. And then I am still not sure why it would be faster to
keep the map invariants over this loop rather than build the map once at the
end, with its size known ahead of time.

Benchmarks are good evidence but we just need to make sure that the 
difference is material as used in Spark. It may well be.
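To make the comparison concrete, a toy sketch of the three approaches (illustrative only, not the PR's code):

```
import scala.collection.mutable

val entries = Seq(1 -> 10.toByte, 2 -> 20.toByte)

// 1) Rebuild an immutable Map on every insertion (a new map per update):
var immutableSizes = Map.empty[Int, Byte]
for ((block, size) <- entries) immutableSizes += (block -> size)

// 2) Update a mutable Map in place:
val mutableSizes = mutable.Map.empty[Int, Byte]
for ((block, size) <- entries) mutableSizes(block) = size

// 3) Collect tuples and build the map once at the end, size known up front:
val builtOnce: Map[Int, Byte] = entries.toMap
```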


---




[GitHub] spark issue #22852: [SPARK-25023] Clarify Spark security documentation

2018-10-30 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22852
  
I think these are good changes. In a separate PR for the versions-specific 
docs, we could add a similar note to 
https://spark.apache.org/docs/latest/spark-standalone.html as much of the 
security concern is around the standalone master.


---




[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...

2018-10-30 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22784
  
Merged to master


---




[GitHub] spark issue #22849: [SPARK-25852][Core] we should filter the workOffers with...

2018-10-30 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22849
  
Yes, that is the comment I have been referring to. So it seems you can't
filter, right? It's not scheduling work here. @jiangxb1987 do you know this
part of the code?


---




[GitHub] spark issue #22885: [BUILD][MINOR] release script should not interrupt by sv...

2018-10-30 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22885
  
Looks fine. Maybe some Maven cache somewhere has to be deleted as it's 
corrupt?


---




[GitHub] spark issue #22856: [SPARK-25856][SQL][MINOR] Remove AverageLike and CountLi...

2018-10-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22856
  
Merged to master


---




[GitHub] spark issue #22813: [SPARK-25818][CORE] WorkDirCleanup should only remove th...

2018-10-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22813
  
I don't know what else goes in the work dir. It isn't valid to reuse it for 
anything else. Can you simply avoid using a work dir that is or has been used 
by something else?

The argument for making this change anyway is just that the code should delete
only what it writes. But I am just not sure it can be something you rely on for
correct behavior.



---




[GitHub] spark issue #22852: [SPARK-25023] Clarify Spark security documentation

2018-10-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22852
  
A quick pointer to security issues in other key places sounds good. As long as
it increases the chance users understand the specific issue and isn't more
general text to skip past, it is helping.


---




[GitHub] spark pull request #22849: [SPARK-25852][Core] we should filter the workOffe...

2018-10-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22849#discussion_r228936286
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 ---
@@ -240,7 +240,7 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
   val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
 // Filter out executors under killing
 val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
-val workOffers = activeExecutors.map {
+val workOffers = activeExecutors.filter(_._2.freeCores > 0).map {
--- End diff --

I'm saying there is a comment a few lines above that describes this as a 
fake offer that explicitly intends to contact all executors. I think we need to 
figure out if that's still relevant. I don't have git in front of me but check 
git blame or GitHub to see when this was written?


---




[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...

2018-10-28 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22784#discussion_r228769714
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -49,7 +50,16 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
   "Try reducing the parameter k for PCA, or reduce the input feature " 
+
   "vector dimension to make this tractable.")
 
-val mat = new RowMatrix(sources)
+val mat = if (numFeatures > 65535) {
+  val meanVector = Statistics.colStats(sources).mean
--- End diff --

Ah, another small nit: what about computing `meanVector.asBreeze` only once?
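
Roughly what I have in mind (a sketch only, assuming the surrounding code in 
PCA.scala centers each row by subtracting the column means; note `asBreeze` and 
`fromBreeze` are private[spark], so this only compiles inside Spark's own packages):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.stat.Statistics

    val meanVector = Statistics.colStats(sources).mean
    // Convert to Breeze once, outside the closure, instead of on every row.
    val meanBV = meanVector.asBreeze
    val meanCentredRdd = sources.map { row =>
      // Per-row work is now just one conversion and one subtraction.
      Vectors.fromBreeze(row.asBreeze - meanBV)
    }
    val mat = new RowMatrix(meanCentredRdd)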


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22802: [SPARK-25806][SQL]The instance of FileSplit is redundant

2018-10-28 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22802
  
Merged to master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...

2018-10-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22784#discussion_r228724594
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala ---
@@ -54,4 +55,21 @@ class PCASuite extends SparkFunSuite with 
MLlibTestSparkContext {
 // check overflowing
 assert(PCAUtil.memoryCost(4, 6) > Int.MaxValue)
   }
+
+  test("number of features more than 65535") {
+val rows = 10
+val columns = 10
+val k = 5
+val randomRDD = RandomRDDs.normalVectorRDD(sc, rows, columns, 0, 0)
+val pca = new PCA(k).fit(randomRDD)
+assert(pca.explainedVariance.size === 5)
+assert(pca.pc.numRows === 10 && pca.pc.numCols === 5)
+// Eigen values should not be negative
+assert(!pca.explainedVariance.values.exists(_ < 0))
+
+// Norm of the principle component should be 1.0
--- End diff --

Nit: principle -> principal


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...

2018-10-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22784#discussion_r228724541
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -384,18 +384,28 @@ class RowMatrix @Since("1.0.0") (
 val n = numCols().toInt
 require(k > 0 && k <= n, s"k = $k out of range (0, n = $n]")
 
-val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]]
+if (n > 65535) {
+  val svd = computeSVD(k)
+  val s = svd.s.toArray.map(eigValue => eigValue * eigValue / (n - 1))
--- End diff --

Right, makes sense.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...

2018-10-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22784#discussion_r228724515
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -49,7 +50,16 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
   "Try reducing the parameter k for PCA, or reduce the input feature " 
+
   "vector dimension to make this tractable.")
 
-val mat = new RowMatrix(sources)
+val mat = if (numFeatures > 65535) {
+  val meanVector = Statistics.colStats(sources).mean
--- End diff --

Rather than calling `.toArray` and `.zipped`, can this not be written as 
Vector - Vector in the loop below? Might be more efficient.
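
To illustrate, a hypothetical before/after for the centering step (names are 
illustrative, not the exact PR code; `asBreeze`/`fromBreeze` are private[spark], 
so this is for code inside Spark itself):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    val meanVector = Statistics.colStats(sources).mean
    // Before: materialize both vectors as arrays and zip them element by element.
    val centredViaArrays = sources.map { row =>
      Vectors.dense((row.toArray, meanVector.toArray).zipped.map(_ - _))
    }
    // After: subtract the vectors directly through their Breeze views.
    val centredViaBreeze = sources.map { row =>
      Vectors.fromBreeze(row.asBreeze - meanVector.asBreeze)
    }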


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...

2018-10-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22784#discussion_r228724667
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala ---
@@ -54,4 +55,14 @@ class PCASuite extends SparkFunSuite with 
MLlibTestSparkContext {
 // check overflowing
 assert(PCAUtil.memoryCost(4, 6) > Int.MaxValue)
   }
+
+  test("number of features more than 65500") {
+val rows = 10
+val columns = 10
+val k = 5
+val randomRDD = RandomRDDs.normalVectorRDD(sc, rows, columns, 0, 0)
+val pca = new PCA(k).fit(randomRDD)
+assert(pca.explainedVariance.size === 5)
+assert(pca.pc.numRows === 10 && pca.pc.numCols === 5)
--- End diff --

Is there an easy dummy test case we can write where we know what the first 
PC should be? Like, if you generate a bunch of vectors like (a +/- epsilon, a 
+/- epsilon, ...) for many a, the principal component should be nearly 
(1, 1, 1, ...), right? Is that easy enough to add as a trivial test of the actual 
analysis? I think that would really prove it, though your manual test suggests 
it's working.
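
Something like this is what I have in mind (a sketch only, with a small 
dimension and hypothetical names; `sc` comes from MLlibTestSparkContext in the 
suite, and `PCA` is the class under test):

    import scala.util.Random
    import org.apache.spark.mllib.linalg.Vectors

    test("first principal component of near-constant rows") {
      val d = 10
      val rng = new Random(42)
      // Each row is a * (1, 1, ..., 1) plus tiny noise, so nearly all of the
      // variance lies along the direction (1, 1, ..., 1) / sqrt(d).
      val data = sc.parallelize((1 to 1000).map { _ =>
        val a = rng.nextGaussian() * 10.0
        Vectors.dense(Array.fill(d)(a + rng.nextGaussian() * 1e-3))
      })
      val pca = new PCA(1).fit(data)
      val expected = 1.0 / math.sqrt(d)
      // The direction is only defined up to sign, so compare absolute values.
      val firstPC = (0 until d).map(i => math.abs(pca.pc(i, 0)))
      assert(firstPC.forall(v => math.abs(v - expected) < 1e-3))
    }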


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...

2018-10-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22784#discussion_r228724555
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala ---
@@ -54,4 +55,21 @@ class PCASuite extends SparkFunSuite with 
MLlibTestSparkContext {
 // check overflowing
 assert(PCAUtil.memoryCost(4, 6) > Int.MaxValue)
   }
+
+  test("number of features more than 65535") {
+val rows = 10
+val columns = 10
+val k = 5
+val randomRDD = RandomRDDs.normalVectorRDD(sc, rows, columns, 0, 0)
+val pca = new PCA(k).fit(randomRDD)
+assert(pca.explainedVariance.size === 5)
+assert(pca.pc.numRows === 10 && pca.pc.numCols === 5)
+// Eigen values should not be negative
+assert(!pca.explainedVariance.values.exists(_ < 0))
--- End diff --

You can write `.forall(_ >= 0)` too, but it doesn't matter


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...

2018-10-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22784#discussion_r228714200
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -384,18 +384,28 @@ class RowMatrix @Since("1.0.0") (
 val n = numCols().toInt
 require(k > 0 && k <= n, s"k = $k out of range (0, n = $n]")
 
-val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]]
+if (n > 65535) {
+  val svd = computeSVD(k)
+  val s = svd.s.toArray.map(eigValue => eigValue * eigValue / (n - 1))
--- End diff --

My linear algebra is probably rusty, so check me here, but we need the 
eigendecomposition of the covariance matrix 1/(n-1) * mat' * mat. I get how the 
singular values here give these eigenvalues. Don't the singular vectors V need 
to be divided by n-1 too? I'm probably wrong about it, just checking, especially 
as we don't actually have tests for any of the output here!
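
Writing it out to check myself (using N for the number of rows of the centered 
data matrix A, so the covariance is A^T A / (N-1)):

    A = U \Sigma V^\top
    \quad\Longrightarrow\quad
    \frac{1}{N-1} A^\top A
      = \frac{1}{N-1} V \Sigma U^\top U \Sigma V^\top
      = V \, \frac{\Sigma^2}{N-1} \, V^\top

So the eigenvalues of the covariance matrix are \sigma_i^2 / (N-1), but the 
eigenvectors are the columns of V unchanged; they are unit-norm, so no 1/(N-1) 
factor would apply to them.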


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...

2018-10-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22784#discussion_r228713925
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala ---
@@ -54,4 +55,14 @@ class PCASuite extends SparkFunSuite with 
MLlibTestSparkContext {
 // check overflowing
 assert(PCAUtil.memoryCost(4, 6) > Int.MaxValue)
   }
+
+  test("number of features more than 65500") {
+val rows = 10
+val columns = 10
+val k = 5
+val randomRDD = RandomRDDs.normalVectorRDD(sc, rows, columns, 0, 0)
+val pca = new PCA(k).fit(randomRDD)
+assert(pca.explainedVariance.size === 5)
+assert(pca.pc.numRows === 10 && pca.pc.numCols === 5)
--- End diff --

I wonder if there's any reasonable way to check the answer here, like some 
bounds on what the eigenvalues/eigenvectors should be? Like the eigenvalues 
would at least be positive?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...

2018-10-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22784#discussion_r228713878
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala ---
@@ -54,4 +55,14 @@ class PCASuite extends SparkFunSuite with 
MLlibTestSparkContext {
 // check overflowing
 assert(PCAUtil.memoryCost(4, 6) > Int.MaxValue)
   }
+
+  test("number of features more than 65500") {
--- End diff --

65535?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22849: [SPARK-25852][Core] we should filter the workOffe...

2018-10-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22849#discussion_r228713414
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 ---
@@ -240,7 +240,7 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
   val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
 // Filter out executors under killing
 val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
-val workOffers = activeExecutors.map {
+val workOffers = activeExecutors.filter(_._2.freeCores > 0).map {
--- End diff --

I don't know this code well, but the comment above implies that it means to 
make an offer on all executors?
What is the performance impact anyway?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22813: [SPARK-25818][CORE] WorkDirCleanup should only remove th...

2018-10-27 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22813
  
I don't think that's a reasonable usage scenario. That said, is there any 
harm in this change? Would it ever miss cleaning up something that it should?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22860: Branch 2.4

2018-10-27 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22860
  
@sarojchand close this


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22859: Branch 2.2

2018-10-27 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22859
  
@sarojchand close this


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22425: [SPARK-23367][Build] Include python document style check...

2018-10-27 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22425
  
Merged to master. If anyone feels like addressing the style warnings, some of 
which may be easy to fix, go ahead.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22425: [SPARK-23367][Build] Include python document styl...

2018-10-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22425#discussion_r228712794
  
--- Diff: dev/lint-python ---
@@ -99,6 +104,29 @@ else
 echo "flake8 checks passed."
 fi
 
+# Check python document style, skip check if pydocstyle is not installed.
--- End diff --

Actually, it seems like the indentation is inconsistent in this file, so let's 
leave it for now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22850: [MINOR][DOC] Fix comment error of HiveUtils

2018-10-27 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22850
  
Merged to master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22815: [SPARK-25821][SQL] Remove SQLContext methods deprecated ...

2018-10-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22815
  
Merged to master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22848: [SPARK-25851][SQL][MINOR] Fix deprecated API warning in ...

2018-10-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22848
  
Merged to master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22852: [SPARK-25023] Clarify Spark security documentation

2018-10-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22852
  
I don't feel strongly about it; go ahead.

If someone lands on this page, do they pretty easily come away with the 
impression that they need to set spark.authenticate and network security if they 
care about security? If so, great. If it's just adding to the text they 
might skip over, maybe revise it. That's how I think about it.

I think you can make edits for Mesos and K8S here too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22854: [SPARK-25854] fix mvn to not always exit 1

2018-10-26 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22854#discussion_r228583684
  
--- Diff: build/mvn ---
@@ -163,8 +163,19 @@ export MAVEN_OPTS=${MAVEN_OPTS:-"$_COMPILE_JVM_OPTS"}
 
 echo "Using \`mvn\` from path: $MVN_BIN" 1>&2
 
-# Last, call the `mvn` command as usual
+# call the `mvn` command as usual
 "${MVN_BIN}" -DzincPort=${ZINC_PORT} "$@"
 
-# Try to shut down zinc explicitly
-"${ZINC_BIN}" -shutdown -port ${ZINC_PORT}
+# check to see if zinc server is still running post-build
+"${ZINC_BIN}" -status -port ${ZINC_PORT} &> /dev/null
+ZINC_STATUS=$?
+
+# Try to shut down zinc explicitly if the server is still running
+if [ $ZINC_STATUS -eq 0 ]; then
+  # zinc is still running!
+  "${ZINC_BIN}" -shutdown -port ${ZINC_PORT}
+  exit 0
--- End diff --

I suppose you just want to exit 0 outside the if-else, for clarity, but 
whatever. This is all fine and can be back-ported to 2.2.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22802: [SPARK-25806][SQL]The instance of FileSplit is redundant...

2018-10-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22802
  
OK, that's pretty trivial, but I agree. Any other instances of this type of 
thing?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22843: [SPARK-16693][SPARKR] Remove methods deprecated

2018-10-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22843
  
I personally am OK with removing anything deprecated before 2.4. Even 
things deprecated in 2.4 are technically fair game. If any of these were 
deprecated pretty recently, it might be debatable, but I am in favor myself.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22826: [SPARK-25760][DOCS][FOLLOWUP] Add note about AddJar retu...

2018-10-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22826
  
Merged to master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22815: [SPARK-25821][SQL] Remove SQLContext methods depr...

2018-10-26 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22815#discussion_r228554831
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -54,6 +54,7 @@ import org.apache.spark.sql.util.ExecutionListenerManager
  * @groupname Ungrouped Support functions for language integrated queries
  * @since 1.0.0
  */
+@deprecated("Use SparkSession instead", "3.0.0")
--- End diff --

One more very late question for 2.4 -- @rxin, is it even better to deprecate 
this right now in 2.4.0? And in Python and R?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22848: [SPARK-25851][SQL][MINOR] Fix deprecated API warn...

2018-10-26 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22848#discussion_r228553964
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala ---
@@ -89,12 +89,12 @@ private class LongLongTupleConverter extends 
Converter[(Object, Object), (Long,
   }
 
   override def getInputType(typeFactory: TypeFactory): JavaType = {
-val objectType = typeFactory.uncheckedSimpleType(classOf[Object])
-typeFactory.constructSimpleType(classOf[(_, _)], classOf[(_, _)], 
Array(objectType, objectType))
+val objectType = typeFactory.constructType(classOf[Object])
+typeFactory.constructSimpleType(classOf[(_, _)], Array(objectType, 
objectType))
   }
 
   override def getOutputType(typeFactory: TypeFactory): JavaType = {
-val longType = typeFactory.uncheckedSimpleType(classOf[Long])
-typeFactory.constructSimpleType(classOf[(_, _)], classOf[(_, _)], 
Array(longType, longType))
+val longType = typeFactory.constructType(classOf[(Long)])
--- End diff --

Are the parentheses around `Long` needed?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22852: [SPARK-25023] Clarify Spark security documentation

2018-10-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22852
  
I get it, the "it's your responsibility" stance, and it is. For any risk 
there's a sentence in this doc we could point to and say, "see, told you". If 
we're going to make a change here, adding another paragraph saying "below, we told 
you so" isn't additive, especially if we're trying to use this change to 
actively mitigate security issues. More useful is a cheat-sheet or TL;DR simply 
enumerating the top things you don't want to miss. I think it's more useful 
than redundant. I can take a crack at that too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22723: [SPARK-25729][CORE]It is better to replace `minPartition...

2018-10-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22723
  
Kind of. As you say, it lets a particular `InputFormat` decide this. But it 
still uses minPartitions as input. See again 
https://issues.apache.org/jira/browse/SPARK-22357 for some subtlety, and that 
much I agree with. I think that's the substance of your change too. But then, 
if that's true, isn't this a no-op for the case you're interested in? It does 
affect other `InputFormat`s, then, but I'm not clear that's valid. Or then: 
why make this change in just one place?
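
To make concrete what I mean by "it still uses minPartitions as input" (a 
sketch, not code from this PR; the Hadoop mapred API calls are real, but the 
format, path, and variable names below are hypothetical): HadoopRDD ends up 
calling inputFormat.getSplits(jobConf, minPartitions), so minPartitions is only 
the numSplits hint of the old mapred API, and a particular format is free to 
ignore it, e.g.:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // A format that always produces one split per file, whatever hint it gets.
    class OneSplitPerFileTextInputFormat extends TextInputFormat {
      override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
    }

    // minPartitions is passed through as the numSplits hint, but this format ignores it.
    val rdd = sc.hadoopFile(
      "/path/to/input",
      classOf[OneSplitPerFileTextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      minPartitions = 8)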


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


