[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-12-10 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22683
  
In this case the change is simpler to understand in prose, I think; "100 
KB" becomes "97.6 KiB", etc.


---




[GitHub] spark pull request #23275: [SPARK-26323][SQL] Scala UDF should still check i...

2018-12-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23275#discussion_r240234573
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
 ---
@@ -47,25 +47,13 @@ case class ScalaUDF(
 function: AnyRef,
 dataType: DataType,
 children: Seq[Expression],
-inputsNullSafe: Seq[Boolean],
-inputTypes: Seq[DataType] = Nil,
+@transient inputsNullSafe: Seq[Boolean],
+@transient inputTypes: Seq[AbstractDataType] = Nil,
 udfName: Option[String] = None,
 nullable: Boolean = true,
 udfDeterministic: Boolean = true)
   extends Expression with ImplicitCastInputTypes with NonSQLExpression 
with UserDefinedExpression {
 
-  // The constructor for SPARK 2.1 and 2.2
--- End diff --

I'm OK removing it even without a formal deprecation; these versions are EOL


---




[GitHub] spark pull request #23275: [SPARK-26323][SQL] Scala UDF should still check i...

2018-12-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23275#discussion_r240234371
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
 ---
@@ -47,25 +47,13 @@ case class ScalaUDF(
 function: AnyRef,
 dataType: DataType,
 children: Seq[Expression],
-inputsNullSafe: Seq[Boolean],
-inputTypes: Seq[DataType] = Nil,
+@transient inputsNullSafe: Seq[Boolean],
--- End diff --

What was the need for this one? Does this object get caught in a closure somewhere?


---




[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...

2018-12-09 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23260
  
Ok, got it. @vanzin or @squito or others would be better able to evaluate.


---




[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-09 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23072
  
@dongjoon-hyun @felixcheung how about now?


---




[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...

2018-12-09 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23260
  
If you're on YARN, this feels like something you would manage via YARN and 
its cluster management options. Is there a specific use case here, that this 
has to happen in Spark?


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-09 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23263
  
My first impression is that it's a big change, which is reason for caution 
here.

Visualizing a workflow is nice, but Spark's Pipelines are typically pretty 
straightforward and linear. I could imagine producing a nicer visualization 
than what you get from reading the Spark UI, although of course we already have 
some degree of history and data there.

These are just the hooks, right? Someone would have to implement something 
to use these events. I see the value in the API to some degree, but with no 
concrete implementation, does it add anything for Spark users out of the box?

It seems like the history this generates would belong in the history 
server, although that already has a pretty particular purpose, storing granular 
history of events in Spark. Is that what someone would likely do? Or would 
someone likely have to run Atlas to use this? If that's a good example of the 
use case, and Atlas is really about lineage and governance, is that the thrust 
of this change: to help with model lineage and reproducibility?

It's good that the API changes little, though it does change a bit.

I think I mostly have questions right now.


---




[GitHub] spark issue #23241: [SPARK-26283][CORE] Enable reading from open frames of z...

2018-12-09 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23241
  
Merged to master


---




[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-12-08 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22683
  
Fortunately the syntax is "100m", which has always meant "100 * 1024 * 
1024" or "100 MiB"


---




[GitHub] spark pull request #23264: [SPARK-26266][BUILD] Update to Scala 2.12.8 (bran...

2018-12-08 Thread srowen
Github user srowen closed the pull request at:

https://github.com/apache/spark/pull/23264


---




[GitHub] spark pull request #23264: Update to Scala 2.12.8

2018-12-08 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/23264

Update to Scala 2.12.8

## What changes were proposed in this pull request?

Back-port of https://github.com/apache/spark/pull/23218 ; updates Scala 
2.12 build to 2.12.8

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-26266.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23264.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23264


commit 4427e9f183f5f0aae7e32643508b8a3b1c9bf234
Author: Sean Owen 
Date:   2018-12-08T12:09:30Z

Update to Scala 2.12.8




---




[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-08 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23218
  
Merged to master. I'll open a separate PR for branch-2.4


---




[GitHub] spark pull request #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-08 Thread srowen
Github user srowen closed the pull request at:

https://github.com/apache/spark/pull/23218


---




[GitHub] spark pull request #23225: [SPARK-26287][CORE]Don't need to create an empty ...

2018-12-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23225#discussion_r239990036
  
--- Diff: 
core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -161,6 +161,10 @@ private void writeSortedFile(boolean isLastFile) {
 final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
   inMemSorter.getSortedIterator();
 
+// If there are no sorted records, we don't need to create an empty spill file.
+if (!sortedRecords.hasNext()) {
+  return;
+}
--- End diff --

If you're going to short-circuit, why not do it at the start of the 
function and save the rest of the work done above?
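
A generic sketch of that shape (plain Scala rather than the actual sorter code):

```
// Check the empty case up front, before any of the preparatory work is done.
def writeSortedFile(records: Iterator[String]): Unit = {
  if (!records.hasNext) {
    return  // nothing buffered, so no spill file and no other setup is needed
  }
  // ... allocate write buffers, create the spill file, write the records ...
}
```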


---




[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...

2018-12-07 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23256
  
CC @felixcheung 


---




[GitHub] spark issue #23160: [SPARK-26196][SPARK-26281][WebUI] Total tasks title in t...

2018-12-07 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23160
  
Merged to master. @shahidki31 does this need to go in branch 2.4, 2.3? I 
tried back-porting it, but it looks like a lot of the affected code didn't exist 
in 2.4. If the fix can or should also be back-ported and you're willing, you're 
welcome to open another PR against 2.4.


---




[GitHub] spark issue #23247: [SPARK-26294][CORE]Delete Unnecessary If statement

2018-12-07 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23247
  
Merged to master


---




[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...

2018-12-07 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/17673
  
@ngopal this one can't be merged as-is, and it looks like it was abandoned. 
Would you like to take over this PR and update it per the reviews? I'd review that. 
I think CBOW could be useful in MLlib.


---




[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-07 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23218
  
I'm not sure. I can't find any other reference to this crash with 2.12.8. It 
could be something only Spark happens to trigger, or it could be specific to this 
JVM + platform but not Spark or Scala. We could at least add a release note 
recommending the latest version of Java 8 (and 11) with Spark 2.4 / 3.0.


---




[GitHub] spark issue #23241: [SPARK-26283][CORE] Enable reading from open frames of z...

2018-12-06 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23241
  
Sorry about the run-around. I'm OK being conservative here as you were 
originally, too.


---




[GitHub] spark issue #23246: [SPARK-26292][CORE]Assert statement of currentPage may b...

2018-12-06 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23246
  
From the description, it's not clear that this is where the change should be. 
Please review https://spark.apache.org/contributing.html. This one should be closed.


---




[GitHub] spark issue #23247: [SPARK-26294][CORE]Delete Unnecessary If statement

2018-12-06 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23247
  
These aren't worth the time it takes us to review and merge them, 
honestly. A little cleanup can be OK if it makes an appreciable difference in 
speed or readability, and if you can find many instances of the same issue.


---




[GitHub] spark pull request #23241: [SPARK-26283][CORE] Enable reading from open fram...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23241#discussion_r239590339
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala ---
@@ -118,8 +118,6 @@ private[spark] class ReplayListenerBus extends 
SparkListenerBus with Logging {
   case e: HaltReplayException =>
 // Just stop replay.
   case _: EOFException if maybeTruncated =>
-  case _: IOException if maybeTruncated =>
--- End diff --

OK, this could be fine, if this was really added only to address what you are 
fixing here.


---




[GitHub] spark pull request #23241: [SPARK-26283][CORE] Enable reading from open fram...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23241#discussion_r239532748
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala ---
@@ -118,10 +118,12 @@ private[spark] class ReplayListenerBus extends 
SparkListenerBus with Logging {
   case e: HaltReplayException =>
 // Just stop replay.
   case _: EOFException if maybeTruncated =>
-  case _: IOException if maybeTruncated =>
-logWarning(s"Failed to read Spark event log: $sourceName")
   case ioe: IOException =>
-throw ioe
+if (maybeTruncated) {
--- End diff --

Oh I see. Actually, what about just removing the second case? It's simpler 
to just let it throw.


---




[GitHub] spark pull request #23241: [SPARK-26283][CORE] Enable reading from open fram...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23241#discussion_r239525888
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala ---
@@ -118,10 +118,12 @@ private[spark] class ReplayListenerBus extends 
SparkListenerBus with Logging {
   case e: HaltReplayException =>
 // Just stop replay.
   case _: EOFException if maybeTruncated =>
-  case _: IOException if maybeTruncated =>
-logWarning(s"Failed to read Spark event log: $sourceName")
   case ioe: IOException =>
-throw ioe
+if (maybeTruncated) {
--- End diff --

I think this was already the behavior? If it doesn't match the 'if', it 
would just throw anyway.


---




[GitHub] spark issue #23202: [SPARK-26248][SQL] Infer date type from CSV

2018-12-06 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23202
  
I'd defer to @HyukjinKwon ; looks OK in broad strokes but he would know 
much more about the CSV parsing.


---




[GitHub] spark pull request #23241: [SPARK-26283][CORE] Enable reading from open fram...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23241#discussion_r239509724
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -197,4 +201,8 @@ class ZStdCompressionCodec(conf: SparkConf) extends 
CompressionCodec {
 // avoid overhead excessive of JNI call while trying to uncompress 
small amount of data.
 new BufferedInputStream(new ZstdInputStream(s), bufferSize)
   }
+
+  override def zstdEventLogCompressedInputStream(s: InputStream): 
InputStream = {
+new BufferedInputStream(new ZstdInputStream(s).setContinuous(true), 
bufferSize)
--- End diff --

That's what I'm wondering about. Is it actually desirable to not fail on a 
partial frame? I'm not sure. We *shouldn't* encounter it elsewhere.

This changes a developer API, but may not even be a breaking change as 
there is a default implementation. We can take breaking changes in Spark 3 
though.

I think I agree with your approach here in the end.


---




[GitHub] spark pull request #22683: [SPARK-25696] The storage memory displayed on spa...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22683#discussion_r239482033
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/NullExpressionsSuite.scala
 ---
@@ -160,7 +160,7 @@ class NullExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 checkEvaluation(AtLeastNNonNulls(4, nullOnly), false, EmptyRow)
   }
 
-  test("Coalesce should not throw 64kb exception") {
+  test("Coalesce should not throw 64kib exception") {
--- End diff --

Nit: 64 KiB


---




[GitHub] spark pull request #22683: [SPARK-25696] The storage memory displayed on spa...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22683#discussion_r239480795
  
--- Diff: 
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -62,14 +62,14 @@ class KryoSerializer(conf: SparkConf)
 
   if (bufferSizeKb >= ByteUnit.GiB.toKiB(2)) {
 throw new IllegalArgumentException("spark.kryoserializer.buffer must 
be less than " +
-  s"2048 mb, got: + ${ByteUnit.KiB.toMiB(bufferSizeKb)} mb.")
+  s"2048 mib, got: + ${ByteUnit.KiB.toMiB(bufferSizeKb)} mib.")
--- End diff --

Nit: mib -> MiB. (I know it was 'mb' before but that's not really the 
abbreviation for 1 million bytes)


---




[GitHub] spark pull request #22683: [SPARK-25696] The storage memory displayed on spa...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22683#discussion_r239481508
  
--- Diff: docs/sql-programming-guide.md ---
@@ -4,10 +4,15 @@ displayTitle: Spark SQL, DataFrames and Datasets Guide
 title: Spark SQL and DataFrames
 ---
 
+* This will become a table of contents (this text will be scraped).
--- End diff --

The change to this file looks unrelated. Could you revert it?


---




[GitHub] spark pull request #23241: [SPARK-26283][CORE] Enable reading from open fram...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23241#discussion_r239476570
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -197,4 +201,8 @@ class ZStdCompressionCodec(conf: SparkConf) extends 
CompressionCodec {
 // avoid overhead excessive of JNI call while trying to uncompress 
small amount of data.
 new BufferedInputStream(new ZstdInputStream(s), bufferSize)
   }
+
+  override def zstdEventLogCompressedInputStream(s: InputStream): 
InputStream = {
+new BufferedInputStream(new ZstdInputStream(s).setContinuous(true), 
bufferSize)
--- End diff --

BTW it seems like 'continuous' changes behavior very little: 
https://github.com/luben/zstd-jni/blob/master/src/main/java/com/github/luben/zstd/ZstdInputStream.java#L147
I agree with your concern to keep the change minimal. I'm trying to think whether 
this would break anything if everything were read as 'continuous'. Would it not 
fail fast in some cases?


---




[GitHub] spark pull request #23241: [SPARK-26283][CORE] Enable reading from open fram...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23241#discussion_r239476672
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala ---
@@ -118,8 +118,6 @@ private[spark] class ReplayListenerBus extends 
SparkListenerBus with Logging {
   case e: HaltReplayException =>
 // Just stop replay.
   case _: EOFException if maybeTruncated =>
-  case _: IOException if maybeTruncated =>
--- End diff --

Can this still happen for non-zstd compression though?


---




[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22759#discussion_r239474962
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala
 ---
@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest 
with SharedSQLContext with Be
 }
   }
 
+  test("parquet - column nullability -- write only") {
+val schema = StructType(
+  StructField("cl1", IntegerType, nullable = false) ::
--- End diff --

Nit: could we indent these at the same level?


---




[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22759#discussion_r239475332
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala
 ---
@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest 
with SharedSQLContext with Be
 }
   }
 
+  test("parquet - column nullability -- write only") {
+val schema = StructType(
+  StructField("cl1", IntegerType, nullable = false) ::
+StructField("cl2", IntegerType, nullable = true) :: Nil)
+val row = Row(3, 4)
+val df = spark.createDataFrame(sparkContext.parallelize(row :: Nil), 
schema)
+
+withTempPath { dir =>
+  val path = dir.getAbsolutePath
+  df.write.mode("overwrite").parquet(path)
+  val file = SpecificParquetRecordReaderBase.listDirectory(dir).get(0)
+
+  val hadoopInputFile = HadoopInputFile.fromPath(new Path(file), new 
Configuration())
+  val f = ParquetFileReader.open(hadoopInputFile)
+  val parquetSchema = f.getFileMetaData.getSchema.getColumns.asScala
+  .map(_.getPrimitiveType)
+  f.close
+
+  // the write keeps nullable info from the schema
+  val expectedParquetSchema: Seq[PrimitiveType] = Seq(
--- End diff --

Also, this really doesn't matter, but you can simplify the code by omitting 
explicit types like this, etc.


---




[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22759#discussion_r239475203
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala
 ---
@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest 
with SharedSQLContext with Be
 }
   }
 
+  test("parquet - column nullability -- write only") {
+val schema = StructType(
+  StructField("cl1", IntegerType, nullable = false) ::
+StructField("cl2", IntegerType, nullable = true) :: Nil)
+val row = Row(3, 4)
+val df = spark.createDataFrame(sparkContext.parallelize(row :: Nil), 
schema)
+
+withTempPath { dir =>
+  val path = dir.getAbsolutePath
+  df.write.mode("overwrite").parquet(path)
+  val file = SpecificParquetRecordReaderBase.listDirectory(dir).get(0)
+
+  val hadoopInputFile = HadoopInputFile.fromPath(new Path(file), new 
Configuration())
+  val f = ParquetFileReader.open(hadoopInputFile)
+  val parquetSchema = f.getFileMetaData.getSchema.getColumns.asScala
+  .map(_.getPrimitiveType)
+  f.close
+
+  // the write keeps nullable info from the schema
+  val expectedParquetSchema: Seq[PrimitiveType] = Seq(
+new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.INT32, 
"cl1"),
+new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.INT32, 
"cl2")
+  )
+
+  assert (expectedParquetSchema == parquetSchema)
--- End diff --

Nit: I think ideally you use the `===` test operator, so that failures 
generate a better message.


---




[GitHub] spark pull request #23241: [SPARK-26283][CORE]When zstd compression enabled,...

2018-12-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23241#discussion_r239218209
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -197,4 +201,8 @@ class ZStdCompressionCodec(conf: SparkConf) extends 
CompressionCodec {
 // avoid overhead excessive of JNI call while trying to uncompress 
small amount of data.
 new BufferedInputStream(new ZstdInputStream(s), bufferSize)
   }
+
+  override def zstdEventLogCompressedInputStream(s: InputStream): 
InputStream = {
+new BufferedInputStream(new ZstdInputStream(s).setContinuous(true), 
bufferSize)
--- End diff --

Yeah, so this new "partial file" method can call the existing method by 
default, and do something different for zstd. Then this one call site can ask 
for the 'partial file' stream. Some comments about the difference here would be 
helpful.
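
A rough sketch of that shape (the method name and details here are illustrative, not necessarily what gets merged):

```
import java.io.{BufferedInputStream, InputStream}
import com.github.luben.zstd.ZstdInputStream

trait CompressionCodec {
  def compressedInputStream(s: InputStream): InputStream

  // For readers of files that may still be open / partially written,
  // e.g. in-progress event logs. By default it is just the normal stream.
  def compressedContinuousInputStream(s: InputStream): InputStream =
    compressedInputStream(s)
}

class ZStdCompressionCodec(bufferSize: Int) extends CompressionCodec {
  override def compressedInputStream(s: InputStream): InputStream =
    new BufferedInputStream(new ZstdInputStream(s), bufferSize)

  override def compressedContinuousInputStream(s: InputStream): InputStream =
    // setContinuous(true) lets reads proceed on a frame that is not closed yet.
    new BufferedInputStream(new ZstdInputStream(s).setContinuous(true), bufferSize)
}
```

Then the one ReplayListenerBus call site asks for the continuous variant, and the `isInstanceOf` check goes away.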


---




[GitHub] spark pull request #23228: [MINOR][DOC]The condition description of serializ...

2018-12-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23228#discussion_r239215561
  
--- Diff: 
core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala ---
@@ -33,10 +33,10 @@ import org.apache.spark.shuffle._
  * Sort-based shuffle has two different write paths for producing its map 
output files:
  *
  *  - Serialized sorting: used when all three of the following conditions 
hold:
- *1. The shuffle dependency specifies no aggregation or output 
ordering.
+ *1. The shuffle dependency specifies no map-side combine.
--- End diff --

Does this sound right @JoshRosen ?


---




[GitHub] spark pull request #23241: [SPARK-26283][CORE]When zstd compression enabled,...

2018-12-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23241#discussion_r239208072
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -197,4 +201,8 @@ class ZStdCompressionCodec(conf: SparkConf) extends 
CompressionCodec {
 // avoid overhead excessive of JNI call while trying to uncompress 
small amount of data.
 new BufferedInputStream(new ZstdInputStream(s), bufferSize)
   }
+
+  override def zstdEventLogCompressedInputStream(s: InputStream): 
InputStream = {
+new BufferedInputStream(new ZstdInputStream(s).setContinuous(true), 
bufferSize)
--- End diff --

What happens if you set continuous for everything? Would it work in all 
cases? It kind of looks like zstd always uses this in the code path below 
anyway.

I think that if we introduce a new method we might try to make it a little 
more general, like: compressedInputStreamForPartialFile or something. It would 
be good to avoid the isInstanceOf below.


---




[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...

2018-12-05 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23231
  
I'm not seeing it in the migration guide; maybe I'm missing it. In any 
event, I don't think we need to keep this for 3.0.


---




[GitHub] spark issue #23229: [MINOR][CORE] Modify some field name because it may be c...

2018-12-05 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23229
  
Agree, this isn't worthwhile.


---




[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r239068840
  
--- Diff: 
sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 ---
@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest 
with BeforeAndAfter {
   override def beforeAll() {
 super.beforeAll()
 TestHive.setCacheTables(true)
-// Timezone is fixed to America/Los_Angeles for those timezone 
sensitive tests (timestamp_*)
-TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+// Timezone is fixed to GMT for those timezone sensitive tests 
(timestamp_*)
--- End diff --

I think consistency is indeed a problem, but why disable the new parser, 
rather than make this consistent? I haven't looked into whether there's a good 
reason they behave differently but suspect not.


---




[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-05 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23218
  
Ah OK, so all of them were a JVM crash. It would probably be a good idea to 
update the JVM on all the workers as _60 is over 3 years old. It's probably not 
as simple as it sounds but WDYT @shaneknapp ?


---




[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23218
  
Hm, one failure was due to a JVM crash, but it has failed twice consistently, with 
sbt just exiting with status 134. No other failures are logged. Not sure what 
to make of that!


---




[GitHub] spark issue #23216: [SPARK-26264][CORE]It is better to add @transient to fie...

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23216
  
I think just leave it. The `@transient` on `ShuffleMapTask`'s `locs` is 
just superfluous here; I'm not sure it's worth changing.


---




[GitHub] spark issue #23216: [SPARK-26264][CORE]It is better to add @transient to fie...

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23216
  
Are you sure it's even a field in the class? it looks like it's only used 
to define this:

```
  @transient private[this] val preferredLocs: Seq[TaskLocation] = {
if (locs == null) Nil else locs.toSet.toSeq
  }
```

I'd expect Scala would not generate a field. Indeed the thing it is used to 
make is transient.
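
A standalone way to check that expectation (class and member names here are made up for illustration):

```
// A constructor parameter used only at construction time shouldn't need to
// be retained, so no backing field should be generated for it.
class Demo(locs: Seq[String]) extends Serializable {
  @transient private[this] val preferredLocs: Seq[String] =
    if (locs == null) Nil else locs.distinct
}

object Demo extends App {
  // Expect this to print only "preferredLocs": no field backs `locs`.
  classOf[Demo].getDeclaredFields.foreach(f => println(f.getName))
}
```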


---




[GitHub] spark pull request #23159: [SPARK-26191][SQL] Control truncation of Spark pl...

2018-12-04 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23159#discussion_r238869530
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1777,7 +1777,7 @@ class Analyzer(
 
   case p if p.expressions.exists(hasGenerator) =>
 throw new AnalysisException("Generators are not supported outside 
the SELECT clause, but " +
-  "got: " + p.simpleString)
+  "got: " + p.simpleString((SQLConf.get.maxToStringFields)))
--- End diff --

Nit: are there extra parens here?


---




[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23159
  
Rather than change every single call to this method, if this should 
generally be the value of the argument, then why not make it the default value 
or something?
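
A generic illustration of that alternative (these names are made up, not the actual TreeNode/SQLConf API):

```
object Truncation {
  // Stand-in for a config-backed default such as maxToStringFields.
  var maxToStringFields: Int = 25

  // Giving the new parameter a default value keeps existing call sites unchanged.
  def describe(fields: Seq[String], maxFields: Int = maxToStringFields): String =
    if (fields.length > maxFields) {
      fields.take(maxFields).mkString("", ", ", s", ... ${fields.length - maxFields} more fields")
    } else {
      fields.mkString(", ")
    }
}

object Example extends App {
  // Existing callers need no change:
  println(Truncation.describe(Seq("a", "b", "c")))
  // Only callers that really need a different limit pass one explicitly:
  println(Truncation.describe((1 to 100).map(i => s"c$i"), maxFields = 5))
}
```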


---




[GitHub] spark issue #23219: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23219
  
@wangyum I already opened https://github.com/apache/spark/pull/23218


---




[GitHub] spark issue #22759: [MINOR][SQL][DOC] Correct parquet nullability documentat...

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22759
  
Ping @dima-asana to rebase or close


---




[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21363
  
@MaxGekk now that your change is merged, can this proceed, @xuanyuanking? 
Or is it obsolete?


---




[GitHub] spark issue #22997: SPARK-25999: make-distribution.sh failure with --r and -...

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22997
  
Yeah, we can't make this change for the reasons above.


---




[GitHub] spark issue #22887: [SPARK-25880][CORE] user set's hadoop conf should not ov...

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22887
  
@gjhkael can you clarify further what the undesirable behavior is, and what 
behavior you are looking for?


---




[GitHub] spark issue #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support for Scala...

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23098
  
Note I'm holding on to this PR for a while as I understand it might be 
disruptive to downstream builds to remove 2.11 support just now. Will look at 
merging it in weeks. Right now it's an FYI.


---




[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23150
  
Merged to master


---




[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23218
  
Hm, it looks like genjavadoc-plugin is published for individual Scala releases 
and doesn't exist yet for 2.12.8: 
https://mvnrepository.com/artifact/com.typesafe.genjavadoc/genjavadoc-plugin
I'll look at whether we can leave it at 2.12.7 or whether to expect it will be 
released soon.


---




[GitHub] spark issue #23170: [SPARK-24423][FOLLOW-UP][SQL] Fix error example

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23170
  
Merged to master/2.4


---




[GitHub] spark issue #22600: [SPARK-25578][BUILD] Update to Scala 2.12.7

2018-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22600
  
@wangyum sounds good. I opened https://github.com/apache/spark/pull/23218


---




[GitHub] spark pull request #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-04 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/23218

[SPARK-26266][BUILD] Update to Scala 2.12.8

## What changes were proposed in this pull request?

Update to Scala 2.12.8

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-26266

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23218.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23218


commit b667d37e9ee2d8cdce459806925cdc0fe725b7bf
Author: Sean Owen 
Date:   2018-12-04T13:53:21Z

Update to Scala 2.12.8




---




[GitHub] spark issue #23182: Config change followup to [SPARK-26177] Automated format...

2018-12-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23182
  
Merged to master


---




[GitHub] spark issue #23189: [SPARK-26235][Core] Change log level for ClassNotFoundEx...

2018-12-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23189
  
Merged to master


---




[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode

2018-12-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/18784
  
@skonto do you want to proceed with this?


---




[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-12-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22683
  
Add it to this PR. The changes go together logically.


---




[GitHub] spark issue #23205: [SPARK-26253][WebUI] : Task Summary Metrics Table on Sta...

2018-12-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23205
  
Merging as a follow up to https://github.com/apache/spark/pull/21688


---




[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...

2018-12-02 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23164
  
OK merged to 2.4/2.3


---




[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...

2018-12-02 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23088#discussion_r238110710
  
--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala 
---
@@ -148,11 +148,20 @@ private[spark] class AppStatusStore(
 // cheaper for disk stores (avoids deserialization).
 val count = {
   Utils.tryWithResource(
-store.view(classOf[TaskDataWrapper])
-  .parent(stageKey)
-  .index(TaskIndexNames.EXEC_RUN_TIME)
-  .first(0L)
-  .closeableIterator()
+if (store.isInstanceOf[LevelDB]) {
--- End diff --

Does this code path need to be different for disk vs memory? this part 
seemed like it could work efficiently either way.


---




[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...

2018-12-02 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23088#discussion_r238110723
  
--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala 
---
@@ -221,29 +230,49 @@ private[spark] class AppStatusStore(
 // stabilize once the stage finishes. It's also slow, especially with 
disk stores.
 val indices = quantiles.map { q => math.min((q * count).toLong, count 
- 1) }
 
+// TODO Summary metrics needs to display all the successful tasks' 
metrics (SPARK-26119).
--- End diff --

It's not ideal but it's a reasonable solution. Are you OK with it @vanzin ?


---




[GitHub] spark pull request #22683: [SPARK-25696] The storage memory displayed on spa...

2018-12-02 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22683#discussion_r238102080
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -1164,17 +1164,17 @@ private[spark] object Utils extends Logging {
 } else {
   val (value, unit) = {
 if (size >= 2 * EB) {
-  (BigDecimal(size) / EB, "EB")
+  (BigDecimal(size) / EB, "EiB")
--- End diff --

For full consistency, how about modifying the values like EB and PB above? 


---




[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...

2018-12-02 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23164
  
I just mean: is this a bug that comes up otherwise in Spark? Should this be 
back-ported, or is it just supporting the new change you reference? I can merge 
to master at least.


---




[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...

2018-12-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23150
  
It makes sense that parsing depends on a timezone, though that's set as an 
option in the parser typically. The tests should generally test "GMT" for this 
reason. If there's a default code path for when no timezone is specified, then 
I'd use the test harness mechanisms for temporarily changing the system 
timezone to GMT (which then automatically changes back).
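
A minimal sketch of the kind of helper meant here (the name is illustrative; the test utilities would carry their own variant):

```
import java.util.TimeZone

object TimeZoneTestUtils {
  // Temporarily change the JVM default time zone for `body`, always restoring it.
  def withDefaultTimeZone[T](zoneId: String)(body: => T): T = {
    val original = TimeZone.getDefault
    TimeZone.setDefault(TimeZone.getTimeZone(zoneId))
    try body
    finally TimeZone.setDefault(original)
  }
}

// In a test of the "no time zone specified" code path:
// TimeZoneTestUtils.withDefaultTimeZone("GMT") {
//   ... parse the timestamps and assert on the result ...
// }
```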

Your changes look OK here, and they pass, right? Is there another test you 
were unable to add?


---




[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-12-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23173#discussion_r238073218
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
 
   private var univocityGenerator: Option[UnivocityGenerator] = None
 
-  override def write(row: InternalRow): Unit = {
-val gen = univocityGenerator.getOrElse {
-  val charset = Charset.forName(params.charset)
-  val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
-  val newGen = new UnivocityGenerator(dataSchema, os, params)
-  univocityGenerator = Some(newGen)
-  newGen
-}
+  if (params.headerFlag) {
+val gen = getGen()
+gen.writeHeaders()
+  }
 
+  private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+val charset = Charset.forName(params.charset)
+val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
+val newGen = new UnivocityGenerator(dataSchema, os, params)
+univocityGenerator = Some(newGen)
+newGen
+  }
+
+  override def write(row: InternalRow): Unit = {
+val gen = getGen()
--- End diff --

Yeah we have two different approaches, both of which are fine IMHO. I think 
it's reasonable to clean that up in a follow-up if desired. WDYT @HyukjinKwon ?


---




[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors

2018-12-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23162
  
Merged to master


---




[GitHub] spark pull request #23178: [SPARK-26216][SQL] Do not use case class as publi...

2018-12-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23178#discussion_r238062995
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -38,114 +38,106 @@ import org.apache.spark.sql.types.DataType
  * @since 1.3.0
  */
 @Stable
--- End diff --

I'd go ahead and leave the Since version. The API is essentially unchanged, 
though there are some marginal breaking compile time changes. But same is true 
of many things we are changing in 3.0. I've tagged the JIRA with 
`release-notes` and will add a blurb about the change.


---




[GitHub] spark issue #23177: [SPARK-26212][Build][test-maven] Upgrade maven version t...

2018-12-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23177
  
Merging to master. I've been using 3.6.0 on the command line for a while 
and it's fine. Note that if you use your local mvn in IntelliJ, it seems to 
have some incompatibility with the current latest 2018.13.1 release. It's no 
big deal; falling back to its internal 3.3.9 version works fine.


---




[GitHub] spark issue #23185: [MINOR][Docs] Fix typos

2018-11-30 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23185
  
Merged to master


---




[GitHub] spark pull request #23146: [SPARK-26173] [MLlib] Prior regularization for Lo...

2018-11-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23146#discussion_r237888294
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularizationSuite.scala
 ---
@@ -16,9 +16,13 @@
  */
 package org.apache.spark.ml.optim.loss
 
+import org.scalactic.{Equality, TolerantNumerics}
--- End diff --

We have ~== etc for approximate comparison
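
For example (the exact import path is from memory, so treat it as an assumption):

```
import org.apache.spark.ml.util.TestingUtils._

// Inside a SparkFunSuite test body, with explicit tolerances:
assert(1e-9 + 0.3 ~== 0.3 relTol 1e-6)
assert(0.3000001 ~== 0.3 absTol 1e-5)
```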


---




[GitHub] spark pull request #23146: [SPARK-26173] [MLlib] Prior regularization for Lo...

2018-11-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23146#discussion_r237888645
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -250,6 +250,66 @@ private[classification] trait LogisticRegressionParams 
extends ProbabilisticClas
   isSet(lowerBoundsOnIntercepts) || isSet(upperBoundsOnIntercepts)
   }
 
+  /**
+   * The prior multivariate mean (coefficients) for Maximum A Posteriori 
(MAP) optimization.
+   * Default is none.
+   *
+   * @group expertParam */
+  @Since("2.4.0")
+  val priorMean: DoubleArrayParam = new DoubleArrayParam(this, "priorMean",
+  "The prior mean used for Prior regularization.")
--- End diff --

These need continuation indents


---




[GitHub] spark pull request #23146: [SPARK-26173] [MLlib] Prior regularization for Lo...

2018-11-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23146#discussion_r237888349
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularizationSuite.scala
 ---
@@ -16,9 +16,13 @@
  */
 package org.apache.spark.ml.optim.loss
 
+import org.scalactic.{Equality, TolerantNumerics}
+
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.linalg.{BLAS, Vectors}
 
+
--- End diff --

Nit: remove these lines


---




[GitHub] spark pull request #23146: [SPARK-26173] [MLlib] Prior regularization for Lo...

2018-11-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23146#discussion_r237888585
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -250,6 +250,66 @@ private[classification] trait LogisticRegressionParams 
extends ProbabilisticClas
   isSet(lowerBoundsOnIntercepts) || isSet(upperBoundsOnIntercepts)
   }
 
+  /**
+   * The prior multivariate mean (coefficients) for Maximum A Posteriori 
(MAP) optimization.
+   * Default is none.
+   *
+   * @group expertParam */
+  @Since("2.4.0")
--- End diff --

This would have to be 3.0.0


---




[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Reuse withTempDir funct...

2018-11-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23151#discussion_r237886340
  
--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -105,5 +105,16 @@ abstract class SparkFunSuite
   logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n")
 }
   }
-
+  /**
+   * Creates a temporary directory, which is then passed to `f` and will 
be deleted after `f`
+   * returns.
+   *
+   * @todo Probably this method should be moved to a more general place
+   */
+  protected def withCreateTempDir(f: File => Unit): Unit = {
+val dir = Utils.createTempDir()
--- End diff --

Yes shouldn't be necessary here.


---




[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Reuse withTempDir funct...

2018-11-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23151#discussion_r237886193
  
--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -105,5 +105,16 @@ abstract class SparkFunSuite
   logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n")
 }
   }
-
+  /**
+   * Creates a temporary directory, which is then passed to `f` and will 
be deleted after `f`
+   * returns.
+   *
+   * @todo Probably this method should be moved to a more general place
+   */
+  protected def withCreateTempDir(f: File => Unit): Unit = {
--- End diff --

Yes, it seems like we should be able to use an override. The subclass that 
needs to inject an additional method call in the block can call the super 
method with a lambda that calls the user-supplied block and then this other 
method. It's probably worth whatever surgery is needed to make this clean and 
reduce duplication. We already have a lot of "create temp thing" methods all 
over.
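
A rough sketch of that override shape (trait and method names are illustrative, not the exact Spark test traits):

```
import java.io.File
import java.nio.file.Files

trait TempDirSuite {
  // Base version: create the directory, run the block, always clean up.
  protected def withTempDir(f: File => Unit): Unit = {
    val dir = Files.createTempDirectory("spark-test").toFile
    try f(dir) finally deleteRecursively(dir)
  }

  private def deleteRecursively(file: File): Unit = {
    Option(file.listFiles).foreach(_.foreach(deleteRecursively))
    file.delete()
  }
}

trait SQLTempDirSuite extends TempDirSuite {
  // The subclass injects its extra call by wrapping the user-supplied block,
  // instead of duplicating the create/delete logic.
  override protected def withTempDir(f: File => Unit): Unit =
    super.withTempDir { dir =>
      f(dir)
      waitForTasksToFinish()  // the "additional method call" in question
    }

  protected def waitForTasksToFinish(): Unit = ()  // placeholder
}
```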


---




[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-11-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23173#discussion_r237883918
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
 
   private var univocityGenerator: Option[UnivocityGenerator] = None
 
-  override def write(row: InternalRow): Unit = {
-val gen = univocityGenerator.getOrElse {
-  val charset = Charset.forName(params.charset)
-  val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
-  val newGen = new UnivocityGenerator(dataSchema, os, params)
-  univocityGenerator = Some(newGen)
-  newGen
-}
+  private def getOrCreateGen(): UnivocityGenerator = 
univocityGenerator.getOrElse {
+val charset = Charset.forName(params.charset)
+val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
+val newGen = new UnivocityGenerator(dataSchema, os, params)
+univocityGenerator = Some(newGen)
+newGen
+  }
 
+  if (params.headerFlag) {
--- End diff --

My only nit here is that this is part of the constructor, but it lives between 
two methods, which is a little less clear. Maybe move it just after the member 
declarations?

Also, you could inline this to things like 
`getOrCreateGen().writeHeaders()`, but it doesn't matter.


---




[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-11-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23173#discussion_r237884151
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
 
   private var univocityGenerator: Option[UnivocityGenerator] = None
 
-  override def write(row: InternalRow): Unit = {
-val gen = univocityGenerator.getOrElse {
-  val charset = Charset.forName(params.charset)
-  val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
-  val newGen = new UnivocityGenerator(dataSchema, os, params)
-  univocityGenerator = Some(newGen)
-  newGen
-}
+  private def getOrCreateGen(): UnivocityGenerator = 
univocityGenerator.getOrElse {
--- End diff --

This is a really small thing and I don't feel strongly about it, but what 
about just `getGen()`? The caller doesn't care whether it has to be created.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23177: [SPARK-26212][Build][test-maven] Upgrade maven ve...

2018-11-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23177#discussion_r237881557
  
--- Diff: pom.xml ---
@@ -114,7 +114,7 @@
 1.8
 ${java.version}
 ${java.version}
-3.5.4
+3.6.0
--- End diff --

That wouldn't hurt, though I suspect the project will continue to build with 
Maven 3.5.4 for a good while.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...

2018-11-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23162#discussion_r237880921
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -513,7 +513,7 @@ package object config {
 "is written in unsafe shuffle writer. In KiB unless otherwise 
specified.")
   .bytesConf(ByteUnit.KiB)
   .checkValue(v => v > 0 && v <= 
ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH / 1024,
-s"The buffer size must be greater than 0 and less than" +
+s"The buffer size must be positive and not greater than" +
--- End diff --

Yeah, I could go either way on those. I wouldn't mind standardizing on 
"less than or equal to", sure. @10110346 would you mind taking one more pass 
accordingly?
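
i.e., something along these lines, using the file-buffer check above as an 
example (illustrative wording only):

```scala
.checkValue(v => v > 0 && v <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH / 1024,
  s"The buffer size must be positive and less than or equal to" +
    s" ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH / 1024}.")
```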


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code

2018-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23148
  
Yeah, we'd need a new PR. If you collect a few good improvements, just open a 
follow-up. I was testing by making a dummy change in a few files and seeing 
what it did. It's OK as-is, even.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23177: [SPARK-26212][Build][test-maven] Upgrade maven version t...

2018-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23177
  
Ah, that's the second time I've forgotten this. Yes, looks good to me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23126
  
Merged to master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code

2018-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23148
  
I played with this a little locally, and yeah, it does reformat entire files 
that are in the diff, and most of what it does is fix stuff we probably 
wouldn't ask for in a PR review. For example, is it possible to disable its 
preference for putting closing braces and parens on the next line, as in the 
comment at https://github.com/apache/spark/pull/23148#issuecomment-442243410? 
And maybe make it less aggressive about putting each method argument on its 
own line.

There isn't a mode that would reformat only the lines being changed, BTW, is 
there?

This is already useful in that we can just ask people to run dev/scalafmt 
(I'll update the developer guides), as the output style also looks just fine. 
I won't try to have the build apply the formatting automatically.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23145: [MINOR][Docs][WIP] Fix Typos

2018-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23145
  
I think this is fine to merge; it's a good batch of grammar fixes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23145: [MINOR][Docs][WIP] Fix Typos

2018-11-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23145#discussion_r237564429
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -1634,7 +1634,7 @@ returned through `Dataset.writeStream()`. You will 
have to specify one or more o
 
 - *Query name:* Optionally, specify a unique name of the query for 
identification.
 
-- *Trigger interval:* Optionally, specify the trigger interval. If it is 
not specified, the system will check for availability of new data as soon as 
the previous processing has completed. If a trigger time is missed because the 
previous processing has not completed, then the system will trigger processing 
immediately.
+- *Trigger interval:* Optionally, specify the trigger interval. If it is 
not specified, the system will check for availability of new data as soon as 
the previous processing has been completed. If a trigger time is missed because 
the previous processing has not been completed, then the system will trigger 
processing immediately.
--- End diff --

"has completed" is actually correct, but the passive voice here is 
grammatical too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...

2018-11-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23162#discussion_r237563687
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -503,7 +503,7 @@ package object config {
 "made in creating intermediate shuffle files.")
   .bytesConf(ByteUnit.KiB)
   .checkValue(v => v > 0 && v <= 
ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH / 1024,
-s"The file buffer size must be greater than 0 and less than" +
+s"The file buffer size must be greater than 0 and not greater 
than" +
--- End diff --

You can say 'positive' instead of 'greater than 0' if you like, as above.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...

2018-11-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23162#discussion_r237563435
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -430,8 +430,8 @@ package object config {
   .doc("The chunk size in bytes during writing out the bytes of 
ChunkedByteBuffer.")
   .bytesConf(ByteUnit.BYTE)
   .checkValue(_ <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH,
-"The chunk size during writing out the bytes of" +
-" ChunkedByteBuffer should not larger than Int.MaxValue - 15.")
+"The chunk size during writing out the bytes of ChunkedByteBuffer 
should" +
+  s" not larger than 
${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
--- End diff --

"should not larger than" should read either "should be at most" or "should 
not be greater than".


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code

2018-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23148
  
Ah, I see. I can add that call in a follow-up to enable it and see how we 
like it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23052
  
Merged to master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-11-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23173#discussion_r237561245
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala
 ---
@@ -57,6 +57,9 @@ abstract class OutputWriterFactory extends Serializable {
  * executor side.  This instance is used to persist rows to this single 
output file.
  */
 abstract class OutputWriter {
+  /** Initializes before writing any rows. Invoked on executor size. */
+  def init(): Unit
--- End diff --

Rather than make subclasses implement this as a no-op, why not just provide 
the no-op impl here?
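
Something like this (a sketch of the base class, not necessarily the final 
signature):

```scala
abstract class OutputWriter {
  /**
   * Initializes the writer before any rows are written. Invoked on the
   * executor side. No-op by default, so only writers that need per-file
   * setup (e.g. emitting a CSV header) have to override it.
   */
  def init(): Unit = {}

  /** Persists a single row to this output file. */
  def write(row: InternalRow): Unit

  /** Closes the writer, flushing any buffered output. */
  def close(): Unit
}
```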


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...

2018-11-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23151#discussion_r237559695
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -66,6 +66,20 @@ private[sql] trait SQLTestUtils extends SparkFunSuite 
with SQLTestUtilsBase with
 }
   }
 
+  /**
+   * Creates a temporary directory, which is then passed to `f` and will 
be deleted after `f`
+   * returns.
+   *
+   */
+  protected override def withTempDir(f: File => Unit): Unit = {
+val dir = Utils.createTempDir().getCanonicalFile
+try f(dir) finally {
--- End diff --

Why not call the super method with a function that calls `f` and then 
`waitForTasksToFinish()`?
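
i.e. roughly:

```scala
protected override def withTempDir(f: File => Unit): Unit = {
  super.withTempDir { dir =>
    f(dir)
    waitForTasksToFinish()
  }
}
```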


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...

2018-11-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23151#discussion_r237559321
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala ---
@@ -494,13 +494,12 @@ class SparkSubmitSuite
   }
 
   test("launch simple application with spark-submit with redaction") {
-val testDir = Utils.createTempDir()
-testDir.deleteOnExit()
-val testDirPath = new Path(testDir.getAbsolutePath())
 val unusedJar = TestUtils.createJarWithClasses(Seq.empty)
 val fileSystem = Utils.getHadoopFileSystem("/",
   SparkHadoopUtil.get.newConfiguration(new SparkConf()))
-try {
+withTempDir { testDir =>
+  testDir.deleteOnExit()
--- End diff --

Although I think this is redundant for temp dirs, you could put this in the 
`Utils.createTempDir` method itself and take it out of places like this.
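
i.e. something like this (a sketch; the signature and the `createDirectory` 
helper here just stand in for the existing directory-creation logic):

```scala
def createTempDir(
    root: String = System.getProperty("java.io.tmpdir"),
    namePrefix: String = "spark"): File = {
  val dir = createDirectory(root, namePrefix)
  // Register cleanup once here, so individual call sites don't need deleteOnExit()
  dir.deleteOnExit()
  dir
}
```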


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23177: [SPARK-26212][Build][test-maven] Upgrade maven version t...

2018-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23177
  
That's fine, but we also need to update build/mvn to download and use the 
same version.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23126#discussion_r237556042
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -128,6 +128,82 @@ class RowMatrix @Since("1.0.0") (
 RowMatrix.triuToFull(n, GU.data)
   }
 
+  private def computeDenseVectorCovariance(mean: Vector, n: Int, m: Long): 
Matrix = {
+
+val bc = rows.context.broadcast(mean)
+
+// Computes n*(n+1)/2, avoiding overflow in the multiplication.
+// This succeeds when n <= 65535, which is checked above
+val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
+
+val MU = rows.treeAggregate(new BDV[Double](nt))(
+  seqOp = (U, v) => {
+
+val n = v.size
+val na = Array.ofDim[Double](n)
+val means = bc.value
+if (v.isInstanceOf[DenseVector]) {
+  v.foreachActive{(index, value) =>
--- End diff --

Yeah, because it hasn't subtracted the mean from one of the values in the 
Spark vector. That's the general issue with centering a sparse vector: it 
becomes dense!
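
A toy illustration, assuming a single scalar mean of 0.6 just for simplicity 
(not the RowMatrix code):

```scala
import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.sparse(5, Seq((1, 3.0)))   // [0.0, 3.0, 0.0, 0.0, 0.0]
val mean = 0.6
// Centering has to subtract the mean from the implicit zeros too,
// so the result has no zero entries left -- it is effectively dense.
val centered = Vectors.dense(v.toArray.map(_ - mean))
// centered: [-0.6, 2.4, -0.6, -0.6, -0.6]
```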


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23126#discussion_r237541497
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -128,6 +128,82 @@ class RowMatrix @Since("1.0.0") (
 RowMatrix.triuToFull(n, GU.data)
   }
 
+  private def computeDenseVectorCovariance(mean: Vector, n: Int, m: Long): 
Matrix = {
+
+val bc = rows.context.broadcast(mean)
+
+// Computes n*(n+1)/2, avoiding overflow in the multiplication.
+// This succeeds when n <= 65535, which is checked above
+val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
+
+val MU = rows.treeAggregate(new BDV[Double](nt))(
+  seqOp = (U, v) => {
+
+val n = v.size
+val na = Array.ofDim[Double](n)
+val means = bc.value
+if (v.isInstanceOf[DenseVector]) {
+  v.foreachActive{(index, value) =>
--- End diff --

But isn't it incorrect to not subtract the mean from 0 elements in a sparse 
vector anyway?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


