[GitHub] spark pull request #20286: [SPARK-23119][SS] Minor fixes to V2 streaming API...

2018-01-16 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/20286#discussion_r161976455
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/Offset.java
 ---
@@ -17,12 +17,20 @@
 
 package org.apache.spark.sql.sources.v2.streaming.reader;
 
+import org.apache.spark.annotation.InterfaceStability;
+
 /**
- * An abstract representation of progress through a [[MicroBatchReader]] 
or [[ContinuousReader]].
- * During execution, Offsets provided by the data source implementation 
will be logged and used as
- * restart checkpoints. Sources should provide an Offset implementation 
which they can use to
- * reconstruct the stream position where the offset was taken.
+ * An abstract representation of progress through a {@link 
MicroBatchReader} or
+ * {@link ContinuousReader}.
+ * During execution, offsets provided by the data source implementation 
will be logged and used as
+ * restart checkpoints. Each source should provide an offset 
implementation which the source can use
+ * to reconstruct a position in the stream up to which data has been 
seen/processed.
+ *
+ * Note: This class currently extends {@link 
org.apache.spark.sql.execution.streaming.Offset} to
+ * maintain compatibility with DataSource V1 APIs. This will be extension 
will be removed once we
--- End diff --

ditto


---




[GitHub] spark pull request #20286: [SPARK-23119][SS] Minor fixes to V2 streaming API...

2018-01-16 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/20286#discussion_r161976313
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/MicroBatchReader.java
 ---
@@ -25,7 +26,11 @@
 /**
  * A mix-in interface for {@link DataSourceV2Reader}. Data source readers 
can implement this
  * interface to indicate they allow micro-batch streaming reads.
+ *
+ * Note: This class currently extends {@link BaseStreamingSource} to 
maintain compatibility with
+ * DataSource V1 APIs. This will be extension will be removed once we get 
rid of V1 completely.
--- End diff --

ditto


---




[GitHub] spark pull request #20286: [SPARK-23119][SS] Minor fixes to V2 streaming API...

2018-01-16 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/20286#discussion_r161976371
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/reader/ContinuousReader.java
 ---
@@ -27,11 +28,15 @@
  * interface to allow reading in a continuous processing mode stream.
  *
  * Implementations must ensure each read task output is a {@link 
ContinuousDataReader}.
+ *
+ * Note: This class currently extends {@link BaseStreamingSource} to 
maintain compatibility with
+ * DataSource V1 APIs. This will be extension will be removed once we get 
rid of V1 completely.
--- End diff --

nit: This ~~will be~~ extension


---




[GitHub] spark issue #20243: [SPARK-23052][SS] Migrate ConsoleSink to data source V2 ...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20243
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20243: [SPARK-23052][SS] Migrate ConsoleSink to data source V2 ...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20243
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86235/
Test PASSed.


---




[GitHub] spark issue #20243: [SPARK-23052][SS] Migrate ConsoleSink to data source V2 ...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20243
  
**[Test build #86235 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86235/testReport)**
 for PR 20243 at commit 
[`516fd4a`](https://github.com/apache/spark/commit/516fd4a919cbc421f0565357ef6696b7b8ba0728).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20243: [SPARK-23052][SS] Migrate ConsoleSink to data source V2 ...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20243
  
Build finished. Test PASSed.


---




[GitHub] spark issue #20243: [SPARK-23052][SS] Migrate ConsoleSink to data source V2 ...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20243
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86232/
Test PASSed.


---




[GitHub] spark issue #20243: [SPARK-23052][SS] Migrate ConsoleSink to data source V2 ...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20243
  
**[Test build #86232 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86232/testReport)**
 for PR 20243 at commit 
[`c0ec93f`](https://github.com/apache/spark/commit/c0ec93f99ab34a4201c1161bc68a5e72e8fe553a).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...

2018-01-16 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20265#discussion_r161973792
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+  conf.set("orc.compression", "snappy")
+  conf.set("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder()
+.master("local[1]")
+.appName("FilterPushdownBenchmark")
+.config(conf)
+.getOrCreate()
+
+  def withTempPath(f: File => Unit): Unit = {
+val path = Utils.createTempDir()
+path.delete()
+try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+val (keys, values) = pairs.unzip
+val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+(keys, values).zipped.foreach(spark.conf.set)
+try f finally {
+  keys.zip(currentValues).foreach {
+case (key, Some(value)) => spark.conf.set(key, value)
+case (key, None) => spark.conf.unset(key)
+  }
+}
+  }
+
+  private def prepareTable(dir: File, numRows: Int, width: Int): Unit = {
+import spark.implicits._
+val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+val df = spark.range(numRows).map(_ => 
Random.nextLong).selectExpr(selectExpr: _*)
+  .withColumn("id", monotonically_increasing_id())
+
+val dirORC = dir.getCanonicalPath + "/orc"
+val dirParquet = dir.getCanonicalPath + "/parquet"
+
+df.write.mode("overwrite").orc(dirORC)
+df.write.mode("overwrite").parquet(dirParquet)
+
+spark.read.orc(dirORC).createOrReplaceTempView("orcTable")
+spark.read.parquet(dirParquet).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(values: Int, title: String, expr: String): 
Unit = {
+val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+Seq(false, true).foreach { pushDownEnabled =>
+  val name = s"Parquet Vectorized ${if (pushDownEnabled) s"(Pushdown)" 
else ""}"
+  benchmark.addCase(name) { _ =>
+withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> 
s"$pushDownEnabled") {
+  spark.sql(s"SELECT * FROM parquetTable WHERE $expr").collect()
+}
+  }
+}
+
+Seq(false, true).foreach { pushDownEnabled =>
+  val name = s"Native ORC Vectorized ${if (pushDownEnabled) 
s"(Pushdown)" else ""}"
+  benchmark.addCase(name) { _ =>
+withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> 
s"$pushDownEnabled") {
+  spark.sql(s"SELECT * FROM orcTable WHERE $expr").collect()
+}
+  }
+}
+
+/*
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
+Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
+
+Select 0 row (id IS NULL):  Best/Avg Time(ms)Rate(M/s) 
  Per Row(ns)   Relative
+
---
+Parquet Vectorized2091 / 2258  0.5 
   1993.9   1.0X
+Parquet Vectorized (Pushdown)   41 /   

[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...

2018-01-16 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20265#discussion_r161973650
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+  conf.set("orc.compression", "snappy")
+  conf.set("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder()
+.master("local[1]")
+.appName("FilterPushdownBenchmark")
+.config(conf)
+.getOrCreate()
+
+  def withTempPath(f: File => Unit): Unit = {
+val path = Utils.createTempDir()
+path.delete()
+try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+val (keys, values) = pairs.unzip
+val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+(keys, values).zipped.foreach(spark.conf.set)
+try f finally {
+  keys.zip(currentValues).foreach {
+case (key, Some(value)) => spark.conf.set(key, value)
+case (key, None) => spark.conf.unset(key)
+  }
+}
+  }
+
+  private def prepareTable(dir: File, numRows: Int, width: Int): Unit = {
+import spark.implicits._
+val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+val df = spark.range(numRows).map(_ => 
Random.nextLong).selectExpr(selectExpr: _*)
+  .withColumn("id", monotonically_increasing_id())
+
+val dirORC = dir.getCanonicalPath + "/orc"
+val dirParquet = dir.getCanonicalPath + "/parquet"
+
+df.write.mode("overwrite").orc(dirORC)
+df.write.mode("overwrite").parquet(dirParquet)
+
+spark.read.orc(dirORC).createOrReplaceTempView("orcTable")
+spark.read.parquet(dirParquet).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(values: Int, title: String, expr: String): 
Unit = {
+val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+Seq(false, true).foreach { pushDownEnabled =>
+  val name = s"Parquet Vectorized ${if (pushDownEnabled) s"(Pushdown)" 
else ""}"
+  benchmark.addCase(name) { _ =>
+withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> 
s"$pushDownEnabled") {
+  spark.sql(s"SELECT * FROM parquetTable WHERE $expr").collect()
+}
+  }
+}
+
+Seq(false, true).foreach { pushDownEnabled =>
+  val name = s"Native ORC Vectorized ${if (pushDownEnabled) 
s"(Pushdown)" else ""}"
+  benchmark.addCase(name) { _ =>
+withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> 
s"$pushDownEnabled") {
+  spark.sql(s"SELECT * FROM orcTable WHERE $expr").collect()
+}
+  }
+}
+
+/*
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
+Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
+
+Select 0 row (id IS NULL):  Best/Avg Time(ms)Rate(M/s) 
  Per Row(ns)   Relative
+
---
+Parquet Vectorized2091 / 2258  0.5 
   1993.9   1.0X
+Parquet Vectorized (Pushdown)   41 /   

[GitHub] spark issue #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema test for f...

2018-01-16 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20266
  
Thank you, @cloud-fan , @gatorsmile , and @mgaido91 !


---




[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...

2018-01-16 Thread zuotingbing
Github user zuotingbing commented on the issue:

https://github.com/apache/spark/pull/20025
  
@cloud-fan @gatorsmile @liufengdb @felixcheung @srowen @vanzin  Could anybody take a further look at this PR and discuss it? Thanks!


---




[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17298
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86233/
Test FAILed.


---




[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17298
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17298
  
**[Test build #86233 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86233/testReport)**
 for PR 17298 at commit 
[`ae450e8`](https://github.com/apache/spark/commit/ae450e89178f4bfd835d3377e7acbd216f5e7fbb).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* fo...

2018-01-16 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20288#discussion_r161969687
  
--- Diff: python/pyspark/sql/context.py ---
@@ -147,7 +147,8 @@ def udf(self):
 
 :return: :class:`UDFRegistration`
 """
-return UDFRegistration(self)
+from pyspark.sql.session import UDFRegistration
+return UDFRegistration(self.sparkSession)
--- End diff --

How about `return self.sparkSession.udf`?


---




[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...

2018-01-16 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/20225
  
Fix the merge conflicts, and add [SS] to the title of this PR.


---




[GitHub] spark pull request #20243: [SPARK-23052][SS] Migrate ConsoleSink to data sou...

2018-01-16 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20243#discussion_r161970902
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala
 ---
@@ -0,0 +1,249 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming.sources
+
+import java.util.Optional
+
+import org.apache.spark.sql.{AnalysisException, Row}
+import org.apache.spark.sql.execution.datasources.DataSource
+import org.apache.spark.sql.execution.streaming.{LongOffset, 
RateStreamOffset}
+import 
org.apache.spark.sql.execution.streaming.continuous.ContinuousTrigger
+import org.apache.spark.sql.sources.DataSourceRegister
+import org.apache.spark.sql.sources.v2.{DataSourceV2, DataSourceV2Options}
+import org.apache.spark.sql.sources.v2.reader.ReadTask
+import org.apache.spark.sql.sources.v2.streaming.{ContinuousReadSupport, 
ContinuousWriteSupport, MicroBatchReadSupport, MicroBatchWriteSupport}
+import org.apache.spark.sql.sources.v2.streaming.reader.{ContinuousReader, 
MicroBatchReader, Offset, PartitionOffset}
+import org.apache.spark.sql.sources.v2.streaming.writer.ContinuousWriter
+import org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer
+import org.apache.spark.sql.streaming.{OutputMode, 
StreamingQueryException, StreamTest, Trigger}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
+
+case class FakeReader() extends MicroBatchReader with ContinuousReader {
+  def setOffsetRange(start: Optional[Offset], end: Optional[Offset]): Unit 
= {}
+  def getStartOffset: Offset = RateStreamOffset(Map())
+  def getEndOffset: Offset = RateStreamOffset(Map())
+  def deserializeOffset(json: String): Offset = RateStreamOffset(Map())
+  def commit(end: Offset): Unit = {}
+  def readSchema(): StructType = StructType(Seq())
+  def stop(): Unit = {}
+  def mergeOffsets(offsets: Array[PartitionOffset]): Offset = 
RateStreamOffset(Map())
+  def setOffset(start: Optional[Offset]): Unit = {}
+
+  def createReadTasks(): java.util.ArrayList[ReadTask[Row]] = {
+throw new IllegalStateException("fake source - cannot actually read")
+  }
+}
+
+trait FakeMicroBatchReadSupport extends MicroBatchReadSupport {
+  override def createMicroBatchReader(
+  schema: Optional[StructType],
+  checkpointLocation: String,
+  options: DataSourceV2Options): MicroBatchReader = FakeReader()
+}
+
+trait FakeContinuousReadSupport extends ContinuousReadSupport {
+  override def createContinuousReader(
+  schema: Optional[StructType],
+  checkpointLocation: String,
+  options: DataSourceV2Options): ContinuousReader = FakeReader()
+}
+
+trait FakeMicroBatchWriteSupport extends MicroBatchWriteSupport {
+  def createMicroBatchWriter(
+  queryId: String,
+  epochId: Long,
+  schema: StructType,
+  mode: OutputMode,
+  options: DataSourceV2Options): Optional[DataSourceV2Writer] = {
+throw new IllegalStateException("fake sink - cannot actually write")
+  }
+}
+
+trait FakeContinuousWriteSupport extends ContinuousWriteSupport {
+  def createContinuousWriter(
+  queryId: String,
+  schema: StructType,
+  mode: OutputMode,
+  options: DataSourceV2Options): Optional[ContinuousWriter] = {
+throw new IllegalStateException("fake sink - cannot actually write")
+  }
+}
+
+class FakeReadMicroBatchOnly extends DataSourceRegister with 
FakeMicroBatchReadSupport {
+  override def shortName(): String = "fake-read-microbatch-only"
+}
+
+class FakeReadContinuousOnly extends DataSourceRegister with 
FakeContinuousReadSupport {
+  override def shortName(): String = "fake-read-continuous-only"
+}
+
+class FakeReadBothModes extends DataSourceRegister
+with FakeMicroBatchReadSupport with FakeContinuousReadSupport {
+  override 

[GitHub] spark pull request #20243: [SPARK-23052][SS] Migrate ConsoleSink to data sou...

2018-01-16 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20243#discussion_r161970553
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala
 ---
@@ -0,0 +1,249 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming.sources
+
+import java.util.Optional
+
+import org.apache.spark.sql.{AnalysisException, Row}
+import org.apache.spark.sql.execution.datasources.DataSource
+import org.apache.spark.sql.execution.streaming.{LongOffset, 
RateStreamOffset}
+import 
org.apache.spark.sql.execution.streaming.continuous.ContinuousTrigger
+import org.apache.spark.sql.sources.DataSourceRegister
+import org.apache.spark.sql.sources.v2.{DataSourceV2, DataSourceV2Options}
+import org.apache.spark.sql.sources.v2.reader.ReadTask
+import org.apache.spark.sql.sources.v2.streaming.{ContinuousReadSupport, 
ContinuousWriteSupport, MicroBatchReadSupport, MicroBatchWriteSupport}
+import org.apache.spark.sql.sources.v2.streaming.reader.{ContinuousReader, 
MicroBatchReader, Offset, PartitionOffset}
+import org.apache.spark.sql.sources.v2.streaming.writer.ContinuousWriter
+import org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer
+import org.apache.spark.sql.streaming.{OutputMode, 
StreamingQueryException, StreamTest, Trigger}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
+
+case class FakeReader() extends MicroBatchReader with ContinuousReader {
+  def setOffsetRange(start: Optional[Offset], end: Optional[Offset]): Unit 
= {}
+  def getStartOffset: Offset = RateStreamOffset(Map())
+  def getEndOffset: Offset = RateStreamOffset(Map())
+  def deserializeOffset(json: String): Offset = RateStreamOffset(Map())
+  def commit(end: Offset): Unit = {}
+  def readSchema(): StructType = StructType(Seq())
+  def stop(): Unit = {}
+  def mergeOffsets(offsets: Array[PartitionOffset]): Offset = 
RateStreamOffset(Map())
+  def setOffset(start: Optional[Offset]): Unit = {}
+
+  def createReadTasks(): java.util.ArrayList[ReadTask[Row]] = {
+throw new IllegalStateException("fake source - cannot actually read")
+  }
+}
+
+trait FakeMicroBatchReadSupport extends MicroBatchReadSupport {
+  override def createMicroBatchReader(
+  schema: Optional[StructType],
+  checkpointLocation: String,
+  options: DataSourceV2Options): MicroBatchReader = FakeReader()
+}
+
+trait FakeContinuousReadSupport extends ContinuousReadSupport {
+  override def createContinuousReader(
+  schema: Optional[StructType],
+  checkpointLocation: String,
+  options: DataSourceV2Options): ContinuousReader = FakeReader()
+}
+
+trait FakeMicroBatchWriteSupport extends MicroBatchWriteSupport {
+  def createMicroBatchWriter(
+  queryId: String,
+  epochId: Long,
+  schema: StructType,
+  mode: OutputMode,
+  options: DataSourceV2Options): Optional[DataSourceV2Writer] = {
+throw new IllegalStateException("fake sink - cannot actually write")
+  }
+}
+
+trait FakeContinuousWriteSupport extends ContinuousWriteSupport {
+  def createContinuousWriter(
+  queryId: String,
+  schema: StructType,
+  mode: OutputMode,
+  options: DataSourceV2Options): Optional[ContinuousWriter] = {
+throw new IllegalStateException("fake sink - cannot actually write")
+  }
+}
+
+class FakeReadMicroBatchOnly extends DataSourceRegister with 
FakeMicroBatchReadSupport {
+  override def shortName(): String = "fake-read-microbatch-only"
+}
+
+class FakeReadContinuousOnly extends DataSourceRegister with 
FakeContinuousReadSupport {
+  override def shortName(): String = "fake-read-continuous-only"
+}
+
+class FakeReadBothModes extends DataSourceRegister
+with FakeMicroBatchReadSupport with FakeContinuousReadSupport {
+  override 

[GitHub] spark pull request #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* fo...

2018-01-16 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20288#discussion_r161966469
  
--- Diff: python/pyspark/sql/session.py ---
@@ -778,6 +778,146 @@ def __exit__(self, exc_type, exc_val, exc_tb):
 self.stop()
 
 
+class UDFRegistration(object):
+"""Wrapper for user-defined function registration."""
+
+def __init__(self, sparkSession):
+self.sparkSession = sparkSession
+
+@ignore_unicode_prefix
+def register(self, name, f, returnType=None):
+"""Registers a Python function (including lambda function) or a 
user-defined function
+in SQL statements.
+
+:param name: name of the user-defined function in SQL statements.
+:param f: a Python function, or a user-defined function. The 
user-defined function can
+be either row-at-a-time or vectorized. See 
:meth:`pyspark.sql.functions.udf` and
+:meth:`pyspark.sql.functions.pandas_udf`.
+:param returnType: the return type of the registered user-defined 
function.
+:return: a user-defined function.
+
+`returnType` can be optionally specified when `f` is a Python 
function but not
+when `f` is a user-defined function. See below:
+
+1. When `f` is a Python function, `returnType` defaults to string 
type and can be
+optionally specified. The produced object must match the specified 
type. In this case,
+this API works as if `register(name, f, returnType=StringType())`.
+
+>>> strlen = spark.udf.register("stringLengthString", lambda 
x: len(x))
+>>> spark.sql("SELECT stringLengthString('test')").collect()
+[Row(stringLengthString(test)=u'4')]
+
+>>> spark.sql("SELECT 'foo' AS 
text").select(strlen("text")).collect()
+[Row(stringLengthString(text)=u'3')]
+
+>>> from pyspark.sql.types import IntegerType
+>>> _ = spark.udf.register("stringLengthInt", lambda x: 
len(x), IntegerType())
+>>> spark.sql("SELECT stringLengthInt('test')").collect()
+[Row(stringLengthInt(test)=4)]
+
+>>> from pyspark.sql.types import IntegerType
+>>> _ = spark.udf.register("stringLengthInt", lambda x: 
len(x), IntegerType())
+>>> spark.sql("SELECT stringLengthInt('test')").collect()
+[Row(stringLengthInt(test)=4)]
+
+2. When `f` is a user-defined function, Spark uses the return type 
of the given a
+user-defined function as the return type of the registered a 
user-defined function.
--- End diff --

the registered a user-defined function -> the registered user-defined 
function


---




[GitHub] spark pull request #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* fo...

2018-01-16 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20288#discussion_r161964247
  
--- Diff: python/pyspark/sql/catalog.py ---
@@ -224,92 +225,18 @@ def dropGlobalTempView(self, viewName):
 """
 self._jcatalog.dropGlobalTempView(viewName)
 
-@ignore_unicode_prefix
-@since(2.0)
 def registerFunction(self, name, f, returnType=None):
-"""Registers a Python function (including lambda function) or a 
:class:`UserDefinedFunction`
-as a UDF. The registered UDF can be used in SQL statements.
-
-:func:`spark.udf.register` is an alias for 
:func:`spark.catalog.registerFunction`.
-
-In addition to a name and the function itself, `returnType` can be 
optionally specified.
-1) When f is a Python function, `returnType` defaults to a string. 
The produced object must
-match the specified type. 2) When f is a 
:class:`UserDefinedFunction`, Spark uses the return
-type of the given UDF as the return type of the registered UDF. 
The input parameter
-`returnType` is None by default. If given by users, the value must 
be None.
-
-:param name: name of the UDF in SQL statements.
-:param f: a Python function, or a wrapped/native 
UserDefinedFunction. The UDF can be either
-row-at-a-time or vectorized.
-:param returnType: the return type of the registered UDF.
-:return: a wrapped/native :class:`UserDefinedFunction`
-
->>> strlen = spark.catalog.registerFunction("stringLengthString", 
len)
->>> spark.sql("SELECT stringLengthString('test')").collect()
-[Row(stringLengthString(test)=u'4')]
-
->>> spark.sql("SELECT 'foo' AS 
text").select(strlen("text")).collect()
-[Row(stringLengthString(text)=u'3')]
-
->>> from pyspark.sql.types import IntegerType
->>> _ = spark.catalog.registerFunction("stringLengthInt", len, 
IntegerType())
->>> spark.sql("SELECT stringLengthInt('test')").collect()
-[Row(stringLengthInt(test)=4)]
-
->>> from pyspark.sql.types import IntegerType
->>> _ = spark.udf.register("stringLengthInt", len, IntegerType())
->>> spark.sql("SELECT stringLengthInt('test')").collect()
-[Row(stringLengthInt(test)=4)]
-
->>> from pyspark.sql.types import IntegerType
->>> from pyspark.sql.functions import udf
->>> slen = udf(lambda s: len(s), IntegerType())
->>> _ = spark.udf.register("slen", slen)
->>> spark.sql("SELECT slen('test')").collect()
-[Row(slen(test)=4)]
-
->>> import random
->>> from pyspark.sql.functions import udf
->>> from pyspark.sql.types import IntegerType
->>> random_udf = udf(lambda: random.randint(0, 100), 
IntegerType()).asNondeterministic()
->>> new_random_udf = spark.catalog.registerFunction("random_udf", 
random_udf)
->>> spark.sql("SELECT random_udf()").collect()  # doctest: +SKIP
-[Row(random_udf()=82)]
->>> spark.range(1).select(new_random_udf()).collect()  # doctest: 
+SKIP
-[Row(()=26)]
-
->>> from pyspark.sql.functions import pandas_udf, PandasUDFType
->>> @pandas_udf("integer", PandasUDFType.SCALAR)  # doctest: +SKIP
-... def add_one(x):
-... return x + 1
-...
->>> _ = spark.udf.register("add_one", add_one)  # doctest: +SKIP
->>> spark.sql("SELECT add_one(id) FROM range(3)").collect()  # 
doctest: +SKIP
-[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)]
-"""
-
-# This is to check whether the input function is a wrapped/native 
UserDefinedFunction
-if hasattr(f, 'asNondeterministic'):
-if returnType is not None:
-raise TypeError(
-"Invalid returnType: None is expected when f is a 
UserDefinedFunction, "
-"but got %s." % returnType)
-if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF,
-  PythonEvalType.SQL_PANDAS_SCALAR_UDF]:
-raise ValueError(
-"Invalid f: f must be either SQL_BATCHED_UDF or 
SQL_PANDAS_SCALAR_UDF")
-register_udf = UserDefinedFunction(f.func, 
returnType=f.returnType, name=name,
-   evalType=f.evalType,
-   
deterministic=f.deterministic)
-return_udf = f
-else:
-if returnType is None:
-returnType = StringType()
-register_udf = 

[GitHub] spark pull request #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* fo...

2018-01-16 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20288#discussion_r161964383
  
--- Diff: python/pyspark/sql/context.py ---
@@ -147,7 +147,8 @@ def udf(self):
 
 :return: :class:`UDFRegistration`
 """
-return UDFRegistration(self)
+from pyspark.sql.session import UDFRegistration
--- End diff --

Why do we import `UDFRegistration` here again? Isn't it imported at the top?


---




[GitHub] spark pull request #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* fo...

2018-01-16 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20288#discussion_r161966507
  
--- Diff: python/pyspark/sql/session.py ---
@@ -778,6 +778,146 @@ def __exit__(self, exc_type, exc_val, exc_tb):
 self.stop()
 
 
+class UDFRegistration(object):
+"""Wrapper for user-defined function registration."""
+
+def __init__(self, sparkSession):
+self.sparkSession = sparkSession
+
+@ignore_unicode_prefix
+def register(self, name, f, returnType=None):
+"""Registers a Python function (including lambda function) or a 
user-defined function
+in SQL statements.
+
+:param name: name of the user-defined function in SQL statements.
+:param f: a Python function, or a user-defined function. The 
user-defined function can
+be either row-at-a-time or vectorized. See 
:meth:`pyspark.sql.functions.udf` and
+:meth:`pyspark.sql.functions.pandas_udf`.
+:param returnType: the return type of the registered user-defined 
function.
+:return: a user-defined function.
+
+`returnType` can be optionally specified when `f` is a Python 
function but not
+when `f` is a user-defined function. See below:
+
+1. When `f` is a Python function, `returnType` defaults to string 
type and can be
+optionally specified. The produced object must match the specified 
type. In this case,
+this API works as if `register(name, f, returnType=StringType())`.
+
+>>> strlen = spark.udf.register("stringLengthString", lambda 
x: len(x))
+>>> spark.sql("SELECT stringLengthString('test')").collect()
+[Row(stringLengthString(test)=u'4')]
+
+>>> spark.sql("SELECT 'foo' AS 
text").select(strlen("text")).collect()
+[Row(stringLengthString(text)=u'3')]
+
+>>> from pyspark.sql.types import IntegerType
+>>> _ = spark.udf.register("stringLengthInt", lambda x: 
len(x), IntegerType())
+>>> spark.sql("SELECT stringLengthInt('test')").collect()
+[Row(stringLengthInt(test)=4)]
+
+>>> from pyspark.sql.types import IntegerType
+>>> _ = spark.udf.register("stringLengthInt", lambda x: 
len(x), IntegerType())
+>>> spark.sql("SELECT stringLengthInt('test')").collect()
+[Row(stringLengthInt(test)=4)]
+
+2. When `f` is a user-defined function, Spark uses the return type 
of the given a
--- End diff --

of the given a user-defined function -> of the given user-defined function



---




[GitHub] spark pull request #20243: [SPARK-23052][SS] Migrate ConsoleSink to data sou...

2018-01-16 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20243#discussion_r161969954
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala 
---
@@ -280,14 +280,12 @@ final class DataStreamWriter[T] private[sql](ds: 
Dataset[T]) {
 useTempCheckpointLocation = true,
 trigger = trigger)
 } else {
-  val sink = trigger match {
-case _: ContinuousTrigger =>
-  val ds = DataSource.lookupDataSource(source, 
df.sparkSession.sessionState.conf)
-  ds.newInstance() match {
-case w: ContinuousWriteSupport => w
-case _ => throw new AnalysisException(
-  s"Data source $source does not support continuous writing")
-  }
+  val ds = DataSource.lookupDataSource(source, 
df.sparkSession.sessionState.conf)
+  val sink = (ds.newInstance(), trigger) match {
+case (w: ContinuousWriteSupport, _: ContinuousTrigger) => w
+case (_, _: ContinuousTrigger) => throw new 
UnsupportedOperationException(
+s"Data source $source does not support continuous writing")
+case (w: MicroBatchWriteSupport, _) => w
--- End diff --

Isn't there a case where it does not have MicroBatchWriteSupport, but the trigger is ProcessingTime/OneTime? That should have a different error message.
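
For reference, a minimal sketch of the extra fallthrough case being asked for; the exception type follows the AnalysisException suggestion made elsewhere in this review, and the message wording is only illustrative:

```scala
val sink = (ds.newInstance(), trigger) match {
  case (w: ContinuousWriteSupport, _: ContinuousTrigger) => w
  case (_, _: ContinuousTrigger) =>
    throw new AnalysisException(
      s"Data source $source does not support continuous writing")
  case (w: MicroBatchWriteSupport, _) => w
  // New case: a ProcessingTime/OneTime trigger against a source without
  // MicroBatchWriteSupport gets its own, distinct error message.
  case _ =>
    throw new AnalysisException(
      s"Data source $source does not support streamed writing")
}
```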


---




[GitHub] spark issue #20282: [SPARK-23093][SS] Don't change run id when reconfiguring...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20282
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86236/
Test FAILed.


---




[GitHub] spark issue #20282: [SPARK-23093][SS] Don't change run id when reconfiguring...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20282
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20282: [SPARK-23093][SS] Don't change run id when reconfiguring...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20282
  
**[Test build #86236 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86236/testReport)**
 for PR 20282 at commit 
[`70917a5`](https://github.com/apache/spark/commit/70917a59c052f4b9a3859b03528194b9d16d384e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class KafkaContinuousReader(`
  * `case class KafkaContinuousReadTask(`
  * `class KafkaContinuousDataReader(`
  * `class KafkaContinuousWriter(`
  * `case class KafkaContinuousWriterFactory(`
  * `class KafkaContinuousDataWriter(`
  * `case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, 
Long]) extends OffsetV2 `
  * `case class KafkaSourcePartitionOffset(topicPartition: TopicPartition, 
partitionOffset: Long)`


---




[GitHub] spark pull request #20243: [SPARK-23052][SS] Migrate ConsoleSink to data sou...

2018-01-16 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20243#discussion_r161969566
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala 
---
@@ -280,14 +280,12 @@ final class DataStreamWriter[T] private[sql](ds: 
Dataset[T]) {
 useTempCheckpointLocation = true,
 trigger = trigger)
 } else {
-  val sink = trigger match {
-case _: ContinuousTrigger =>
-  val ds = DataSource.lookupDataSource(source, 
df.sparkSession.sessionState.conf)
-  ds.newInstance() match {
-case w: ContinuousWriteSupport => w
-case _ => throw new AnalysisException(
-  s"Data source $source does not support continuous writing")
-  }
+  val ds = DataSource.lookupDataSource(source, 
df.sparkSession.sessionState.conf)
+  val sink = (ds.newInstance(), trigger) match {
+case (w: ContinuousWriteSupport, _: ContinuousTrigger) => w
+case (_, _: ContinuousTrigger) => throw new 
UnsupportedOperationException(
--- End diff --

AnalysisException.
Incorrect trigger or incompatible data source is not an operation.


---




[GitHub] spark pull request #20243: [SPARK-23052][SS] Migrate ConsoleSink to data sou...

2018-01-16 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20243#discussion_r161968219
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/PackedRowWriterFactory.scala
 ---
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.sources
+
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.sources.v2.writer.{DataWriter, 
DataWriterFactory, WriterCommitMessage}
+
+/**
+ * A simple [[DataWriterFactory]] whose tasks just pack rows into the 
commit message for delivery
+ * to a [[org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer]] on 
the driver.
+ */
+case object PackedRowWriterFactory extends DataWriterFactory[Row] {
+  def createDataWriter(partitionId: Int, attemptNumber: Int): 
DataWriter[Row] = {
+new PackedRowDataWriter()
+  }
+}
+
+case class PackedRowCommitMessage(rows: Array[Row]) extends 
WriterCommitMessage
+
+class PackedRowDataWriter() extends DataWriter[Row] with Logging {
+  private val data = mutable.Buffer[Row]()
+
+  override def write(row: Row): Unit = data.append(row)
+
+  override def commit(): PackedRowCommitMessage = {
+val msg = PackedRowCommitMessage(data.clone().toArray)
--- End diff --

Why are you cloning and then calling toArray? Just toArray will create an immutable copy.
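
For reference, only the line under discussion (the rest of `commit()` is unchanged and not shown in the diff); `Buffer.toArray` already copies the contents into a fresh `Array[Row]`, so the `clone()` is redundant:

```scala
// toArray copies the mutable buffer into a new Array[Row] on its own.
val msg = PackedRowCommitMessage(data.toArray)
```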


---




[GitHub] spark pull request #20243: [SPARK-23052][SS] Migrate ConsoleSink to data sou...

2018-01-16 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20243#discussion_r161967927
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/PackedRowWriterFactory.scala
 ---
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.sources
+
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.sources.v2.writer.{DataWriter, 
DataWriterFactory, WriterCommitMessage}
+
+/**
+ * A simple [[DataWriterFactory]] whose tasks just pack rows into the 
commit message for delivery
+ * to a [[org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer]] on 
the driver.
+ */
+case object PackedRowWriterFactory extends DataWriterFactory[Row] {
+  def createDataWriter(partitionId: Int, attemptNumber: Int): 
DataWriter[Row] = {
+new PackedRowDataWriter()
+  }
+}
+
+case class PackedRowCommitMessage(rows: Array[Row]) extends 
WriterCommitMessage
+
+class PackedRowDataWriter() extends DataWriter[Row] with Logging {
--- End diff --

add docs.


---




[GitHub] spark pull request #20243: [SPARK-23052][SS] Migrate ConsoleSink to data sou...

2018-01-16 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20243#discussion_r161967938
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/PackedRowWriterFactory.scala
 ---
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.sources
+
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.sources.v2.writer.{DataWriter, 
DataWriterFactory, WriterCommitMessage}
+
+/**
+ * A simple [[DataWriterFactory]] whose tasks just pack rows into the 
commit message for delivery
+ * to a [[org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer]] on 
the driver.
+ */
+case object PackedRowWriterFactory extends DataWriterFactory[Row] {
+  def createDataWriter(partitionId: Int, attemptNumber: Int): 
DataWriter[Row] = {
+new PackedRowDataWriter()
+  }
+}
+
+case class PackedRowCommitMessage(rows: Array[Row]) extends 
WriterCommitMessage
--- End diff --

add docs.


---




[GitHub] spark pull request #20243: [SPARK-23052][SS] Migrate ConsoleSink to data sou...

2018-01-16 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20243#discussion_r161967603
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ConsoleWriter.scala
 ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.sources
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.sources.v2.DataSourceV2Options
+import org.apache.spark.sql.sources.v2.writer.{DataSourceV2Writer, 
DataWriterFactory, WriterCommitMessage}
+import org.apache.spark.sql.types.StructType
+
+class ConsoleWriter(batchId: Long, schema: StructType, options: 
DataSourceV2Options)
+extends DataSourceV2Writer with Logging {
+  // Number of rows to display, by default 20 rows
+  private val numRowsToShow = options.getInt("numRows", 20)
+
+  // Truncate the displayed data if it is too long, by default it is true
+  private val isTruncated = options.getBoolean("truncate", true)
+
+  assert(SparkSession.getActiveSession.isDefined)
+  private val spark = SparkSession.getActiveSession.get
+
+  override def createWriterFactory(): DataWriterFactory[Row] = 
PackedRowWriterFactory
+
+  override def commit(messages: Array[WriterCommitMessage]): Unit = 
synchronized {
+val batch = messages.collect {
+  case PackedRowCommitMessage(rows) => rows
+}.fold(Array())(_ ++ _)
--- End diff --

Why this complicated fold? Just `array.collect { ... }` returns an Array, doesn't it?
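
A sketch of the flatter alternative being hinted at, assuming the `PackedRowCommitMessage(rows: Array[Row])` shape from this PR; `flatMap` both selects the packed messages and concatenates their row arrays, so the explicit fold disappears:

```scala
val batch: Array[Row] = messages.flatMap {
  case PackedRowCommitMessage(rows) => rows
  case _ => Array.empty[Row]
}
```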


---




[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

2018-01-16 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/20280
  
Thanks @HyukjinKwon and @MrBago for reviewing.  After thinking about this 
some more, I don't think this is the right solution.  Like @HyukjinKwon pointed 
out, the supplied schema names should always override any specified from `Row` 
even if it was made from kwargs.  So that means `toDF` must go by position, and 
for `Row` with kwargs that is with field names sorted.  That seems a little 
strange but I believe it is mostly due to a Python limitation of kwargs not 
being in a specific order and I don't know if there is much we can do about it.

Something still seems wrong, though, because the example @MrBago has in the JIRA is inconsistent. I'll go over it again tomorrow when I'm more awake.


---




[GitHub] spark pull request #20243: [SPARK-23052][SS] Migrate ConsoleSink to data sou...

2018-01-16 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20243#discussion_r161966917
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ConsoleWriter.scala
 ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.sources
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.sources.v2.DataSourceV2Options
+import org.apache.spark.sql.sources.v2.writer.{DataSourceV2Writer, 
DataWriterFactory, WriterCommitMessage}
+import org.apache.spark.sql.types.StructType
+
+class ConsoleWriter(batchId: Long, schema: StructType, options: 
DataSourceV2Options)
--- End diff --

add docs and link it to the ConsoleSinkProvider since it's in a different 
file.
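
As an illustration only (the wording below is not taken from the PR), the requested Scaladoc could sit directly above the existing `class ConsoleWriter(...)` declaration, e.g.:

```scala
/**
 * A [[DataSourceV2Writer]] that collects the rows packed into each
 * [[PackedRowCommitMessage]] and prints up to `numRows` of them to the
 * console when the batch commits.
 *
 * Instantiated by `ConsoleSinkProvider` (defined in a separate file), which
 * forwards the sink options `numRows` and `truncate` via `DataSourceV2Options`.
 */
```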


---




[GitHub] spark pull request #20243: [SPARK-23052][SS] Migrate ConsoleSink to data sou...

2018-01-16 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20243#discussion_r161966755
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
 ---
@@ -54,7 +54,7 @@ class ContinuousExecution(
 sparkSession, name, checkpointRoot, analyzedPlan, sink,
 trigger, triggerClock, outputMode, deleteCheckpointOnStop) {
 
-  @volatile protected var continuousSources: Seq[ContinuousReader] = _
+  @volatile protected var continuousSources: Seq[ContinuousReader] = Seq()
--- End diff --

Why this change? Is it related to this PR?


---




[GitHub] spark pull request #20243: [SPARK-23052][SS] Migrate ConsoleSink to data sou...

2018-01-16 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20243#discussion_r161966696
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala
 ---
@@ -69,7 +69,7 @@ class ContinuousExecution(
   ContinuousExecutionRelation(source, extraReaderOptions, 
output)(sparkSession)
 })
   case StreamingRelationV2(_, sourceName, _, _, _) =>
-throw new AnalysisException(
+throw new UnsupportedOperationException(
--- End diff --

Why this change? An incorrect data source is not an operation. 


---




[GitHub] spark issue #20282: [SPARK-23093][SS] Don't change run id when reconfiguring...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20282
  
**[Test build #86250 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86250/testReport)**
 for PR 20282 at commit 
[`70917a5`](https://github.com/apache/spark/commit/70917a59c052f4b9a3859b03528194b9d16d384e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20286: [SPARK-23119][SS] Minor fixes to V2 streaming APIs

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20286
  
**[Test build #86249 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86249/testReport)**
 for PR 20286 at commit 
[`f6dfa58`](https://github.com/apache/spark/commit/f6dfa58e2bc47f447c02d54a834e164e70083db9).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema tes...

2018-01-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20266


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20286: [SPARK-23119][SS] Minor fixes to V2 streaming APIs

2018-01-16 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/20286
  
jenkins retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20282: [SPARK-23093][SS] Don't change run id when reconfiguring...

2018-01-16 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/20282
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema test for f...

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20266
  
thanks, merging to master/2.3!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20289: Fix merge between 07ae39d0ec and 1667057851

2018-01-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20289


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* fo...

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20288#discussion_r161965369
  
--- Diff: python/pyspark/sql/session.py ---
@@ -778,6 +778,146 @@ def __exit__(self, exc_type, exc_val, exc_tb):
 self.stop()
 
 
+class UDFRegistration(object):
+"""Wrapper for user-defined function registration."""
+
+def __init__(self, sparkSession):
+self.sparkSession = sparkSession
+
+@ignore_unicode_prefix
+def register(self, name, f, returnType=None):
--- End diff --

shall we add `since 2.3`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20289: Fix merge between 07ae39d0ec and 1667057851

2018-01-16 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/20289
  
LGTM. Merging to master and 2.3.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* fo...

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20288#discussion_r161965278
  
--- Diff: python/pyspark/sql/context.py ---
@@ -624,6 +536,9 @@ def _test():
 globs['os'] = os
 globs['sc'] = sc
 globs['sqlContext'] = SQLContext(sc)
+# 'spark' alias is a small hack for reusing doctests. Please see the 
reassignment
+# of docstrings above.
+globs['spark'] = globs['sqlContext']
--- End diff --

shall we do `globs['spark'] = globs['sqlContext'].sparkSession`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema test for f...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20266
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86227/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema test for f...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20266
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema test for f...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20266
  
**[Test build #86227 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86227/testReport)**
 for PR 20266 at commit 
[`c67809c`](https://github.com/apache/spark/commit/c67809c9dbe0a21011649dceededa84d73377d1c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

2018-01-16 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20280#discussion_r161964785
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2306,18 +2306,20 @@ def test_toDF_with_schema_string(self):
 self.assertEqual(df.schema.simpleString(), 
"struct")
 self.assertEqual(df.collect(), [Row(key=str(i), value=str(i)) for 
i in range(100)])
 
-# field names can differ.
-df = rdd.toDF(" a: int, b: string ")
--- End diff --

Yeah, I think you're right. The schema should be able to override names no 
matter what.  I was thinking it was flawed because it relied on the Row fields 
being sorted internally, so just changing the kwarg names (not the order) could 
cause it to fail.  That seems a little strange, but maybe it is the intent?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* fo...

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20288#discussion_r161964738
  
--- Diff: python/pyspark/sql/session.py ---
@@ -778,6 +778,146 @@ def __exit__(self, exc_type, exc_val, exc_tb):
 self.stop()
 
 
+class UDFRegistration(object):
--- End diff --

shall we put it in `udf.py`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs ...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20288
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86247/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs ...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20288
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs ...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20288
  
**[Test build #86247 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86247/testReport)**
 for PR 20288 at commit 
[`08438ee`](https://github.com/apache/spark/commit/08438ee7d8c209a2dcb3eb4efeeef77451feb8d7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13143: [SPARK-15359] [Mesos] Mesos dispatcher should handle DRI...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13143
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20265#discussion_r161964178
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+  conf.set("orc.compression", "snappy")
+  conf.set("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder()
+.master("local[1]")
+.appName("FilterPushdownBenchmark")
+.config(conf)
+.getOrCreate()
+
+  def withTempPath(f: File => Unit): Unit = {
+val path = Utils.createTempDir()
+path.delete()
+try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+val (keys, values) = pairs.unzip
+val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+(keys, values).zipped.foreach(spark.conf.set)
+try f finally {
+  keys.zip(currentValues).foreach {
+case (key, Some(value)) => spark.conf.set(key, value)
+case (key, None) => spark.conf.unset(key)
+  }
+}
+  }
+
+  private def prepareTable(dir: File, numRows: Int, width: Int): Unit = {
+import spark.implicits._
+val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+val df = spark.range(numRows).map(_ => 
Random.nextLong).selectExpr(selectExpr: _*)
+  .withColumn("id", monotonically_increasing_id())
+
+val dirORC = dir.getCanonicalPath + "/orc"
+val dirParquet = dir.getCanonicalPath + "/parquet"
+
+df.write.mode("overwrite").orc(dirORC)
+df.write.mode("overwrite").parquet(dirParquet)
+
+spark.read.orc(dirORC).createOrReplaceTempView("orcTable")
+spark.read.parquet(dirParquet).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(values: Int, title: String, expr: String): 
Unit = {
+val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+Seq(false, true).foreach { pushDownEnabled =>
+  val name = s"Parquet Vectorized ${if (pushDownEnabled) s"(Pushdown)" 
else ""}"
+  benchmark.addCase(name) { _ =>
+withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> 
s"$pushDownEnabled") {
+  spark.sql(s"SELECT * FROM parquetTable WHERE $expr").collect()
+}
+  }
+}
+
+Seq(false, true).foreach { pushDownEnabled =>
+  val name = s"Native ORC Vectorized ${if (pushDownEnabled) 
s"(Pushdown)" else ""}"
+  benchmark.addCase(name) { _ =>
+withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> 
s"$pushDownEnabled") {
+  spark.sql(s"SELECT * FROM orcTable WHERE $expr").collect()
+}
+  }
+}
+
+/*
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
+Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
+
+Select 0 row (id IS NULL):          Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
+----------------------------------------------------------------------------------------------
+Parquet Vectorized                        2091 / 2258          0.5         1993.9        1.0X
+Parquet Vectorized (Pushdown)               41 /   44

[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20265#discussion_r161963925
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+  conf.set("orc.compression", "snappy")
+  conf.set("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder()
+.master("local[1]")
+.appName("FilterPushdownBenchmark")
+.config(conf)
+.getOrCreate()
+
+  def withTempPath(f: File => Unit): Unit = {
+val path = Utils.createTempDir()
+path.delete()
+try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+val (keys, values) = pairs.unzip
+val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+(keys, values).zipped.foreach(spark.conf.set)
+try f finally {
+  keys.zip(currentValues).foreach {
+case (key, Some(value)) => spark.conf.set(key, value)
+case (key, None) => spark.conf.unset(key)
+  }
+}
+  }
+
+  private def prepareTable(dir: File, numRows: Int, width: Int): Unit = {
+import spark.implicits._
+val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+val df = spark.range(numRows).map(_ => 
Random.nextLong).selectExpr(selectExpr: _*)
+  .withColumn("id", monotonically_increasing_id())
+
+val dirORC = dir.getCanonicalPath + "/orc"
+val dirParquet = dir.getCanonicalPath + "/parquet"
+
+df.write.mode("overwrite").orc(dirORC)
+df.write.mode("overwrite").parquet(dirParquet)
+
+spark.read.orc(dirORC).createOrReplaceTempView("orcTable")
+spark.read.parquet(dirParquet).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(values: Int, title: String, expr: String): 
Unit = {
+val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+Seq(false, true).foreach { pushDownEnabled =>
+  val name = s"Parquet Vectorized ${if (pushDownEnabled) s"(Pushdown)" 
else ""}"
+  benchmark.addCase(name) { _ =>
+withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> 
s"$pushDownEnabled") {
+  spark.sql(s"SELECT * FROM parquetTable WHERE $expr").collect()
+}
+  }
+}
+
+Seq(false, true).foreach { pushDownEnabled =>
+  val name = s"Native ORC Vectorized ${if (pushDownEnabled) 
s"(Pushdown)" else ""}"
+  benchmark.addCase(name) { _ =>
+withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> 
s"$pushDownEnabled") {
+  spark.sql(s"SELECT * FROM orcTable WHERE $expr").collect()
+}
+  }
+}
+
+/*
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
+Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
+
+Select 0 row (id IS NULL):          Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
+----------------------------------------------------------------------------------------------
+Parquet Vectorized                        2091 / 2258          0.5         1993.9        1.0X
+Parquet Vectorized (Pushdown)               41 /   44

[GitHub] spark pull request #20290: Testing ssuchter prb branch8

2018-01-16 Thread ssuchter
Github user ssuchter closed the pull request at:

https://github.com/apache/spark/pull/20290


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20287: [SPARK-23121][WEB-UI] When the Spark Streaming app is ru...

2018-01-16 Thread guoxiaolongzte
Github user guoxiaolongzte commented on the issue:

https://github.com/apache/spark/pull/20287
  
@smurakozi 
Please help review the code; this bug results from the functionality you added. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20265: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20265
  
LGTM except one comment. Let's worry about row group/stripe size later, 
since both parquet and orc use default settings, I think it's still fair.
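
For reference, the row group / stripe sizes deferred here come from the writers' Hadoop
settings; a minimal sketch of how the benchmark could pin both explicitly (the keys are the
standard parquet-mr and ORC ones, the 128 MB values are illustrative, and whether pinning
them changes the comparison is the open question):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("FilterPushdownBenchmark")
      .getOrCreate()

    // parquet.block.size controls the Parquet row group size (bytes);
    // orc.stripe.size controls the ORC stripe size (bytes).
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.setLong("parquet.block.size", 128L * 1024 * 1024)
    hadoopConf.setLong("orc.stripe.size", 128L * 1024 * 1024)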


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20290: Testing ssuchter prb branch8

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20290
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20290: Testing ssuchter prb branch8

2018-01-16 Thread ssuchter
GitHub user ssuchter opened a pull request:

https://github.com/apache/spark/pull/20290

Testing ssuchter prb branch8

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ssuchter/spark testing-ssuchter-prb-branch8

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20290.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20290


commit b8322587bbab0e3c0c9958657f4aeae1cc748453
Author: Sean Suchter 
Date:   2018-01-11T19:11:24Z

Comment out git comment calls

commit 7b8848b0371ddd35a2606520cb2e256285c30fa7
Author: Sean Suchter 
Date:   2018-01-17T01:43:45Z

Merge branch 'master' of git://github.com/apache/spark

commit 5ee2aaaf3b02c4ab4e1f2713882645cece53440d
Author: Sean Suchter 
Date:   2018-01-17T06:09:09Z

Dummy change 8




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20265#discussion_r161963288
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+  conf.set("orc.compression", "snappy")
+  conf.set("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder()
+.master("local[1]")
+.appName("FilterPushdownBenchmark")
+.config(conf)
+.getOrCreate()
+
+  def withTempPath(f: File => Unit): Unit = {
+val path = Utils.createTempDir()
+path.delete()
+try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+val (keys, values) = pairs.unzip
+val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+(keys, values).zipped.foreach(spark.conf.set)
+try f finally {
+  keys.zip(currentValues).foreach {
+case (key, Some(value)) => spark.conf.set(key, value)
+case (key, None) => spark.conf.unset(key)
+  }
+}
+  }
+
+  private def prepareTable(dir: File, numRows: Int, width: Int): Unit = {
+import spark.implicits._
+val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+val df = spark.range(numRows).map(_ => 
Random.nextLong).selectExpr(selectExpr: _*)
+  .withColumn("id", monotonically_increasing_id())
+
+val dirORC = dir.getCanonicalPath + "/orc"
+val dirParquet = dir.getCanonicalPath + "/parquet"
+
+df.write.mode("overwrite").orc(dirORC)
+df.write.mode("overwrite").parquet(dirParquet)
+
+spark.read.orc(dirORC).createOrReplaceTempView("orcTable")
+spark.read.parquet(dirParquet).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(values: Int, title: String, expr: String): 
Unit = {
+val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+Seq(false, true).foreach { pushDownEnabled =>
+  val name = s"Parquet Vectorized ${if (pushDownEnabled) s"(Pushdown)" 
else ""}"
+  benchmark.addCase(name) { _ =>
+withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> 
s"$pushDownEnabled") {
+  spark.sql(s"SELECT * FROM parquetTable WHERE $expr").collect()
+}
+  }
+}
+
+Seq(false, true).foreach { pushDownEnabled =>
+  val name = s"Native ORC Vectorized ${if (pushDownEnabled) 
s"(Pushdown)" else ""}"
+  benchmark.addCase(name) { _ =>
+withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> 
s"$pushDownEnabled") {
+  spark.sql(s"SELECT * FROM orcTable WHERE $expr").collect()
+}
+  }
+}
+
+/*
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
+Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
+
+Select 0 row (id IS NULL):          Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
+----------------------------------------------------------------------------------------------
+Parquet Vectorized                        2091 / 2258          0.5         1993.9        1.0X
+Parquet Vectorized (Pushdown)               41 /   44

[GitHub] spark issue #17280: [SPARK-19939] [ML] Add support for association rules in ...

2018-01-16 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/17280
  
Thanks for taking a look @MLnick 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17280: [SPARK-19939] [ML] Add support for association ru...

2018-01-16 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17280#discussion_r161962624
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
@@ -319,9 +323,11 @@ object FPGrowthModel extends MLReadable[FPGrowthModel] 
{
 
 override def load(path: String): FPGrowthModel = {
   val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+  implicit val format = DefaultFormats
+  val numTrainingRecords = (metadata.metadata \ 
"numTrainingRecords").extract[Long]
--- End diff --

Yes it does now. If saving numTrainingRecords to FPGrowthModel looks good, 
I can update the load logic.
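
A standalone json4s sketch of the save/load round trip being discussed, with the key name
taken from the diff and the count value purely illustrative (the real patch would fold the
field into the model writer's metadata rather than print it):

    import org.json4s._
    import org.json4s.JsonDSL._
    import org.json4s.jackson.JsonMethods._

    implicit val format = DefaultFormats

    // Save side: attach the training count to the metadata JSON.
    val numTrainingRecords = 12345L                        // illustrative value
    val extraMetadata: JObject = "numTrainingRecords" -> numTrainingRecords
    val json = compact(render(extraMetadata))              // {"numTrainingRecords":12345}

    // Load side: mirrors the load() snippet above.
    val restored = (parse(json) \ "numTrainingRecords").extract[Long]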


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20201: [SPARK-22389][SQL] data source v2 partitioning reporting...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20201
  
**[Test build #86248 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86248/testReport)**
 for PR 20201 at commit 
[`713140a`](https://github.com/apache/spark/commit/713140af68aa22f06b3bfc5fa28bbc32ce9efa6e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20201: [SPARK-22389][SQL] data source v2 partitioning reporting...

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20201
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20287: [SPARK-23121][WEB-UI] When the Spark Streaming app is ru...

2018-01-16 Thread guoxiaolongzte
Github user guoxiaolongzte commented on the issue:

https://github.com/apache/spark/pull/20287
  
Well, then can you tell me what specific changes to make? I do not have a good idea 
right now. The problem is that the page crashes, which should be treated as a fatal bug.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs ...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20288
  
**[Test build #86247 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86247/testReport)**
 for PR 20288 at commit 
[`08438ee`](https://github.com/apache/spark/commit/08438ee7d8c209a2dcb3eb4efeeef77451feb8d7).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20223: [SPARK-23020][core] Fix races in launcher code, test.

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20223
  
We didn't run the yarn test with this PR due to 
https://issues.apache.org/jira/browse/SPARK-10300

This indicates that it might be a bad idea to skip the YARN tests when a PR does not 
change YARN code: in this case we touched common Spark code and still caused a failure 
in the YARN tests.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20216: [SPARK-23024][WEB-UI]Spark ui about the contents of the ...

2018-01-16 Thread guoxiaolongzte
Github user guoxiaolongzte commented on the issue:

https://github.com/apache/spark/pull/20216
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20259: [SPARK-23066][WEB-UI] Master Page increase master...

2018-01-16 Thread guoxiaolongzte
Github user guoxiaolongzte commented on a diff in the pull request:

https://github.com/apache/spark/pull/20259#discussion_r161959792
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala 
---
@@ -179,6 +181,7 @@ private[deploy] class Master(
 }
 persistenceEngine = persistenceEngine_
 leaderElectionAgent = leaderElectionAgent_
+startupTime = System.currentTimeMillis()
--- End diff --

I understand what you mean.
I do not query it constantly, but developers occasionally want to observe it.
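
If the goal is only occasional observation, one low-cost option is to show an uptime value
derived from the recorded `startupTime` rather than a raw timestamp (Spark's own pages format
durations with `UIUtils.formatDuration`); a small illustrative sketch, not the actual
MasterPage code:

    // Recorded once when the master starts, as in the diff above.
    val startupTime: Long = System.currentTimeMillis()

    // Later, when rendering the page, show elapsed time instead of a raw timestamp.
    def uptimeString(nowMs: Long, startMs: Long): String = {
      val seconds = (nowMs - startMs) / 1000
      s"${seconds / 3600} h ${(seconds % 3600) / 60} min"
    }

    println(uptimeString(System.currentTimeMillis(), startupTime))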


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20281: [SPARK-23089][STS] Recreate session log directory if it ...

2018-01-16 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/20281
  
LGTM. So it looks like the fix is exactly the same as Hive's.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20277
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20277
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86222/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20277
  
**[Test build #86222 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86222/testReport)**
 for PR 20277 at commit 
[`3cba91b`](https://github.com/apache/spark/commit/3cba91bf89fe1eb7c607d07facb5eea916dd7908).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class OrcColumnVector implements 
org.apache.spark.sql.vectorized.ColumnVector `
  * `public abstract class WritableColumnVector implements ColumnVector `
  * `public final class ArrowColumnVector implements ColumnVector `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20184: [SPARK-22987][Core] UnsafeExternalSorter cases OOM when ...

2018-01-16 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/20184
  
> I think that a lazy buffer allocation cannot thoroughly solve this problem because 
UnsafeSorterSpillReader has a BufferedFileInputStream which will allocate off-heap memory.

Can you please explain more? From my understanding, the off-heap memory in 
`BufferedFileInputStream` is the key issue for your scenario here. I don't think the 
logic you changed in `ChainedIterator` matters much. So a lazy allocation of off-heap 
memory should be enough, IIUC.
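
A stripped-down sketch of the lazy-allocation idea under discussion: defer creating the read
buffer until the first read, so spill readers that are never consumed hold no memory. The
class and the use of a direct ByteBuffer as the off-heap stand-in are illustrative, not the
actual UnsafeSorterSpillReader code:

    import java.nio.ByteBuffer

    class LazySpillReader(bufferSizeBytes: Int) {
      // Nothing is allocated until loadNext() is first called.
      private lazy val buffer: ByteBuffer = ByteBuffer.allocateDirect(bufferSizeBytes)

      def loadNext(): Unit = {
        buffer.clear()
        // ... fill the buffer from the spill file ...
      }
    }

    // Constructing many readers up front is now cheap; memory is paid only once a reader is read.
    val readers = Seq.fill(100)(new LazySpillReader(1024 * 1024))
    readers.head.loadNext()   // only this reader allocates its 1 MB direct buffer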


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20257
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20257
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86244/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20257
  
**[Test build #86244 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86244/testReport)**
 for PR 20257 at commit 
[`e57d9ee`](https://github.com/apache/spark/commit/e57d9ee605256731f4514f8d381f985696ff8dea).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20289: Fix merge between 07ae39d0ec and 1667057851

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20289
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86239/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20285
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20285
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86221/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20289: Fix merge between 07ae39d0ec and 1667057851

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20289
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20289: Fix merge between 07ae39d0ec and 1667057851

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20289
  
**[Test build #86239 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86239/testReport)**
 for PR 20289 at commit 
[`baee3c2`](https://github.com/apache/spark/commit/baee3c23c7383dc6c47cbc338c17dbaa69bd42dd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class KafkaSourceSuiteBase extends KafkaSourceTest `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20285
  
**[Test build #86221 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86221/testReport)**
 for PR 20285 at commit 
[`85d0db0`](https://github.com/apache/spark/commit/85d0db07641d8d39a87129995367efad44dba56f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...

2018-01-16 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20257
  
@MLnick @WeichenXu123 Your comments are addressed. Please check this again. 
Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20277
  
**[Test build #86246 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86246/testReport)**
 for PR 20277 at commit 
[`f3f9d5e`](https://github.com/apache/spark/commit/f3f9d5e6e696c9d4a055a3fef57de45dd67c3ac4).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20277
  
**[Test build #86245 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86245/testReport)**
 for PR 20277 at commit 
[`77e8c4b`](https://github.com/apache/spark/commit/77e8c4b558fb039d2f8f1cc08fa8eef7a578105e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r161957896
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -33,18 +33,6 @@
   private final ArrowVectorAccessor accessor;
   private ArrowColumnVector[] childColumns;
 
-  private void ensureAccessible(int index) {
-ensureAccessible(index, 1);
-  }
-
-  private void ensureAccessible(int index, int count) {
--- End diff --

ColumnVector is a performance-critical place; we don't need index checking here, just like 
the other column vector implementations.
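
The trade-off behind dropping `ensureAccessible`, sketched with illustrative classes (not the
actual ArrowColumnVector code): the explicit per-element check adds a branch and an exception
path on the hottest read loop, while the unchecked style trusts callers to stay within
`numRows`, as the other column vector implementations do. The JVM still bounds-checks the
backing array; what goes away is the extra explicit check on every call.

    // Checked accessor: every read pays an extra branch.
    final class CheckedIntVector(data: Array[Int], numRows: Int) {
      def getInt(rowId: Int): Int = {
        if (rowId < 0 || rowId >= numRows) {
          throw new IndexOutOfBoundsException(s"rowId $rowId out of [0, $numRows)")
        }
        data(rowId)
      }
    }

    // Unchecked accessor: callers are trusted to pass valid row ids.
    final class UncheckedIntVector(data: Array[Int]) {
      def getInt(rowId: Int): Int = data(rowId)
    }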


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20257
  
**[Test build #86244 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86244/testReport)**
 for PR 20257 at commit 
[`e57d9ee`](https://github.com/apache/spark/commit/e57d9ee605256731f4514f8d381f985696ff8dea).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20277
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86240/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20277
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20277
  
**[Test build #86240 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86240/testReport)**
 for PR 20277 at commit 
[`08d06a7`](https://github.com/apache/spark/commit/08d06a7a6cc454bc542e5b6e05dd32d8a6dd914c).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19247: [Spark-21996][SQL] read files with space in name for str...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19247
  
**[Test build #86243 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86243/testReport)**
 for PR 19247 at commit 
[`10106b3`](https://github.com/apache/spark/commit/10106b3213da1abd205c993a40f4ce025234f3e1).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20277
  
**[Test build #86242 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86242/testReport)**
 for PR 20277 at commit 
[`bc6d0af`](https://github.com/apache/spark/commit/bc6d0af8d8bc13f8e538f6ea67b38d2f718186de).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20277
  
@hvanhovell good idea. I ran the `ColumnarBatchBenchmark` and found that `getArray` has a 
regression since it's not final anymore. I've reverted the interface stuff and 
`ColumnVector` is still an abstract class now.
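
The mechanism behind that regression, in a tiny illustrative shape (not the real ColumnVector):
when a hot method on the base type is `final`, the JIT can bind and inline it at the call site;
once it becomes an overridable method on an interface, each call may go through virtual
dispatch, which is what the benchmark picked up.

    abstract class ColVec {
      protected def offsetAt(rowId: Int): Int
      protected def lengthAt(rowId: Int): Int

      // Kept final so call sites can be devirtualized and inlined by the JIT.
      final def getArray(rowId: Int): (Int, Int) = (offsetAt(rowId), lengthAt(rowId))
    }

    class OnHeapColVec(offsets: Array[Int], lengths: Array[Int]) extends ColVec {
      override protected def offsetAt(rowId: Int): Int = offsets(rowId)
      override protected def lengthAt(rowId: Int): Int = lengths(rowId)
    }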


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19247: [Spark-21996][SQL] read files with space in name for str...

2018-01-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19247
  
**[Test build #86241 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86241/testReport)**
 for PR 19247 at commit 
[`04c2b14`](https://github.com/apache/spark/commit/04c2b1443710609817c21d4707d589b6fc1af2de).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class AddTextFileData(content: String, src: File, tmp: File, 
tmpFileNamePrefix: String = \"text\")`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19247: [Spark-21996][SQL] read files with space in name for str...

2018-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19247
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86241/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


