[GitHub] spark issue #20191: [SPARK-22997] Add additional defenses against use of fre...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/20191 jenkins retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20191: [SPARK-22997] Add additional defenses against use of fre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20191 **[Test build #85895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85895/testReport)** for PR 20191 at commit [`a7f8c07`](https://github.com/apache/spark/commit/a7f8c07fb5158f39bbb6cc1f23cfb13a0d473536).
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85888 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85888/testReport)** for PR 20096 at commit [`9158af2`](https://github.com/apache/spark/commit/9158af23dfff674641143b892f0f1093814035a3).
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/19885 @steveloughran can you review the added system test cases?
[GitHub] spark issue #19290: [SPARK-22063][R] Fixes lint check failures in R by lates...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 I believe the issue with testthat he meant is that the tests fail to run once testthat 2.0.0 is released. Just as a reminder, I rushed to update `appveyor.yml` to pin a fixed version in https://github.com/apache/spark/pull/20003, as the new release broke the tests. There are some details there worth remembering.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85875/ Test PASSed.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Merged build finished. Test PASSed.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19885 Merged build finished. Test FAILed.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19885 **[Test build #85889 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85889/testReport)** for PR 19885 at commit [`31cfb88`](https://github.com/apache/spark/commit/31cfb88acfd17a3ce4481fcf32f6f2470a932bc1). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19885 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85889/ Test FAILed.
[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20211 Merged build finished. Test PASSed.
[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20211 **[Test build #85879 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85879/testReport)** for PR 20211 at commit [`46dc9e1`](https://github.com/apache/spark/commit/46dc9e18f36dc14915e87ba206dd0614d0618dad). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20213 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85890/ Test PASSed.
[GitHub] spark issue #20150: [SPARK-22956][SS] Bug fix for 2 streams union failover s...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20150 @xuanyuanking could you post the full stack trace for this issue?
[GitHub] spark issue #20174: [SPARK-22951][SQL] aggregate should not produce empty ro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20174 **[Test build #85883 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85883/testReport)** for PR 20174 at commit [`9fd817d`](https://github.com/apache/spark/commit/9fd817ddbaec971ba017c619823c01cfc21d015b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20174: [SPARK-22951][SQL] aggregate should not produce empty ro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20174 Merged build finished. Test PASSed.
[GitHub] spark issue #20174: [SPARK-22951][SQL] aggregate should not produce empty ro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20174 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85883/ Test PASSed.
[GitHub] spark issue #20196: [SPARK-23000] Fix Flaky test suite DataSourceWithHiveMet...
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20196 @gatorsmile unfortunately, this is still failing: https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160569877 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -237,85 +378,67 @@ class KafkaSourceSuite extends KafkaSourceTest { } } - test("(de)serialization of initial offsets") { + test("KafkaSource with watermark") { +val now = System.currentTimeMillis() val topic = newTopic() -testUtils.createTopic(topic, partitions = 64) +testUtils.createTopic(newTopic(), partitions = 1) +testUtils.sendMessages(topic, Array(1).map(_.toString)) -val reader = spark +val kafka = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", testUtils.brokerAddress) + .option("kafka.metadata.max.age.ms", "1") + .option("startingOffsets", s"earliest") .option("subscribe", topic) + .load() -testStream(reader.load)( - makeSureGetOffsetCalled, - StopStream, - StartStream(), - StopStream) +val windowedAggregation = kafka + .withWatermark("timestamp", "10 seconds") + .groupBy(window($"timestamp", "5 seconds") as 'window) + .agg(count("*") as 'count) + .select($"window".getField("start") as 'window, $"count") + +val query = windowedAggregation + .writeStream + .format("memory") + .outputMode("complete") + .queryName("kafkaWatermark") + .start() +query.processAllAvailable() +val rows = spark.table("kafkaWatermark").collect() +assert(rows.length === 1, s"Unexpected results: ${rows.toList}") +val row = rows(0) +// We cannot check the exact window start time as it depands on the time that messages were +// inserted by the producer. So here we just use a low bound to make sure the internal +// conversion works. 
+assert( + row.getAs[java.sql.Timestamp]("window").getTime >= now - 5 * 1000, + s"Unexpected results: $row") +assert(row.getAs[Int]("count") === 1, s"Unexpected results: $row") +query.stop() } +} - test("maxOffsetsPerTrigger") { +class KafkaSourceSuiteBase extends KafkaSourceTest { + + import testImplicits._ + + test("(de)serialization of initial offsets") { val topic = newTopic() -testUtils.createTopic(topic, partitions = 3) -testUtils.sendMessages(topic, (100 to 200).map(_.toString).toArray, Some(0)) -testUtils.sendMessages(topic, (10 to 20).map(_.toString).toArray, Some(1)) -testUtils.sendMessages(topic, Array("1"), Some(2)) +testUtils.createTopic(topic, partitions = 5) val reader = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", testUtils.brokerAddress) - .option("kafka.metadata.max.age.ms", "1") - .option("maxOffsetsPerTrigger", 10) .option("subscribe", topic) - .option("startingOffsets", "earliest") -val kafka = reader.load() - .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") - .as[(String, String)] -val mapped: org.apache.spark.sql.Dataset[_] = kafka.map(kv => kv._2.toInt) - -val clock = new StreamManualClock - -val waitUntilBatchProcessed = AssertOnQuery { q => - eventually(Timeout(streamingTimeout)) { -if (!q.exception.isDefined) { - assert(clock.isStreamWaitingAt(clock.getTimeMillis())) -} - } - if (q.exception.isDefined) { -throw q.exception.get - } - true -} -testStream(mapped)( - StartStream(ProcessingTime(100), clock), - waitUntilBatchProcessed, - // 1 from smallest, 1 from middle, 8 from biggest - CheckAnswer(1, 10, 100, 101, 102, 103, 104, 105, 106, 107), - AdvanceManualClock(100), - waitUntilBatchProcessed, - // smallest now empty, 1 more from middle, 9 more from biggest - CheckAnswer(1, 10, 100, 101, 102, 103, 104, 105, 106, 107, -11, 108, 109, 110, 111, 112, 113, 114, 115, 116 - ), +testStream(reader.load)( + makeSureGetOffsetCalled, StopStream, - StartStream(ProcessingTime(100), clock), - waitUntilBatchProcessed, 
- // smallest now empty, 1 more from middle, 9 more from biggest - CheckAnswer(1, 10, 100, 101, 102, 103, 104, 105, 106, 107, -11, 108, 109, 110, 111, 112, 113, 114, 115, 116, -12, 117, 118, 119, 120, 121, 122, 123, 124, 125 - ), - AdvanceManualClock(100), - waitUntilBatchProcessed, - //
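The `maxOffsetsPerTrigger` expectations in the hunk above (1 record from the smallest partition, 1 from the middle, 8 from the biggest, for a budget of 10) follow from splitting the per-trigger budget across partitions in proportion to each partition's available offsets, with at least one offset for any lagging partition. A rough sketch of that allocation in plain Python (hypothetical helper, not the actual KafkaSource rate-limiting code):

```python
def rate_limit(available, limit):
    """Split `limit` offsets across partitions proportionally to how many
    offsets each partition has available (its lag). Partitions with any
    lag receive at least one offset, so no partition starves."""
    total = sum(available.values())
    if total <= limit:
        # Budget covers everything available; consume it all.
        return dict(available)
    return {p: max(1, int(limit * n / total)) for p, n in available.items()}

# Partitions with 101, 11, and 1 offsets available, and a budget of 10,
# mirroring the first batch of the quoted test.
alloc = rate_limit({"p0": 101, "p1": 11, "p2": 1}, 10)
# -> {'p0': 8, 'p1': 1, 'p2': 1}
```

This reproduces the quoted CheckAnswer: 8 from the biggest partition, 1 each from the other two.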
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19885 **[Test build #85889 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85889/testReport)** for PR 19885 at commit [`31cfb88`](https://github.com/apache/spark/commit/31cfb88acfd17a3ce4481fcf32f6f2470a932bc1).
[GitHub] spark pull request #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pa...
GitHub user BryanCutler opened a pull request: https://github.com/apache/spark/pull/20213 [SPARK-23018][PYTHON] Fix createDataFrame from Pandas timestamp series assignment ## What changes were proposed in this pull request? This fixes createDataFrame from Pandas to only assign modified timestamp series back to a copied version of the Pandas DataFrame. Previously, if the Pandas DataFrame was only a reference (e.g. a slice of another), each series would still be assigned back to the reference even if it was not a modified timestamp column. This caused the warning "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame." ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/BryanCutler/spark pyspark-createDataFrame-copy-slice-warn-SPARK-23018 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20213.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20213 commit bdeead620783df3d5b39897cba7001105b2816a7 Author: Bryan Cutler Date: 2018-01-09T23:51:25Z Changed createDataFrame to only assign series if modified timestamp field
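The copy-before-assign pattern the PR description refers to can be sketched in plain pandas. This is an illustrative sketch, not the actual `session.py` code; `convert_timestamps` and the tz-stripping conversion are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

def convert_timestamps(pdf, timestamp_cols):
    """Return a DataFrame with timestamp columns converted to tz-naive.

    Assigning into a copy, rather than into the possibly-sliced input,
    avoids pandas' SettingWithCopyWarning.
    """
    copied = False
    for col in timestamp_cols:
        s = pdf[col].dt.tz_localize(None)  # stand-in for the real conversion
        if not copied:
            pdf = pdf.copy()  # copy once, and only if something is modified
            copied = True
        pdf[col] = s
    return pdf

pdf = pd.DataFrame({"ts": pd.date_range("2018-01-09", periods=3, tz="UTC"),
                    "x": np.arange(3)})
out = convert_timestamps(pdf[:2], ["ts"])  # a slice, as in the bug report
```

Copying once, and only when a column is actually modified, keeps the no-timestamp case allocation-free while silencing the warning for sliced inputs.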
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20213 **[Test build #85890 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85890/testReport)** for PR 20213 at commit [`bdeead6`](https://github.com/apache/spark/commit/bdeead620783df3d5b39897cba7001105b2816a7).
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85875 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85875/testReport)** for PR 20096 at commit [`f825155`](https://github.com/apache/spark/commit/f8251552398f980768b23059c1bbbd028cfee859). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20213 ping @ueshin @HyukjinKwon
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20213 repro to get the warning (this is the non-Arrow code path)

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.rand(100, 2))
df = spark.createDataFrame(pdf[:10])
'''
/home/bryan/git/spark/python/pyspark/sql/session.py:476: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  pdf[column] = s
'''
```
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160550392 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -237,85 +378,67 @@ class KafkaSourceSuite extends KafkaSourceTest { } } - test("(de)serialization of initial offsets") { + test("KafkaSource with watermark") { +val now = System.currentTimeMillis() val topic = newTopic() -testUtils.createTopic(topic, partitions = 64) +testUtils.createTopic(newTopic(), partitions = 1) +testUtils.sendMessages(topic, Array(1).map(_.toString)) -val reader = spark +val kafka = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", testUtils.brokerAddress) + .option("kafka.metadata.max.age.ms", "1") + .option("startingOffsets", s"earliest") .option("subscribe", topic) + .load() -testStream(reader.load)( - makeSureGetOffsetCalled, - StopStream, - StartStream(), - StopStream) +val windowedAggregation = kafka + .withWatermark("timestamp", "10 seconds") + .groupBy(window($"timestamp", "5 seconds") as 'window) + .agg(count("*") as 'count) + .select($"window".getField("start") as 'window, $"count") + +val query = windowedAggregation + .writeStream + .format("memory") + .outputMode("complete") + .queryName("kafkaWatermark") + .start() +query.processAllAvailable() +val rows = spark.table("kafkaWatermark").collect() +assert(rows.length === 1, s"Unexpected results: ${rows.toList}") +val row = rows(0) +// We cannot check the exact window start time as it depands on the time that messages were +// inserted by the producer. So here we just use a low bound to make sure the internal +// conversion works. 
+assert( + row.getAs[java.sql.Timestamp]("window").getTime >= now - 5 * 1000, + s"Unexpected results: $row") +assert(row.getAs[Int]("count") === 1, s"Unexpected results: $row") +query.stop() } +} - test("maxOffsetsPerTrigger") { +class KafkaSourceSuiteBase extends KafkaSourceTest { + + import testImplicits._ + + test("(de)serialization of initial offsets") { --- End diff -- Is this needed in the common KafkaSourceSuiteBase? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
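The watermark test in the hunk above buckets a single event into a 5-second window and then lower-bounds the window start, since the exact producer insertion time is unknown. The window arithmetic can be illustrated in plain Python; this is a conceptual sketch of event-time windowing with a fixed allowed lateness, not Spark's implementation, and all names are hypothetical:

```python
from collections import Counter

def window_start(ts_ms, window_ms=5000):
    """Floor an event-time timestamp (ms) to the start of its tumbling window."""
    return ts_ms - (ts_ms % window_ms)

def windowed_counts(events_ms, watermark_ms, delay_ms=10000):
    """Count events per 5-second window, dropping events that fall behind
    the watermark minus the allowed lateness, roughly as a watermark would."""
    threshold = watermark_ms - delay_ms
    return Counter(window_start(t) for t in events_ms if t >= threshold)

# Watermark at 17.1s with 10s allowed lateness: the event at 1.0s is too late.
counts = windowed_counts([12_300, 13_900, 17_100, 1_000], watermark_ms=17_100)
```

A watermark of roughly `max event time - delay` is what lets the engine eventually finalize windows; here it simply filters out events older than the threshold before counting.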
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160559942 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSinkSuite.scala --- @@ -0,0 +1,348 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Locale +import java.util.concurrent.atomic.AtomicInteger + +import org.apache.kafka.clients.producer.ProducerConfig +import org.apache.kafka.common.serialization.ByteArraySerializer +import org.scalatest.time.SpanSugar._ +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.{AnalysisException, DataFrame, Row, SaveMode} +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, SpecificInternalRow, UnsafeProjection} +import org.apache.spark.sql.streaming._ +import org.apache.spark.sql.types.{BinaryType, DataType} +import org.apache.spark.util.Utils + +/** + * This is a temporary port of KafkaSinkSuite, since we do not yet have a V2 memory stream. + * Once we have one, this will be changed to a specialization of KafkaSinkSuite and we won't have + * to duplicate all the code. 
+ */ +class KafkaContinuousSinkSuite extends KafkaContinuousTest { + import testImplicits._ + + override val streamingTimeout = 30.seconds + + override def beforeAll(): Unit = { +super.beforeAll() +testUtils = new KafkaTestUtils( + withBrokerProps = Map("auto.create.topics.enable" -> "false")) +testUtils.setup() + } + + override def afterAll(): Unit = { +if (testUtils != null) { + testUtils.teardown() + testUtils = null +} +super.afterAll() + } + + test("streaming - write to kafka with topic field") { +val inputTopic = newTopic() +testUtils.createTopic(inputTopic, partitions = 1) + +val input = spark + .readStream + .format("kafka") + .option("kafka.bootstrap.servers", testUtils.brokerAddress) + .option("subscribe", inputTopic) + .option("startingOffsets", "earliest") + .load() + +val topic = newTopic() +testUtils.createTopic(topic) + +val writer = createKafkaWriter( + input.toDF(), + withTopic = None, + withOutputMode = Some(OutputMode.Append))( + withSelectExpr = s"'$topic' as topic", "value") + +val reader = createKafkaReader(topic) + .selectExpr("CAST(key as STRING) key", "CAST(value as STRING) value") + .selectExpr("CAST(key as INT) key", "CAST(value as INT) value") + .as[(Int, Int)] + .map(_._2) + +try { + testUtils.sendMessages(inputTopic, Array("1", "2", "3", "4", "5")) + failAfter(streamingTimeout) { +writer.processAllAvailable() + } + checkDatasetUnorderly(reader, 1, 2, 3, 4, 5) + testUtils.sendMessages(inputTopic, Array("6", "7", "8", "9", "10")) + failAfter(streamingTimeout) { +writer.processAllAvailable() + } + checkDatasetUnorderly(reader, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) +} finally { + writer.stop() +} + } + + test("streaming - write data with bad schema") { --- End diff -- missing tests for ."w/o topic field, with topic option" and "topic field and topic option". and also test for the case when topic field is null. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160549136 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -49,28 +52,37 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext { override val streamingTimeout = 30.seconds + protected val brokerProps = Map[String, Object]() + override def beforeAll(): Unit = { super.beforeAll() -testUtils = new KafkaTestUtils +testUtils = new KafkaTestUtils(brokerProps) testUtils.setup() } override def afterAll(): Unit = { if (testUtils != null) { testUtils.teardown() testUtils = null - super.afterAll() } +super.afterAll() } protected def makeSureGetOffsetCalled = AssertOnQuery { q => // Because KafkaSource's initialPartitionOffsets is set lazily, we need to make sure -// its "getOffset" is called before pushing any data. Otherwise, because of the race contion, +// its "getOffset" is called before pushing any data. Otherwise, because of the race contOOion, --- End diff -- spelling mistake? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160516280 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Properties +import java.util.concurrent.atomic.AtomicInteger + +import org.scalatest.time.SpanSugar._ +import scala.collection.mutable +import scala.util.Random + +import org.apache.spark.SparkContext +import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, Row} +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation +import org.apache.spark.sql.execution.streaming.StreamExecution +import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution +import org.apache.spark.sql.streaming.{StreamTest, Trigger} +import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession} + +trait KafkaContinuousTest extends KafkaSourceTest { + override val defaultTrigger = Trigger.Continuous(1000) + override val defaultUseV2Sink = true + + // We need more than the default local[2] to be able to schedule all partitions simultaneously. 
+ override protected def createSparkSession = new TestSparkSession( +new SparkContext( + "local[10]", + "continuous-stream-test-sql-context", + sparkConf.set("spark.sql.testkey", "true"))) + + override protected def setTopicPartitions( --- End diff -- Add comment on what this method does. It is asserting something, so does not look like it only "sets" something. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160546593 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Properties +import java.util.concurrent.atomic.AtomicInteger + +import org.scalatest.time.SpanSugar._ +import scala.collection.mutable +import scala.util.Random + +import org.apache.spark.SparkContext +import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, Row} +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation +import org.apache.spark.sql.execution.streaming.StreamExecution +import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution +import org.apache.spark.sql.streaming.{StreamTest, Trigger} +import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession} + +trait KafkaContinuousTest extends KafkaSourceTest { + override val defaultTrigger = Trigger.Continuous(1000) + override val defaultUseV2Sink = true + + // We need more than the default local[2] to be able to schedule all partitions simultaneously. 
+ override protected def createSparkSession = new TestSparkSession( +new SparkContext( + "local[10]", + "continuous-stream-test-sql-context", + sparkConf.set("spark.sql.testkey", "true"))) + + // In addition to setting the partitions in Kafka, we have to wait until the query has + // reconfigured to the new count so the test framework can hook in properly. + override protected def setTopicPartitions( + topic: String, newCount: Int, query: StreamExecution) = { +testUtils.addPartitions(topic, newCount) +eventually(timeout(streamingTimeout)) { + assert( +query.lastExecution.logical.collectFirst { + case DataSourceV2Relation(_, r: KafkaContinuousReader) => r +}.exists(_.knownPartitions.size == newCount), +s"query never reconfigured to $newCount partitions") +} + } + + test("ensure continuous stream is being used") { +val query = spark.readStream + .format("rate") + .option("numPartitions", "1") + .option("rowsPerSecond", "1") + .load() + +testStream(query)( + Execute(q => assert(q.isInstanceOf[ContinuousExecution])) +) + } +} + +class KafkaContinuousSourceSuite extends KafkaSourceSuiteBase with KafkaContinuousTest { --- End diff -- The `{ }` may not be needed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160546455 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Properties +import java.util.concurrent.atomic.AtomicInteger + +import org.scalatest.time.SpanSugar._ +import scala.collection.mutable +import scala.util.Random + +import org.apache.spark.SparkContext +import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, Row} +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation +import org.apache.spark.sql.execution.streaming.StreamExecution +import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution +import org.apache.spark.sql.streaming.{StreamTest, Trigger} +import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession} + +trait KafkaContinuousTest extends KafkaSourceTest { + override val defaultTrigger = Trigger.Continuous(1000) + override val defaultUseV2Sink = true + + // We need more than the default local[2] to be able to schedule all partitions simultaneously. 
+ override protected def createSparkSession = new TestSparkSession( +new SparkContext( + "local[10]", + "continuous-stream-test-sql-context", + sparkConf.set("spark.sql.testkey", "true"))) + + // In addition to setting the partitions in Kafka, we have to wait until the query has + // reconfigured to the new count so the test framework can hook in properly. + override protected def setTopicPartitions( + topic: String, newCount: Int, query: StreamExecution) = { +testUtils.addPartitions(topic, newCount) +eventually(timeout(streamingTimeout)) { + assert( +query.lastExecution.logical.collectFirst { + case DataSourceV2Relation(_, r: KafkaContinuousReader) => r +}.exists(_.knownPartitions.size == newCount), +s"query never reconfigured to $newCount partitions") +} + } + + test("ensure continuous stream is being used") { +val query = spark.readStream + .format("rate") + .option("numPartitions", "1") + .option("rowsPerSecond", "1") + .load() + +testStream(query)( + Execute(q => assert(q.isInstanceOf[ContinuousExecution])) +) + } +} + +class KafkaContinuousSourceSuite extends KafkaSourceSuiteBase with KafkaContinuousTest { --- End diff -- Add docs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
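The `eventually(timeout(streamingTimeout)) { assert(...) }` idiom in the review comment above polls a condition until it holds or a deadline passes, which is how the test waits for the query to reconfigure to the new partition count. A minimal Python sketch of the same retry idiom — `FakeReader` and the helper names are illustrative stand-ins, not Spark APIs:

```python
import time

def eventually(condition, timeout_s=30.0, interval_s=0.1):
    """Poll `condition` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    last_error = None
    while time.monotonic() < deadline:
        try:
            if condition():
                return
        except AssertionError as e:
            last_error = e  # keep the most recent failure for the final message
        time.sleep(interval_s)
    raise AssertionError(f"condition never became true within {timeout_s}s: {last_error}")

# Example: wait until a (simulated) reader has picked up the new partition count.
class FakeReader:
    def __init__(self):
        self.known_partitions = 1

reader = FakeReader()
reader.known_partitions = 5  # in the real test, Kafka repartitioning drives this
eventually(lambda: reader.known_partitions == 5, timeout_s=1.0)
```

The key design point, as in the Scala test, is that the assertion is retried rather than evaluated once, since reconfiguration happens asynchronously.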
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160546427 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Properties +import java.util.concurrent.atomic.AtomicInteger + +import org.scalatest.time.SpanSugar._ +import scala.collection.mutable +import scala.util.Random + +import org.apache.spark.SparkContext +import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, Row} +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation +import org.apache.spark.sql.execution.streaming.StreamExecution +import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution +import org.apache.spark.sql.streaming.{StreamTest, Trigger} +import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession} + +trait KafkaContinuousTest extends KafkaSourceTest { --- End diff -- Add docs to explain what this class if for. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160552695 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Properties +import java.util.concurrent.atomic.AtomicInteger + +import org.scalatest.time.SpanSugar._ +import scala.collection.mutable +import scala.util.Random + +import org.apache.spark.SparkContext +import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, Row} +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation +import org.apache.spark.sql.execution.streaming.StreamExecution +import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution +import org.apache.spark.sql.streaming.{StreamTest, Trigger} +import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession} + +trait KafkaContinuousTest extends KafkaSourceTest { --- End diff -- Also since this is used not just by the source, but also the sink, better to define this in a different file. 
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160552160 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -977,20 +971,8 @@ class KafkaSourceStressForDontFailOnDataLossSuite extends StreamTest with Shared } } - test("stress test for failOnDataLoss=false") { -val reader = spark - .readStream - .format("kafka") - .option("kafka.bootstrap.servers", testUtils.brokerAddress) - .option("kafka.metadata.max.age.ms", "1") - .option("subscribePattern", "failOnDataLoss.*") - .option("startingOffsets", "earliest") - .option("failOnDataLoss", "false") - .option("fetchOffset.retryIntervalMs", "3000") -val kafka = reader.load() - .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") - .as[(String, String)] -val query = kafka.map(kv => kv._2.toInt).writeStream.foreach(new ForeachWriter[Int] { + protected def startStream(ds: Dataset[Int]) = { --- End diff -- i think this factoring is not needed. `startStream()` is not used anywhere else other than in this test. So i dont see a point of refactoring it to define it outside the test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160552554 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,135 @@ +/* --- End diff -- Rename this file to KafkaContinuousSourceSuite --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160550486 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -237,85 +378,67 @@ class KafkaSourceSuite extends KafkaSourceTest { } } - test("(de)serialization of initial offsets") { + test("KafkaSource with watermark") { +val now = System.currentTimeMillis() val topic = newTopic() -testUtils.createTopic(topic, partitions = 64) +testUtils.createTopic(newTopic(), partitions = 1) +testUtils.sendMessages(topic, Array(1).map(_.toString)) -val reader = spark +val kafka = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", testUtils.brokerAddress) + .option("kafka.metadata.max.age.ms", "1") + .option("startingOffsets", s"earliest") .option("subscribe", topic) + .load() -testStream(reader.load)( - makeSureGetOffsetCalled, - StopStream, - StartStream(), - StopStream) +val windowedAggregation = kafka + .withWatermark("timestamp", "10 seconds") + .groupBy(window($"timestamp", "5 seconds") as 'window) + .agg(count("*") as 'count) + .select($"window".getField("start") as 'window, $"count") + +val query = windowedAggregation + .writeStream + .format("memory") + .outputMode("complete") + .queryName("kafkaWatermark") + .start() +query.processAllAvailable() +val rows = spark.table("kafkaWatermark").collect() +assert(rows.length === 1, s"Unexpected results: ${rows.toList}") +val row = rows(0) +// We cannot check the exact window start time as it depands on the time that messages were +// inserted by the producer. So here we just use a low bound to make sure the internal +// conversion works. 
+assert( + row.getAs[java.sql.Timestamp]("window").getTime >= now - 5 * 1000, + s"Unexpected results: $row") +assert(row.getAs[Int]("count") === 1, s"Unexpected results: $row") +query.stop() } +} - test("maxOffsetsPerTrigger") { +class KafkaSourceSuiteBase extends KafkaSourceTest { + + import testImplicits._ + + test("(de)serialization of initial offsets") { val topic = newTopic() -testUtils.createTopic(topic, partitions = 3) -testUtils.sendMessages(topic, (100 to 200).map(_.toString).toArray, Some(0)) -testUtils.sendMessages(topic, (10 to 20).map(_.toString).toArray, Some(1)) -testUtils.sendMessages(topic, Array("1"), Some(2)) +testUtils.createTopic(topic, partitions = 5) val reader = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", testUtils.brokerAddress) - .option("kafka.metadata.max.age.ms", "1") - .option("maxOffsetsPerTrigger", 10) .option("subscribe", topic) - .option("startingOffsets", "earliest") -val kafka = reader.load() - .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") - .as[(String, String)] -val mapped: org.apache.spark.sql.Dataset[_] = kafka.map(kv => kv._2.toInt) - -val clock = new StreamManualClock - -val waitUntilBatchProcessed = AssertOnQuery { q => - eventually(Timeout(streamingTimeout)) { -if (!q.exception.isDefined) { - assert(clock.isStreamWaitingAt(clock.getTimeMillis())) -} - } - if (q.exception.isDefined) { -throw q.exception.get - } - true -} -testStream(mapped)( - StartStream(ProcessingTime(100), clock), - waitUntilBatchProcessed, - // 1 from smallest, 1 from middle, 8 from biggest - CheckAnswer(1, 10, 100, 101, 102, 103, 104, 105, 106, 107), - AdvanceManualClock(100), - waitUntilBatchProcessed, - // smallest now empty, 1 more from middle, 9 more from biggest - CheckAnswer(1, 10, 100, 101, 102, 103, 104, 105, 106, 107, -11, 108, 109, 110, 111, 112, 113, 114, 115, 116 - ), +testStream(reader.load)( + makeSureGetOffsetCalled, StopStream, - StartStream(ProcessingTime(100), clock), - waitUntilBatchProcessed, 
- // smallest now empty, 1 more from middle, 9 more from biggest - CheckAnswer(1, 10, 100, 101, 102, 103, 104, 105, 106, 107, -11, 108, 109, 110, 111, 112, 113, 114, 115, 116, -12, 117, 118, 119, 120, 121, 122, 123, 124, 125 - ), - AdvanceManualClock(100), - waitUntilBatchProcessed, - // smallest now
[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20211 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85879/ Test PASSed.
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20213 **[Test build #85890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85890/testReport)** for PR 20213 at commit [`bdeead6`](https://github.com/apache/spark/commit/bdeead620783df3d5b39897cba7001105b2816a7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20213 Merged build finished. Test PASSed.
[GitHub] spark issue #20188: [SPARK-22993][ML] Clarify HasCheckpointInterval param do...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20188 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85886/ Test PASSed.
[GitHub] spark issue #20188: [SPARK-22993][ML] Clarify HasCheckpointInterval param do...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20188 **[Test build #85886 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85886/testReport)** for PR 20188 at commit [`c966c0c`](https://github.com/apache/spark/commit/c966c0c9a7aad87f844036f69740cfb803f677ef). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20188: [SPARK-22993][ML] Clarify HasCheckpointInterval param do...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20188 Merged build finished. Test PASSed.
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160569556 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -977,20 +971,8 @@ class KafkaSourceStressForDontFailOnDataLossSuite extends StreamTest with Shared } } - test("stress test for failOnDataLoss=false") { -val reader = spark - .readStream - .format("kafka") - .option("kafka.bootstrap.servers", testUtils.brokerAddress) - .option("kafka.metadata.max.age.ms", "1") - .option("subscribePattern", "failOnDataLoss.*") - .option("startingOffsets", "earliest") - .option("failOnDataLoss", "false") - .option("fetchOffset.retryIntervalMs", "3000") -val kafka = reader.load() - .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") - .as[(String, String)] -val query = kafka.map(kv => kv._2.toInt).writeStream.foreach(new ForeachWriter[Int] { + protected def startStream(ds: Dataset[Int]) = { --- End diff -- startStream is overridden in the continuous version of this test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160570055 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -49,28 +52,37 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext { override val streamingTimeout = 30.seconds + protected val brokerProps = Map[String, Object]() + override def beforeAll(): Unit = { super.beforeAll() -testUtils = new KafkaTestUtils +testUtils = new KafkaTestUtils(brokerProps) testUtils.setup() } override def afterAll(): Unit = { if (testUtils != null) { testUtils.teardown() testUtils = null - super.afterAll() } +super.afterAll() } protected def makeSureGetOffsetCalled = AssertOnQuery { q => // Because KafkaSource's initialPartitionOffsets is set lazily, we need to make sure -// its "getOffset" is called before pushing any data. Otherwise, because of the race contion, +// its "getOffset" is called before pushing any data. Otherwise, because of the race contOOion, --- End diff -- I remember wondering this morning why my command-O key sequence wasn't working... I guess this is where it went. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20023 Build finished. Test FAILed.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20023 **[Test build #85863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85863/testReport)** for PR 20023 at commit [`6701a54`](https://github.com/apache/spark/commit/6701a54068145994e10b8dd38d9a38a1be1f3674). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark issue #18991: [SPARK-21783][SQL][WIP] Turn on ORC filter push-down by ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18991 **[Test build #85868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85868/testReport)** for PR 18991 at commit [`2bc2b17`](https://github.com/apache/spark/commit/2bc2b17aba5231c6ac3e0ab7c830acc56790df9f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18991: [SPARK-21783][SQL][WIP] Turn on ORC filter push-down by ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18991 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85868/ Test PASSed.
[GitHub] spark issue #18991: [SPARK-21783][SQL][WIP] Turn on ORC filter push-down by ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18991 Merged build finished. Test PASSed.
[GitHub] spark issue #20179: [SPARK-22982] Remove unsafe asynchronous close() call fr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20179 **[Test build #85885 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85885/testReport)** for PR 20179 at commit [`9e5f20e`](https://github.com/apache/spark/commit/9e5f20eef09fbd357c7cc8bb19eca7d4bf5f7170).
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160549913 --- Diff: python/pyspark/context.py --- @@ -1023,6 +1032,35 @@ def getConf(self): conf.setAll(self._conf.getAll()) return conf +def install_packages(self, packages, install_driver=True): +""" +install python packages on all executors and driver through pip. pip will be installed +by default no matter using native virtualenv or conda. So it is guaranteed that pip is +available if virtualenv is enabled. +:param packages: string for single package or a list of string for multiple packages +:param install_driver: whether to install packages in client +""" +if self._conf.get("spark.pyspark.virtualenv.enabled") != "true": +raise RuntimeError("install_packages can only use called when " + "spark.pyspark.virtualenv.enabled set as true") +if isinstance(packages, basestring): +packages = [packages] +# seems statusTracker.getExecutorInfos() will return driver + exeuctors, so -1 here. +num_executors = len(self._jsc.sc().statusTracker().getExecutorInfos()) - 1 +dummyRDD = self.parallelize(range(num_executors), num_executors) + +def _run_pip(packages, iterator): +import pip +pip.main(["install"] + packages) + +# run it in the main thread. Will do it in a separated thread after +# https://github.com/pypa/pip/issues/2553 is fixed +if install_driver: +_run_pip(packages, None) + +import functools +dummyRDD.foreachPartition(functools.partial(_run_pip, packages)) --- End diff -- what about making this feature experimental and so improving it gradually ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
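The `install_packages` snippet under discussion binds the package list into the per-partition function with `functools.partial`, then relies on `foreachPartition` over a dummy RDD with one partition per executor to run the install everywhere. A dependency-free sketch of that binding pattern — `fake_pip_install` is a recorder standing in for the `pip.main(["install", ...])` call the real code makes, and the plain loop stands in for `foreachPartition`:

```python
import functools

installed = []

def fake_pip_install(args):
    # Stand-in for pip.main(["install", ...]); records the call instead of installing.
    installed.append(tuple(args))

def _run_pip(packages, iterator):
    # Mirrors the PR's helper: `iterator` is the (unused) partition iterator.
    fake_pip_install(["install"] + packages)

# One "partition" per executor, as the PR approximates with a dummy RDD.
num_executors = 3
partitions = [iter([i]) for i in range(num_executors)]

run = functools.partial(_run_pip, ["numpy", "pandas"])
for part in partitions:  # foreachPartition would invoke this on each executor
    run(part)

assert len(installed) == num_executors
```

The `partial` is what lets the same closure ship to every partition with the package list already baked in, which is why the PR uses it rather than a lambda capturing loop state.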
[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20192 ``` $ mvn clean integration-test -Dspark-distro-tgz=/work/apache/spark/spark-2.3.0-SNAPSHOT-bin-2.7.3.tgz -DextraScalaTestArgs="-Dspark.kubernetes.test.master=k8s://https://192.168.99.100:8443 -Dspark.docker.test.skipBuildImages=true -Dspark.docker.test.driverImage=spark:master -Dspark.docker.test.executorImage=spark:master -Dspark.docker.test.initContainerImage=spark:master -Dspark.docker.test.persistMinikube=true" ... [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @ spark-kubernetes-integration-tests_2.11 --- Discovery starting. Discovery completed in 101 milliseconds. Run starting. Expected test count is: 9 KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi using the remote example jar. - Run SparkPi with custom driver pod name, labels, annotations, and environment variables. - Run SparkPi with a test secret mounted into the driver and executor pods - Run SparkPi using the remote example jar with a test secret mounted into the driver and executor pods - Run PageRank using remote data file Run completed in 3 minutes, 18 seconds. ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20168 **[Test build #85880 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85880/testReport)** for PR 20168 at commit [`48eddf1`](https://github.com/apache/spark/commit/48eddf10c884f9eea368efa9001b9fefd42de9e9). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #85865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85865/testReport)** for PR 20208 at commit [`499801e`](https://github.com/apache/spark/commit/499801e7fdd545ac5918dd5f7a9294db2d5373be). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext ` * `trait AddColumnEvolutionTest extends SchemaEvolutionTest ` * `trait RemoveColumnEvolutionTest extends SchemaEvolutionTest ` * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest ` * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest ` * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest ` * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest ` * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85865/ Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20204 Merged build finished. Test PASSed.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20204 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85862/ Test PASSed.
[GitHub] spark issue #20212: Update rdd-programming-guide.md
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20212 Can you spell-check the rest of the surrounding docs while you're at it? There are often a few more typos around, and it helps to resolve these in fewer PRs instead of one by one.
[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...
Github user foxish commented on the issue: https://github.com/apache/spark/pull/20192 Great, thanks @vanzin. We'll probably need to add a test case using the new option as well - I can take care of that. Thanks for the change.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20204 I usually run tests like this: ``` SPARK_TESTING=1 bin/pyspark pyspark.sql.tests ... ``` Does it make sense to expose similar usage for enabling coverage? Something like ``` SPARK_TESTING_WITH_COVERAGE=1 bin/pyspark pyspark.sql.tests ... ```
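An environment-variable switch like the `SPARK_TESTING_WITH_COVERAGE` suggested above could be honored with a small opt-in gate at test startup. A sketch under those assumptions — the variable name comes from the comment, not from Spark, and `coverage` is the standard coverage.py package, imported only when the flag is set:

```python
import os

def maybe_start_coverage():
    """Start coverage collection only when the opt-in flag is set; return the
    Coverage object so the caller can stop and save it, or None when disabled."""
    if os.environ.get("SPARK_TESTING_WITH_COVERAGE") != "1":
        return None
    import coverage  # coverage.py; only required when the flag is on
    cov = coverage.Coverage()
    cov.start()
    return cov

cov = maybe_start_coverage()
# ... run the test suite here ...
if cov is not None:
    cov.stop()
    cov.save()
```

Gating the import itself means developers who never set the flag don't need coverage.py installed, which matches how the existing `SPARK_TESTING=1` switch costs nothing when unused.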
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20168 **[Test build #85884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85884/testReport)** for PR 20168 at commit [`763c8a6`](https://github.com/apache/spark/commit/763c8a613ddcb0c7960c800b4fb7496964d29b11).
[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20013 Merged build finished. Test PASSed.
[GitHub] spark issue #20188: [SPARK-22993][ML] Clarify HasCheckpointInterval param do...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20188 Actually, in R the setCheckpointDir method is not attached to the SparkContext; I'd leave it as "not set" or "not set in the session". https://spark.apache.org/docs/latest/api/R/setCheckpointDir.html
[GitHub] spark issue #20200: [SPARK-23005][Core] Improve RDD.take on small number of ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20200 LGTM, merging to master/2.3!
[GitHub] spark pull request #20200: [SPARK-23005][Core] Improve RDD.take on small num...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20200
[GitHub] spark issue #20202: [MINOR] fix a typo in BroadcastJoinSuite
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20202 thanks, merging to master/2.3!
[GitHub] spark pull request #20202: [MINOR] fix a typo in BroadcastJoinSuite
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20202
[GitHub] spark issue #20176: [SPARK-22981][SQL] Fix incorrect results of Casting Stru...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20176 Showing binary as string and casting binary to string seem different to me; let's stick with what it is. BTW it's pretty dangerous to change the behavior of cast to be different from Hive.
[GitHub] spark issue #20176: [SPARK-22981][SQL] Fix incorrect results of Casting Stru...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20176 ok, I'll make a follow-up for `showString` later.
[GitHub] spark issue #20179: [SPARK-22982] Remove unsafe asynchronous close() call fr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20179 **[Test build #85885 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85885/testReport)** for PR 20179 at commit [`9e5f20e`](https://github.com/apache/spark/commit/9e5f20eef09fbd357c7cc8bb19eca7d4bf5f7170). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20179: [SPARK-22982] Remove unsafe asynchronous close() call fr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20179 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85885/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20179: [SPARK-22982] Remove unsafe asynchronous close() call fr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20179 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85888 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85888/testReport)** for PR 20096 at commit [`9158af2`](https://github.com/apache/spark/commit/9158af23dfff674641143b892f0f1093814035a3). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85888/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20207: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20207 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20205: [SPARK-16060][SQL][follow-up] add a wrapper solut...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20205#discussion_r160579252

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java ---
@@ -196,17 +234,26 @@ public void initBatch(
    * by copying from ORC VectorizedRowBatch columns to Spark ColumnarBatch columns.
    */
   private boolean nextBatch() throws IOException {
-    for (WritableColumnVector vector : columnVectors) {
-      vector.reset();
-    }
-    columnarBatch.setNumRows(0);
--- End diff --

it's moved to https://github.com/apache/spark/pull/20205/files#diff-e594f7295e5408c01ace8175166313b6R253
[GitHub] spark issue #20207: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20207 **[Test build #85896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85896/testReport)** for PR 20207 at commit [`f239b2b`](https://github.com/apache/spark/commit/f239b2b51f32addf82ad5454e86ecbb9bcbe65a2).
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19885 **[Test build #85897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85897/testReport)** for PR 19885 at commit [`778e1ef`](https://github.com/apache/spark/commit/778e1ef903d46a46f3a389fc9b5bf8038ac7cb71).
[GitHub] spark issue #20207: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20207 **[Test build #85898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85898/testReport)** for PR 20207 at commit [`9039817`](https://github.com/apache/spark/commit/9039817d516f2cfb68f9caa41374098780fde3ca).
[GitHub] spark pull request #20013: [SPARK-20657][core] Speed up rendering of the sta...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20013#discussion_r160580420

--- Diff: core/src/main/scala/org/apache/spark/status/LiveEntity.scala ---
@@ -313,35 +316,68 @@ private class LiveExecutor(val executorId: String, _addTime: Long) extends LiveE
 }
 
-/** Metrics tracked per stage (both total and per executor). */
-private class MetricsTracker {
-  var executorRunTime = 0L
-  var executorCpuTime = 0L
-  var inputBytes = 0L
-  var inputRecords = 0L
-  var outputBytes = 0L
-  var outputRecords = 0L
-  var shuffleReadBytes = 0L
-  var shuffleReadRecords = 0L
-  var shuffleWriteBytes = 0L
-  var shuffleWriteRecords = 0L
-  var memoryBytesSpilled = 0L
-  var diskBytesSpilled = 0L
-
-  def update(delta: v1.TaskMetrics): Unit = {
-    executorRunTime += delta.executorRunTime
-    executorCpuTime += delta.executorCpuTime
-    inputBytes += delta.inputMetrics.bytesRead
-    inputRecords += delta.inputMetrics.recordsRead
-    outputBytes += delta.outputMetrics.bytesWritten
-    outputRecords += delta.outputMetrics.recordsWritten
-    shuffleReadBytes += delta.shuffleReadMetrics.localBytesRead +
-      delta.shuffleReadMetrics.remoteBytesRead
-    shuffleReadRecords += delta.shuffleReadMetrics.recordsRead
-    shuffleWriteBytes += delta.shuffleWriteMetrics.bytesWritten
-    shuffleWriteRecords += delta.shuffleWriteMetrics.recordsWritten
-    memoryBytesSpilled += delta.memoryBytesSpilled
-    diskBytesSpilled += delta.diskBytesSpilled
+private class MetricsTracker(
--- End diff --

ok, but the code becomes uglier because `v1.TaskMetrics` is kind of an annoying type to use.
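The pattern in the `MetricsTracker` diff above, where each completed task contributes a metrics delta that is summed into per-stage totals, can be sketched in a few lines. The names below (`TaskMetricsDelta`, `StageMetrics`) are hypothetical and simplified; the real class tracks many more counters and takes a `v1.TaskMetrics`.

```scala
// Minimal sketch (hypothetical, simplified names) of the delta-accumulation
// pattern: each finished task reports a delta, summed into stage totals.
case class TaskMetricsDelta(runTime: Long, bytesRead: Long, recordsRead: Long)

final class StageMetrics {
  var executorRunTime = 0L
  var inputBytes = 0L
  var inputRecords = 0L

  def update(delta: TaskMetricsDelta): Unit = {
    executorRunTime += delta.runTime
    inputBytes += delta.bytesRead
    inputRecords += delta.recordsRead
  }
}

val stage = new StageMetrics
Seq(TaskMetricsDelta(10, 100, 5), TaskMetricsDelta(20, 50, 2)).foreach(stage.update)
println((stage.executorRunTime, stage.inputBytes, stage.inputRecords)) // (30,150,7)
```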
[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20013 **[Test build #85899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85899/testReport)** for PR 20013 at commit [`3d66505`](https://github.com/apache/spark/commit/3d6650589223cfd93c4ced9fb25246bcd88ca899).
[GitHub] spark pull request #20205: [SPARK-16060][SQL][follow-up] add a wrapper solut...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20205#discussion_r160581341

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java ---
@@ -196,17 +234,26 @@ public void initBatch(
    * by copying from ORC VectorizedRowBatch columns to Spark ColumnarBatch columns.
    */
   private boolean nextBatch() throws IOException {
-    for (WritableColumnVector vector : columnVectors) {
-      vector.reset();
-    }
-    columnarBatch.setNumRows(0);
--- End diff --

Yep. I meant keeping here since we return at line 390 and 240.
[GitHub] spark pull request #20214: [SPARK-23023][SQL] Cast field data to strings in ...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/20214 [SPARK-23023][SQL] Cast field data to strings in showString

## What changes were proposed in this pull request?

The current `Dataset.showString` prints rows through `RowEncoder` deserializers like;
```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------------------------------------------+
|a                                                           |
+------------------------------------------------------------+
|[WrappedArray(1, 2), WrappedArray(3), WrappedArray(4, 5, 6)]|
+------------------------------------------------------------+
```
This result is incorrect because the correct one is;
```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------+
|a                       |
+------------------------+
|[[1, 2], [3], [4, 5, 6]]|
+------------------------+
```
So, this pr fixed code in `showString` to cast field data to strings before printing.

## How was this patch tested?

Added tests in `DataFrameSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark SPARK-23023

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20214.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20214

commit eb56aff74352a360d1d4b1273be23b670f3c958a
Author: Takeshi Yamamuro
Date: 2018-01-06T11:05:54Z

    Cast data to strings in showString
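The root cause described in the PR is that the old path printed the `toString` of the deserialized Scala objects, while the fix renders each field the way a cast to string would. A standalone Scala sketch of that difference, using a hypothetical `render` helper rather than the PR's actual code:

```scala
// Hypothetical standalone illustration of the showString fix: toString of
// deserialized collections leaks collection class names, while rendering
// each field recursively produces the bracketed form the PR aims for.
def render(v: Any): String = v match {
  case s: Seq[_] => s.map(render).mkString("[", ", ", "]")
  case other     => other.toString
}

val data = Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))
println(data.toString)  // old-style: includes collection class names
println(render(data))   // fixed-style: [[1, 2], [3], [4, 5, 6]]
```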
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19885 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85897/
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19885 Merged build finished. Test FAILed.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19885 **[Test build #85897 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85897/testReport)** for PR 19885 at commit [`778e1ef`](https://github.com/apache/spark/commit/778e1ef903d46a46f3a389fc9b5bf8038ac7cb71). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20214 **[Test build #85900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85900/testReport)** for PR 20214 at commit [`eb56aff`](https://github.com/apache/spark/commit/eb56aff74352a360d1d4b1273be23b670f3c958a).
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18853 **[Test build #85846 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85846/testReport)** for PR 18853 at commit [`408e889`](https://github.com/apache/spark/commit/408e889caa8d61b7267f0f391be4af5fde82a0c9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18853 Merged build finished. Test FAILed.
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18853 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85846/
[GitHub] spark issue #20195: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20195 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85842/
[GitHub] spark issue #20195: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20195 Merged build finished. Test PASSed.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r160376096

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala ---
@@ -117,6 +117,7 @@ object DecimalType extends AbstractDataType {
   val MAX_SCALE = 38
   val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18)
   val USER_DEFAULT: DecimalType = DecimalType(10, 0)
+  val MINIMUM_ADJUSTED_SCALE = 6
--- End diff --

how about `spark.sql.decimalOperations.allowTruncate`? Let's leave the mode stuff to the type coercion mode.
[GitHub] spark issue #20195: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20195 **[Test build #85842 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85842/testReport)** for PR 20195 at commit [`f55ace6`](https://github.com/apache/spark/commit/f55ace645b46a429a512eb8e922a7074c4cd8cc0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18853 **[Test build #85849 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85849/testReport)** for PR 18853 at commit [`e763330`](https://github.com/apache/spark/commit/e763330edae88d4dad410214608fb5448d90a989).
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r160376186

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala ---
@@ -117,6 +117,7 @@ object DecimalType extends AbstractDataType {
   val MAX_SCALE = 38
   val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18)
   val USER_DEFAULT: DecimalType = DecimalType(10, 0)
+  val MINIMUM_ADJUSTED_SCALE = 6
--- End diff --

We should make it an internal conf and remove it after some releases.
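For context, `MINIMUM_ADJUSTED_SCALE` bounds how much scale the truncation rule may give up when an arithmetic result overflows the maximum precision. A sketch of the adjustment discussed in this thread, assuming it mirrors the SQL Server rule the PR is modeled on (hypothetical standalone code, not Spark's exact implementation):

```scala
// Sketch of the precision-loss adjustment discussed above. Assumption: this
// follows the SQL Server-style rule; Spark's actual code may differ in detail.
val MaxPrecision = 38
val MinimumAdjustedScale = 6

def adjustPrecisionScale(precision: Int, scale: Int): (Int, Int) =
  if (precision <= MaxPrecision) {
    (precision, scale) // fits as-is, nothing to adjust
  } else {
    // Preserve the integral digits; truncate scale, but never below
    // min(scale, MinimumAdjustedScale).
    val intDigits = precision - scale
    val minScale = math.min(scale, MinimumAdjustedScale)
    val adjustedScale = math.max(MaxPrecision - intDigits, minScale)
    (MaxPrecision, adjustedScale)
  }

// decimal(38, 10) * decimal(9, 0) naively needs precision 48, scale 10;
// the rule truncates it to (38, 6), keeping MinimumAdjustedScale digits.
println(adjustPrecisionScale(48, 10)) // (38,6)
```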
[GitHub] spark issue #20199: [Spark-22967][Hive]Fix VersionSuite's unit tests by chan...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20199 **[Test build #85847 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85847/testReport)** for PR 20199 at commit [`22669d1`](https://github.com/apache/spark/commit/22669d1ff0cb00261fa146d276af237c115a0488). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20189: [SPARK-22975] MetricsReporter should not throw exception...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20189 **[Test build #85850 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85850/testReport)** for PR 20189 at commit [`7242eab`](https://github.com/apache/spark/commit/7242eabe00ce84cb132a4a4f16cb53bed1e6afa7).