[GitHub] spark issue #20191: [SPARK-22997] Add additional defenses against use of fre...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/20191 jenkins retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20191: [SPARK-22997] Add additional defenses against use of fre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20191 **[Test build #85895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85895/testReport)** for PR 20191 at commit [`a7f8c07`](https://github.com/apache/spark/commit/a7f8c07fb5158f39bbb6cc1f23cfb13a0d473536).
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85888 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85888/testReport)** for PR 20096 at commit [`9158af2`](https://github.com/apache/spark/commit/9158af23dfff674641143b892f0f1093814035a3).
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/19885 @steveloughran can you review the added system test cases?
[GitHub] spark issue #19290: [SPARK-22063][R] Fixes lint check failures in R by lates...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 I believe the issue with testthat he meant is that the tests fail to run once testthat 2.0.0 is released. Just as a reminder, I rushed to update `appveyor.yml` to pin a fixed version in https://github.com/apache/spark/pull/20003, as the new release broke the tests. There are some details there worth remembering.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85875/ Test PASSed.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Merged build finished. Test PASSed.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19885 Merged build finished. Test FAILed.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19885 **[Test build #85889 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85889/testReport)** for PR 19885 at commit [`31cfb88`](https://github.com/apache/spark/commit/31cfb88acfd17a3ce4481fcf32f6f2470a932bc1). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19885 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85889/ Test FAILed.
[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20211 Merged build finished. Test PASSed.
[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20211 **[Test build #85879 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85879/testReport)** for PR 20211 at commit [`46dc9e1`](https://github.com/apache/spark/commit/46dc9e18f36dc14915e87ba206dd0614d0618dad). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20213 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85890/ Test PASSed.
[GitHub] spark issue #20150: [SPARK-22956][SS] Bug fix for 2 streams union failover s...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20150 @xuanyuanking could you post the full stack trace for this issue?
[GitHub] spark issue #20174: [SPARK-22951][SQL] aggregate should not produce empty ro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20174 **[Test build #85883 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85883/testReport)** for PR 20174 at commit [`9fd817d`](https://github.com/apache/spark/commit/9fd817ddbaec971ba017c619823c01cfc21d015b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20174: [SPARK-22951][SQL] aggregate should not produce empty ro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20174 Merged build finished. Test PASSed.
[GitHub] spark issue #20174: [SPARK-22951][SQL] aggregate should not produce empty ro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20174 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85883/ Test PASSed.
[GitHub] spark issue #20196: [SPARK-23000] Fix Flaky test suite DataSourceWithHiveMet...
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20196 @gatorsmile unfortunately, this is still failing: https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160569877 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -237,85 +378,67 @@ class KafkaSourceSuite extends KafkaSourceTest { } } - test("(de)serialization of initial offsets") { + test("KafkaSource with watermark") { +val now = System.currentTimeMillis() val topic = newTopic() -testUtils.createTopic(topic, partitions = 64) +testUtils.createTopic(newTopic(), partitions = 1) +testUtils.sendMessages(topic, Array(1).map(_.toString)) -val reader = spark +val kafka = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", testUtils.brokerAddress) + .option("kafka.metadata.max.age.ms", "1") + .option("startingOffsets", s"earliest") .option("subscribe", topic) + .load() -testStream(reader.load)( - makeSureGetOffsetCalled, - StopStream, - StartStream(), - StopStream) +val windowedAggregation = kafka + .withWatermark("timestamp", "10 seconds") + .groupBy(window($"timestamp", "5 seconds") as 'window) + .agg(count("*") as 'count) + .select($"window".getField("start") as 'window, $"count") + +val query = windowedAggregation + .writeStream + .format("memory") + .outputMode("complete") + .queryName("kafkaWatermark") + .start() +query.processAllAvailable() +val rows = spark.table("kafkaWatermark").collect() +assert(rows.length === 1, s"Unexpected results: ${rows.toList}") +val row = rows(0) +// We cannot check the exact window start time as it depands on the time that messages were +// inserted by the producer. So here we just use a low bound to make sure the internal +// conversion works. 
+assert( + row.getAs[java.sql.Timestamp]("window").getTime >= now - 5 * 1000, + s"Unexpected results: $row") +assert(row.getAs[Int]("count") === 1, s"Unexpected results: $row") +query.stop() } +} - test("maxOffsetsPerTrigger") { +class KafkaSourceSuiteBase extends KafkaSourceTest { + + import testImplicits._ + + test("(de)serialization of initial offsets") { val topic = newTopic() -testUtils.createTopic(topic, partitions = 3) -testUtils.sendMessages(topic, (100 to 200).map(_.toString).toArray, Some(0)) -testUtils.sendMessages(topic, (10 to 20).map(_.toString).toArray, Some(1)) -testUtils.sendMessages(topic, Array("1"), Some(2)) +testUtils.createTopic(topic, partitions = 5) val reader = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", testUtils.brokerAddress) - .option("kafka.metadata.max.age.ms", "1") - .option("maxOffsetsPerTrigger", 10) .option("subscribe", topic) - .option("startingOffsets", "earliest") -val kafka = reader.load() - .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") - .as[(String, String)] -val mapped: org.apache.spark.sql.Dataset[_] = kafka.map(kv => kv._2.toInt) - -val clock = new StreamManualClock - -val waitUntilBatchProcessed = AssertOnQuery { q => - eventually(Timeout(streamingTimeout)) { -if (!q.exception.isDefined) { - assert(clock.isStreamWaitingAt(clock.getTimeMillis())) -} - } - if (q.exception.isDefined) { -throw q.exception.get - } - true -} -testStream(mapped)( - StartStream(ProcessingTime(100), clock), - waitUntilBatchProcessed, - // 1 from smallest, 1 from middle, 8 from biggest - CheckAnswer(1, 10, 100, 101, 102, 103, 104, 105, 106, 107), - AdvanceManualClock(100), - waitUntilBatchProcessed, - // smallest now empty, 1 more from middle, 9 more from biggest - CheckAnswer(1, 10, 100, 101, 102, 103, 104, 105, 106, 107, -11, 108, 109, 110, 111, 112, 113, 114, 115, 116 - ), +testStream(reader.load)( + makeSureGetOffsetCalled, StopStream, - StartStream(ProcessingTime(100), clock), - waitUntilBatchProcessed, 
- // smallest now empty, 1 more from middle, 9 more from biggest - CheckAnswer(1, 10, 100, 101, 102, 103, 104, 105, 106, 107, -11, 108, 109, 110, 111, 112, 113, 114, 115, 116, -12, 117, 118, 119, 120, 121, 122, 123, 124, 125 - ), - AdvanceManualClock(100), - waitUntilBatchProcessed, - //
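The `maxOffsetsPerTrigger` expectations in the hunk above (1 record from the smallest partition, 1 from the middle, 8 from the biggest, for a budget of 10) follow from splitting the per-trigger budget across partitions in proportion to each partition's available offsets, with at least one offset for any lagging partition. A rough sketch of that allocation in plain Python (hypothetical helper, not the actual KafkaSource rate-limiting code):

```python
def rate_limit(available, limit):
    """Split `limit` offsets across partitions proportionally to how many
    offsets each partition has available (its lag). Partitions with any
    lag receive at least one offset, so no partition starves."""
    total = sum(available.values())
    if total <= limit:
        # Budget covers everything available; consume it all.
        return dict(available)
    return {p: max(1, int(limit * n / total)) for p, n in available.items()}

# Partitions with 101, 11, and 1 offsets available, and a budget of 10,
# mirroring the first batch of the quoted test.
alloc = rate_limit({"p0": 101, "p1": 11, "p2": 1}, 10)
# -> {'p0': 8, 'p1': 1, 'p2': 1}
```

This reproduces the quoted CheckAnswer: 8 from the biggest partition, 1 each from the other two.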
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19885 **[Test build #85889 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85889/testReport)** for PR 19885 at commit [`31cfb88`](https://github.com/apache/spark/commit/31cfb88acfd17a3ce4481fcf32f6f2470a932bc1).
[GitHub] spark pull request #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pa...
GitHub user BryanCutler opened a pull request: https://github.com/apache/spark/pull/20213 [SPARK-23018][PYTHON] Fix createDataFrame from Pandas timestamp series assignment ## What changes were proposed in this pull request? This fixes createDataFrame from Pandas to only assign modified timestamp series back to a copied version of the Pandas DataFrame. Previously, if the Pandas DataFrame was only a reference (e.g. a slice of another), each series would still be assigned back to the reference even if it was not a modified timestamp column. This caused the warning "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame." ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/BryanCutler/spark pyspark-createDataFrame-copy-slice-warn-SPARK-23018 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20213.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20213 commit bdeead620783df3d5b39897cba7001105b2816a7 Author: Bryan Cutler Date: 2018-01-09T23:51:25Z Changed createDataFrame to only assign series if modified timestamp field
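The copy-before-assign pattern the PR description refers to can be sketched in plain pandas. This is an illustrative sketch, not the actual `session.py` code; `convert_timestamps` and the tz-stripping conversion are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

def convert_timestamps(pdf, timestamp_cols):
    """Return a DataFrame with timestamp columns converted to tz-naive.

    Assigning into a copy, rather than into the possibly-sliced input,
    avoids pandas' SettingWithCopyWarning.
    """
    copied = False
    for col in timestamp_cols:
        s = pdf[col].dt.tz_localize(None)  # stand-in for the real conversion
        if not copied:
            pdf = pdf.copy()  # copy once, and only if something is modified
            copied = True
        pdf[col] = s
    return pdf

pdf = pd.DataFrame({"ts": pd.date_range("2018-01-09", periods=3, tz="UTC"),
                    "x": np.arange(3)})
out = convert_timestamps(pdf[:2], ["ts"])  # a slice, as in the bug report
```

Copying once, and only when a column is actually modified, keeps the no-timestamp case allocation-free while silencing the warning for sliced inputs.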
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20213 **[Test build #85890 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85890/testReport)** for PR 20213 at commit [`bdeead6`](https://github.com/apache/spark/commit/bdeead620783df3d5b39897cba7001105b2816a7).
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85875 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85875/testReport)** for PR 20096 at commit [`f825155`](https://github.com/apache/spark/commit/f8251552398f980768b23059c1bbbd028cfee859). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20213 ping @ueshin @HyukjinKwon
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20213 repro to get the warning (this is the non-Arrow code path)

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.rand(100, 2))
df = spark.createDataFrame(pdf[:10])
'''
/home/bryan/git/spark/python/pyspark/sql/session.py:476: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  pdf[column] = s
'''
```
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160550392 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -237,85 +378,67 @@ class KafkaSourceSuite extends KafkaSourceTest { } } - test("(de)serialization of initial offsets") { + test("KafkaSource with watermark") { +val now = System.currentTimeMillis() val topic = newTopic() -testUtils.createTopic(topic, partitions = 64) +testUtils.createTopic(newTopic(), partitions = 1) +testUtils.sendMessages(topic, Array(1).map(_.toString)) -val reader = spark +val kafka = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", testUtils.brokerAddress) + .option("kafka.metadata.max.age.ms", "1") + .option("startingOffsets", s"earliest") .option("subscribe", topic) + .load() -testStream(reader.load)( - makeSureGetOffsetCalled, - StopStream, - StartStream(), - StopStream) +val windowedAggregation = kafka + .withWatermark("timestamp", "10 seconds") + .groupBy(window($"timestamp", "5 seconds") as 'window) + .agg(count("*") as 'count) + .select($"window".getField("start") as 'window, $"count") + +val query = windowedAggregation + .writeStream + .format("memory") + .outputMode("complete") + .queryName("kafkaWatermark") + .start() +query.processAllAvailable() +val rows = spark.table("kafkaWatermark").collect() +assert(rows.length === 1, s"Unexpected results: ${rows.toList}") +val row = rows(0) +// We cannot check the exact window start time as it depands on the time that messages were +// inserted by the producer. So here we just use a low bound to make sure the internal +// conversion works. 
+assert( + row.getAs[java.sql.Timestamp]("window").getTime >= now - 5 * 1000, + s"Unexpected results: $row") +assert(row.getAs[Int]("count") === 1, s"Unexpected results: $row") +query.stop() } +} - test("maxOffsetsPerTrigger") { +class KafkaSourceSuiteBase extends KafkaSourceTest { + + import testImplicits._ + + test("(de)serialization of initial offsets") { --- End diff -- Is this needed in the common KafkaSourceSuiteBase? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
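The watermark test in the hunk above buckets a single event into a 5-second window and then lower-bounds the window start, since the exact producer insertion time is unknown. The window arithmetic can be illustrated in plain Python; this is a conceptual sketch of event-time windowing with a fixed allowed lateness, not Spark's implementation, and all names are hypothetical:

```python
from collections import Counter

def window_start(ts_ms, window_ms=5000):
    """Floor an event-time timestamp (ms) to the start of its tumbling window."""
    return ts_ms - (ts_ms % window_ms)

def windowed_counts(events_ms, watermark_ms, delay_ms=10000):
    """Count events per 5-second window, dropping events that fall behind
    the watermark minus the allowed lateness, roughly as a watermark would."""
    threshold = watermark_ms - delay_ms
    return Counter(window_start(t) for t in events_ms if t >= threshold)

# Watermark at 17.1s with 10s allowed lateness: the event at 1.0s is too late.
counts = windowed_counts([12_300, 13_900, 17_100, 1_000], watermark_ms=17_100)
```

A watermark of roughly `max event time - delay` is what lets the engine eventually finalize windows; here it simply filters out events older than the threshold before counting.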
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160559942 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSinkSuite.scala --- @@ -0,0 +1,348 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Locale +import java.util.concurrent.atomic.AtomicInteger + +import org.apache.kafka.clients.producer.ProducerConfig +import org.apache.kafka.common.serialization.ByteArraySerializer +import org.scalatest.time.SpanSugar._ +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.{AnalysisException, DataFrame, Row, SaveMode} +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, SpecificInternalRow, UnsafeProjection} +import org.apache.spark.sql.streaming._ +import org.apache.spark.sql.types.{BinaryType, DataType} +import org.apache.spark.util.Utils + +/** + * This is a temporary port of KafkaSinkSuite, since we do not yet have a V2 memory stream. + * Once we have one, this will be changed to a specialization of KafkaSinkSuite and we won't have + * to duplicate all the code. 
+ */ +class KafkaContinuousSinkSuite extends KafkaContinuousTest { + import testImplicits._ + + override val streamingTimeout = 30.seconds + + override def beforeAll(): Unit = { +super.beforeAll() +testUtils = new KafkaTestUtils( + withBrokerProps = Map("auto.create.topics.enable" -> "false")) +testUtils.setup() + } + + override def afterAll(): Unit = { +if (testUtils != null) { + testUtils.teardown() + testUtils = null +} +super.afterAll() + } + + test("streaming - write to kafka with topic field") { +val inputTopic = newTopic() +testUtils.createTopic(inputTopic, partitions = 1) + +val input = spark + .readStream + .format("kafka") + .option("kafka.bootstrap.servers", testUtils.brokerAddress) + .option("subscribe", inputTopic) + .option("startingOffsets", "earliest") + .load() + +val topic = newTopic() +testUtils.createTopic(topic) + +val writer = createKafkaWriter( + input.toDF(), + withTopic = None, + withOutputMode = Some(OutputMode.Append))( + withSelectExpr = s"'$topic' as topic", "value") + +val reader = createKafkaReader(topic) + .selectExpr("CAST(key as STRING) key", "CAST(value as STRING) value") + .selectExpr("CAST(key as INT) key", "CAST(value as INT) value") + .as[(Int, Int)] + .map(_._2) + +try { + testUtils.sendMessages(inputTopic, Array("1", "2", "3", "4", "5")) + failAfter(streamingTimeout) { +writer.processAllAvailable() + } + checkDatasetUnorderly(reader, 1, 2, 3, 4, 5) + testUtils.sendMessages(inputTopic, Array("6", "7", "8", "9", "10")) + failAfter(streamingTimeout) { +writer.processAllAvailable() + } + checkDatasetUnorderly(reader, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) +} finally { + writer.stop() +} + } + + test("streaming - write data with bad schema") { --- End diff -- missing tests for ."w/o topic field, with topic option" and "topic field and topic option". and also test for the case when topic field is null. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160549136 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -49,28 +52,37 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext { override val streamingTimeout = 30.seconds + protected val brokerProps = Map[String, Object]() + override def beforeAll(): Unit = { super.beforeAll() -testUtils = new KafkaTestUtils +testUtils = new KafkaTestUtils(brokerProps) testUtils.setup() } override def afterAll(): Unit = { if (testUtils != null) { testUtils.teardown() testUtils = null - super.afterAll() } +super.afterAll() } protected def makeSureGetOffsetCalled = AssertOnQuery { q => // Because KafkaSource's initialPartitionOffsets is set lazily, we need to make sure -// its "getOffset" is called before pushing any data. Otherwise, because of the race contion, +// its "getOffset" is called before pushing any data. Otherwise, because of the race contOOion, --- End diff -- spelling mistake? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160516280 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Properties +import java.util.concurrent.atomic.AtomicInteger + +import org.scalatest.time.SpanSugar._ +import scala.collection.mutable +import scala.util.Random + +import org.apache.spark.SparkContext +import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, Row} +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation +import org.apache.spark.sql.execution.streaming.StreamExecution +import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution +import org.apache.spark.sql.streaming.{StreamTest, Trigger} +import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession} + +trait KafkaContinuousTest extends KafkaSourceTest { + override val defaultTrigger = Trigger.Continuous(1000) + override val defaultUseV2Sink = true + + // We need more than the default local[2] to be able to schedule all partitions simultaneously. 
+ override protected def createSparkSession = new TestSparkSession( +new SparkContext( + "local[10]", + "continuous-stream-test-sql-context", + sparkConf.set("spark.sql.testkey", "true"))) + + override protected def setTopicPartitions( --- End diff -- Add comment on what this method does. It is asserting something, so does not look like it only "sets" something. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160546593 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Properties +import java.util.concurrent.atomic.AtomicInteger + +import org.scalatest.time.SpanSugar._ +import scala.collection.mutable +import scala.util.Random + +import org.apache.spark.SparkContext +import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, Row} +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation +import org.apache.spark.sql.execution.streaming.StreamExecution +import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution +import org.apache.spark.sql.streaming.{StreamTest, Trigger} +import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession} + +trait KafkaContinuousTest extends KafkaSourceTest { + override val defaultTrigger = Trigger.Continuous(1000) + override val defaultUseV2Sink = true + + // We need more than the default local[2] to be able to schedule all partitions simultaneously. 
+ override protected def createSparkSession = new TestSparkSession( +new SparkContext( + "local[10]", + "continuous-stream-test-sql-context", + sparkConf.set("spark.sql.testkey", "true"))) + + // In addition to setting the partitions in Kafka, we have to wait until the query has + // reconfigured to the new count so the test framework can hook in properly. + override protected def setTopicPartitions( + topic: String, newCount: Int, query: StreamExecution) = { +testUtils.addPartitions(topic, newCount) +eventually(timeout(streamingTimeout)) { + assert( +query.lastExecution.logical.collectFirst { + case DataSourceV2Relation(_, r: KafkaContinuousReader) => r +}.exists(_.knownPartitions.size == newCount), +s"query never reconfigured to $newCount partitions") +} + } + + test("ensure continuous stream is being used") { +val query = spark.readStream + .format("rate") + .option("numPartitions", "1") + .option("rowsPerSecond", "1") + .load() + +testStream(query)( + Execute(q => assert(q.isInstanceOf[ContinuousExecution])) +) + } +} + +class KafkaContinuousSourceSuite extends KafkaSourceSuiteBase with KafkaContinuousTest { --- End diff -- The `{ }` may not be needed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160546455 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Properties +import java.util.concurrent.atomic.AtomicInteger + +import org.scalatest.time.SpanSugar._ +import scala.collection.mutable +import scala.util.Random + +import org.apache.spark.SparkContext +import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, Row} +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation +import org.apache.spark.sql.execution.streaming.StreamExecution +import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution +import org.apache.spark.sql.streaming.{StreamTest, Trigger} +import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession} + +trait KafkaContinuousTest extends KafkaSourceTest { + override val defaultTrigger = Trigger.Continuous(1000) + override val defaultUseV2Sink = true + + // We need more than the default local[2] to be able to schedule all partitions simultaneously. 
+ override protected def createSparkSession = new TestSparkSession( +new SparkContext( + "local[10]", + "continuous-stream-test-sql-context", + sparkConf.set("spark.sql.testkey", "true"))) + + // In addition to setting the partitions in Kafka, we have to wait until the query has + // reconfigured to the new count so the test framework can hook in properly. + override protected def setTopicPartitions( + topic: String, newCount: Int, query: StreamExecution) = { +testUtils.addPartitions(topic, newCount) +eventually(timeout(streamingTimeout)) { + assert( +query.lastExecution.logical.collectFirst { + case DataSourceV2Relation(_, r: KafkaContinuousReader) => r +}.exists(_.knownPartitions.size == newCount), +s"query never reconfigured to $newCount partitions") +} + } + + test("ensure continuous stream is being used") { +val query = spark.readStream + .format("rate") + .option("numPartitions", "1") + .option("rowsPerSecond", "1") + .load() + +testStream(query)( + Execute(q => assert(q.isInstanceOf[ContinuousExecution])) +) + } +} + +class KafkaContinuousSourceSuite extends KafkaSourceSuiteBase with KafkaContinuousTest { --- End diff -- Add docs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
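The `eventually(timeout(streamingTimeout)) { assert(...) }` idiom in the review comment above polls a condition until it holds or a deadline passes, which is how the test waits for the query to reconfigure to the new partition count. A minimal Python sketch of the same retry idiom — `FakeReader` and the helper names are illustrative stand-ins, not Spark APIs:

```python
import time

def eventually(condition, timeout_s=30.0, interval_s=0.1):
    """Poll `condition` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    last_error = None
    while time.monotonic() < deadline:
        try:
            if condition():
                return
        except AssertionError as e:
            last_error = e  # keep the most recent failure for the final message
        time.sleep(interval_s)
    raise AssertionError(f"condition never became true within {timeout_s}s: {last_error}")

# Example: wait until a (simulated) reader has picked up the new partition count.
class FakeReader:
    def __init__(self):
        self.known_partitions = 1

reader = FakeReader()
reader.known_partitions = 5  # in the real test, Kafka repartitioning drives this
eventually(lambda: reader.known_partitions == 5, timeout_s=1.0)
```

The key design point, as in the Scala test, is that the assertion is retried rather than evaluated once, since reconfiguration happens asynchronously.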
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160546427 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Properties +import java.util.concurrent.atomic.AtomicInteger + +import org.scalatest.time.SpanSugar._ +import scala.collection.mutable +import scala.util.Random + +import org.apache.spark.SparkContext +import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, Row} +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation +import org.apache.spark.sql.execution.streaming.StreamExecution +import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution +import org.apache.spark.sql.streaming.{StreamTest, Trigger} +import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession} + +trait KafkaContinuousTest extends KafkaSourceTest { --- End diff -- Add docs to explain what this class if for. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160552695 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Properties +import java.util.concurrent.atomic.AtomicInteger + +import org.scalatest.time.SpanSugar._ +import scala.collection.mutable +import scala.util.Random + +import org.apache.spark.SparkContext +import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, Row} +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation +import org.apache.spark.sql.execution.streaming.StreamExecution +import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution +import org.apache.spark.sql.streaming.{StreamTest, Trigger} +import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession} + +trait KafkaContinuousTest extends KafkaSourceTest { --- End diff -- Also since this is used not just by the source, but also the sink, better to define this in a different file. 
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160552160 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -977,20 +971,8 @@ class KafkaSourceStressForDontFailOnDataLossSuite extends StreamTest with Shared } } - test("stress test for failOnDataLoss=false") { -val reader = spark - .readStream - .format("kafka") - .option("kafka.bootstrap.servers", testUtils.brokerAddress) - .option("kafka.metadata.max.age.ms", "1") - .option("subscribePattern", "failOnDataLoss.*") - .option("startingOffsets", "earliest") - .option("failOnDataLoss", "false") - .option("fetchOffset.retryIntervalMs", "3000") -val kafka = reader.load() - .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") - .as[(String, String)] -val query = kafka.map(kv => kv._2.toInt).writeStream.foreach(new ForeachWriter[Int] { + protected def startStream(ds: Dataset[Int]) = { --- End diff -- i think this factoring is not needed. `startStream()` is not used anywhere else other than in this test. So i dont see a point of refactoring it to define it outside the test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160552554 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,135 @@ +/* --- End diff -- Rename this file to KafkaContinuousSourceSuite --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160550486 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -237,85 +378,67 @@ class KafkaSourceSuite extends KafkaSourceTest { } } - test("(de)serialization of initial offsets") { + test("KafkaSource with watermark") { +val now = System.currentTimeMillis() val topic = newTopic() -testUtils.createTopic(topic, partitions = 64) +testUtils.createTopic(newTopic(), partitions = 1) +testUtils.sendMessages(topic, Array(1).map(_.toString)) -val reader = spark +val kafka = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", testUtils.brokerAddress) + .option("kafka.metadata.max.age.ms", "1") + .option("startingOffsets", s"earliest") .option("subscribe", topic) + .load() -testStream(reader.load)( - makeSureGetOffsetCalled, - StopStream, - StartStream(), - StopStream) +val windowedAggregation = kafka + .withWatermark("timestamp", "10 seconds") + .groupBy(window($"timestamp", "5 seconds") as 'window) + .agg(count("*") as 'count) + .select($"window".getField("start") as 'window, $"count") + +val query = windowedAggregation + .writeStream + .format("memory") + .outputMode("complete") + .queryName("kafkaWatermark") + .start() +query.processAllAvailable() +val rows = spark.table("kafkaWatermark").collect() +assert(rows.length === 1, s"Unexpected results: ${rows.toList}") +val row = rows(0) +// We cannot check the exact window start time as it depands on the time that messages were +// inserted by the producer. So here we just use a low bound to make sure the internal +// conversion works. 
+assert( + row.getAs[java.sql.Timestamp]("window").getTime >= now - 5 * 1000, + s"Unexpected results: $row") +assert(row.getAs[Int]("count") === 1, s"Unexpected results: $row") +query.stop() } +} - test("maxOffsetsPerTrigger") { +class KafkaSourceSuiteBase extends KafkaSourceTest { + + import testImplicits._ + + test("(de)serialization of initial offsets") { val topic = newTopic() -testUtils.createTopic(topic, partitions = 3) -testUtils.sendMessages(topic, (100 to 200).map(_.toString).toArray, Some(0)) -testUtils.sendMessages(topic, (10 to 20).map(_.toString).toArray, Some(1)) -testUtils.sendMessages(topic, Array("1"), Some(2)) +testUtils.createTopic(topic, partitions = 5) val reader = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", testUtils.brokerAddress) - .option("kafka.metadata.max.age.ms", "1") - .option("maxOffsetsPerTrigger", 10) .option("subscribe", topic) - .option("startingOffsets", "earliest") -val kafka = reader.load() - .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") - .as[(String, String)] -val mapped: org.apache.spark.sql.Dataset[_] = kafka.map(kv => kv._2.toInt) - -val clock = new StreamManualClock - -val waitUntilBatchProcessed = AssertOnQuery { q => - eventually(Timeout(streamingTimeout)) { -if (!q.exception.isDefined) { - assert(clock.isStreamWaitingAt(clock.getTimeMillis())) -} - } - if (q.exception.isDefined) { -throw q.exception.get - } - true -} -testStream(mapped)( - StartStream(ProcessingTime(100), clock), - waitUntilBatchProcessed, - // 1 from smallest, 1 from middle, 8 from biggest - CheckAnswer(1, 10, 100, 101, 102, 103, 104, 105, 106, 107), - AdvanceManualClock(100), - waitUntilBatchProcessed, - // smallest now empty, 1 more from middle, 9 more from biggest - CheckAnswer(1, 10, 100, 101, 102, 103, 104, 105, 106, 107, -11, 108, 109, 110, 111, 112, 113, 114, 115, 116 - ), +testStream(reader.load)( + makeSureGetOffsetCalled, StopStream, - StartStream(ProcessingTime(100), clock), - waitUntilBatchProcessed, 
- // smallest now empty, 1 more from middle, 9 more from biggest - CheckAnswer(1, 10, 100, 101, 102, 103, 104, 105, 106, 107, -11, 108, 109, 110, 111, 112, 113, 114, 115, 116, -12, 117, 118, 119, 120, 121, 122, 123, 124, 125 - ), - AdvanceManualClock(100), - waitUntilBatchProcessed, - // smallest now
[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20211 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85879/ Test PASSed.
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20213 **[Test build #85890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85890/testReport)** for PR 20213 at commit [`bdeead6`](https://github.com/apache/spark/commit/bdeead620783df3d5b39897cba7001105b2816a7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20213: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas ti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20213 Merged build finished. Test PASSed.
[GitHub] spark issue #20188: [SPARK-22993][ML] Clarify HasCheckpointInterval param do...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20188 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85886/ Test PASSed.
[GitHub] spark issue #20188: [SPARK-22993][ML] Clarify HasCheckpointInterval param do...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20188 **[Test build #85886 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85886/testReport)** for PR 20188 at commit [`c966c0c`](https://github.com/apache/spark/commit/c966c0c9a7aad87f844036f69740cfb803f677ef). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20188: [SPARK-22993][ML] Clarify HasCheckpointInterval param do...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20188 Merged build finished. Test PASSed.
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160569556 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -977,20 +971,8 @@ class KafkaSourceStressForDontFailOnDataLossSuite extends StreamTest with Shared } } - test("stress test for failOnDataLoss=false") { -val reader = spark - .readStream - .format("kafka") - .option("kafka.bootstrap.servers", testUtils.brokerAddress) - .option("kafka.metadata.max.age.ms", "1") - .option("subscribePattern", "failOnDataLoss.*") - .option("startingOffsets", "earliest") - .option("failOnDataLoss", "false") - .option("fetchOffset.retryIntervalMs", "3000") -val kafka = reader.load() - .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") - .as[(String, String)] -val query = kafka.map(kv => kv._2.toInt).writeStream.foreach(new ForeachWriter[Int] { + protected def startStream(ds: Dataset[Int]) = { --- End diff -- startStream is overridden in the continuous version of this test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160570055 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala --- @@ -49,28 +52,37 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext { override val streamingTimeout = 30.seconds + protected val brokerProps = Map[String, Object]() + override def beforeAll(): Unit = { super.beforeAll() -testUtils = new KafkaTestUtils +testUtils = new KafkaTestUtils(brokerProps) testUtils.setup() } override def afterAll(): Unit = { if (testUtils != null) { testUtils.teardown() testUtils = null - super.afterAll() } +super.afterAll() } protected def makeSureGetOffsetCalled = AssertOnQuery { q => // Because KafkaSource's initialPartitionOffsets is set lazily, we need to make sure -// its "getOffset" is called before pushing any data. Otherwise, because of the race contion, +// its "getOffset" is called before pushing any data. Otherwise, because of the race contOOion, --- End diff -- I remember wondering this morning why my command-O key sequence wasn't working... I guess this is where it went. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20023 Build finished. Test FAILed.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20023 **[Test build #85863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85863/testReport)** for PR 20023 at commit [`6701a54`](https://github.com/apache/spark/commit/6701a54068145994e10b8dd38d9a38a1be1f3674). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark issue #18991: [SPARK-21783][SQL][WIP] Turn on ORC filter push-down by ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18991 **[Test build #85868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85868/testReport)** for PR 18991 at commit [`2bc2b17`](https://github.com/apache/spark/commit/2bc2b17aba5231c6ac3e0ab7c830acc56790df9f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18991: [SPARK-21783][SQL][WIP] Turn on ORC filter push-down by ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18991 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85868/ Test PASSed.
[GitHub] spark issue #18991: [SPARK-21783][SQL][WIP] Turn on ORC filter push-down by ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18991 Merged build finished. Test PASSed.
[GitHub] spark issue #20179: [SPARK-22982] Remove unsafe asynchronous close() call fr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20179 **[Test build #85885 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85885/testReport)** for PR 20179 at commit [`9e5f20e`](https://github.com/apache/spark/commit/9e5f20eef09fbd357c7cc8bb19eca7d4bf5f7170).
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160549913 --- Diff: python/pyspark/context.py --- @@ -1023,6 +1032,35 @@ def getConf(self): conf.setAll(self._conf.getAll()) return conf +def install_packages(self, packages, install_driver=True): +""" +install python packages on all executors and driver through pip. pip will be installed +by default no matter using native virtualenv or conda. So it is guaranteed that pip is +available if virtualenv is enabled. +:param packages: string for single package or a list of string for multiple packages +:param install_driver: whether to install packages in client +""" +if self._conf.get("spark.pyspark.virtualenv.enabled") != "true": +raise RuntimeError("install_packages can only use called when " + "spark.pyspark.virtualenv.enabled set as true") +if isinstance(packages, basestring): +packages = [packages] +# seems statusTracker.getExecutorInfos() will return driver + exeuctors, so -1 here. +num_executors = len(self._jsc.sc().statusTracker().getExecutorInfos()) - 1 +dummyRDD = self.parallelize(range(num_executors), num_executors) + +def _run_pip(packages, iterator): +import pip +pip.main(["install"] + packages) + +# run it in the main thread. Will do it in a separated thread after +# https://github.com/pypa/pip/issues/2553 is fixed +if install_driver: +_run_pip(packages, None) + +import functools +dummyRDD.foreachPartition(functools.partial(_run_pip, packages)) --- End diff -- what about making this feature experimental and so improving it gradually ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
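The `install_packages` snippet under discussion binds the package list into the per-partition function with `functools.partial`, then relies on `foreachPartition` over a dummy RDD with one partition per executor to run the install everywhere. A dependency-free sketch of that binding pattern — `fake_pip_install` is a recorder standing in for the `pip.main(["install", ...])` call the real code makes, and the plain loop stands in for `foreachPartition`:

```python
import functools

installed = []

def fake_pip_install(args):
    # Stand-in for pip.main(["install", ...]); records the call instead of installing.
    installed.append(tuple(args))

def _run_pip(packages, iterator):
    # Mirrors the PR's helper: `iterator` is the (unused) partition iterator.
    fake_pip_install(["install"] + packages)

# One "partition" per executor, as the PR approximates with a dummy RDD.
num_executors = 3
partitions = [iter([i]) for i in range(num_executors)]

run = functools.partial(_run_pip, ["numpy", "pandas"])
for part in partitions:  # foreachPartition would invoke this on each executor
    run(part)

assert len(installed) == num_executors
```

The `partial` is what lets the same closure ship to every partition with the package list already baked in, which is why the PR uses it rather than a lambda capturing loop state.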
[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20192 ``` $ mvn clean integration-test -Dspark-distro-tgz=/work/apache/spark/spark-2.3.0-SNAPSHOT-bin-2.7.3.tgz -DextraScalaTestArgs="-Dspark.kubernetes.test.master=k8s://https://192.168.99.100:8443 -Dspark.docker.test.skipBuildImages=true -Dspark.docker.test.driverImage=spark:master -Dspark.docker.test.executorImage=spark:master -Dspark.docker.test.initContainerImage=spark:master -Dspark.docker.test.persistMinikube=true" ... [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @ spark-kubernetes-integration-tests_2.11 --- Discovery starting. Discovery completed in 101 milliseconds. Run starting. Expected test count is: 9 KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi using the remote example jar. - Run SparkPi with custom driver pod name, labels, annotations, and environment variables. - Run SparkPi with a test secret mounted into the driver and executor pods - Run SparkPi using the remote example jar with a test secret mounted into the driver and executor pods - Run PageRank using remote data file Run completed in 3 minutes, 18 seconds. ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20168 **[Test build #85880 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85880/testReport)** for PR 20168 at commit [`48eddf1`](https://github.com/apache/spark/commit/48eddf10c884f9eea368efa9001b9fefd42de9e9). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #85865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85865/testReport)** for PR 20208 at commit [`499801e`](https://github.com/apache/spark/commit/499801e7fdd545ac5918dd5f7a9294db2d5373be). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext ` * `trait AddColumnEvolutionTest extends SchemaEvolutionTest ` * `trait RemoveColumnEvolutionTest extends SchemaEvolutionTest ` * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest ` * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest ` * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest ` * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest ` * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85865/ Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20204 Merged build finished. Test PASSed.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20204 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85862/ Test PASSed.
[GitHub] spark issue #20212: Update rdd-programming-guide.md
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20212 Can you spell-check the rest of the surrounding docs while you're at it? There are often a few more typos around, and it helps to resolve these in fewer PRs instead of one by one.
[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...
Github user foxish commented on the issue: https://github.com/apache/spark/pull/20192 Great, thanks @vanzin. We'll probably need to add a test case using the new option as well - I can take care of that. Thanks for the change.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20204 I usually run tests like this: ``` SPARK_TESTING=1 bin/pyspark pyspark.sql.tests ... ``` Does it make sense to expose similar usage for enabling coverage? Something like ``` SPARK_TESTING_WITH_COVERAGE=1 bin/pyspark pyspark.sql.tests ... ```
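An environment-variable switch like the `SPARK_TESTING_WITH_COVERAGE` suggested above could be honored with a small opt-in gate at test startup. A sketch under those assumptions — the variable name comes from the comment, not from Spark, and `coverage` is the standard coverage.py package, imported only when the flag is set:

```python
import os

def maybe_start_coverage():
    """Start coverage collection only when the opt-in flag is set; return the
    Coverage object so the caller can stop and save it, or None when disabled."""
    if os.environ.get("SPARK_TESTING_WITH_COVERAGE") != "1":
        return None
    import coverage  # coverage.py; only required when the flag is on
    cov = coverage.Coverage()
    cov.start()
    return cov

cov = maybe_start_coverage()
# ... run the test suite here ...
if cov is not None:
    cov.stop()
    cov.save()
```

Gating the import itself means developers who never set the flag don't need coverage.py installed, which matches how the existing `SPARK_TESTING=1` switch costs nothing when unused.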
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20168 **[Test build #85884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85884/testReport)** for PR 20168 at commit [`763c8a6`](https://github.com/apache/spark/commit/763c8a613ddcb0c7960c800b4fb7496964d29b11).
[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20013 Merged build finished. Test PASSed.
[GitHub] spark issue #20188: [SPARK-22993][ML] Clarify HasCheckpointInterval param do...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20188 Actually, in R the setCheckpointDir method is not attached to the SparkContext; I'd leave it as "not set" or "not set in the session". https://spark.apache.org/docs/latest/api/R/setCheckpointDir.html
[GitHub] spark issue #20200: [SPARK-23005][Core] Improve RDD.take on small number of ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20200 LGTM, merging to master/2.3!
[GitHub] spark pull request #20200: [SPARK-23005][Core] Improve RDD.take on small num...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20200
[GitHub] spark issue #20202: [MINOR] fix a typo in BroadcastJoinSuite
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20202 thanks, merging to master/2.3!
[GitHub] spark pull request #20202: [MINOR] fix a typo in BroadcastJoinSuite
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20202
[GitHub] spark issue #20176: [SPARK-22981][SQL] Fix incorrect results of Casting Stru...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20176 Showing binary as string and casting binary to string seem different to me; let's stick with what it is. BTW it's pretty dangerous to change the behavior of cast to be different from Hive.
[GitHub] spark issue #20176: [SPARK-22981][SQL] Fix incorrect results of Casting Stru...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20176 ok, I'll make a follow-up for `showString` later.
[GitHub] spark issue #20179: [SPARK-22982] Remove unsafe asynchronous close() call fr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20179 **[Test build #85885 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85885/testReport)** for PR 20179 at commit [`9e5f20e`](https://github.com/apache/spark/commit/9e5f20eef09fbd357c7cc8bb19eca7d4bf5f7170). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20179: [SPARK-22982] Remove unsafe asynchronous close() call fr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20179 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85885/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20179: [SPARK-22982] Remove unsafe asynchronous close() call fr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20179 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85888 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85888/testReport)** for PR 20096 at commit [`9158af2`](https://github.com/apache/spark/commit/9158af23dfff674641143b892f0f1093814035a3). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85888/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20207: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20207 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20205: [SPARK-16060][SQL][follow-up] add a wrapper solut...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20205#discussion_r160579252

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java ---
@@ -196,17 +234,26 @@ public void initBatch(
    * by copying from ORC VectorizedRowBatch columns to Spark ColumnarBatch columns.
    */
   private boolean nextBatch() throws IOException {
-    for (WritableColumnVector vector : columnVectors) {
-      vector.reset();
-    }
-    columnarBatch.setNumRows(0);
--- End diff --

it's moved to https://github.com/apache/spark/pull/20205/files#diff-e594f7295e5408c01ace8175166313b6R253
[GitHub] spark issue #20207: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20207 **[Test build #85896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85896/testReport)** for PR 20207 at commit [`f239b2b`](https://github.com/apache/spark/commit/f239b2b51f32addf82ad5454e86ecbb9bcbe65a2).
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19885 **[Test build #85897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85897/testReport)** for PR 19885 at commit [`778e1ef`](https://github.com/apache/spark/commit/778e1ef903d46a46f3a389fc9b5bf8038ac7cb71).
[GitHub] spark issue #20207: [SPARK-23000] [TEST-HADOOP2.6] Fix Flaky test suite Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20207 **[Test build #85898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85898/testReport)** for PR 20207 at commit [`9039817`](https://github.com/apache/spark/commit/9039817d516f2cfb68f9caa41374098780fde3ca).
[GitHub] spark pull request #20013: [SPARK-20657][core] Speed up rendering of the sta...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20013#discussion_r160580420

--- Diff: core/src/main/scala/org/apache/spark/status/LiveEntity.scala ---
@@ -313,35 +316,68 @@ private class LiveExecutor(val executorId: String, _addTime: Long) extends LiveE
 }
 
-/** Metrics tracked per stage (both total and per executor). */
-private class MetricsTracker {
-  var executorRunTime = 0L
-  var executorCpuTime = 0L
-  var inputBytes = 0L
-  var inputRecords = 0L
-  var outputBytes = 0L
-  var outputRecords = 0L
-  var shuffleReadBytes = 0L
-  var shuffleReadRecords = 0L
-  var shuffleWriteBytes = 0L
-  var shuffleWriteRecords = 0L
-  var memoryBytesSpilled = 0L
-  var diskBytesSpilled = 0L
-
-  def update(delta: v1.TaskMetrics): Unit = {
-    executorRunTime += delta.executorRunTime
-    executorCpuTime += delta.executorCpuTime
-    inputBytes += delta.inputMetrics.bytesRead
-    inputRecords += delta.inputMetrics.recordsRead
-    outputBytes += delta.outputMetrics.bytesWritten
-    outputRecords += delta.outputMetrics.recordsWritten
-    shuffleReadBytes += delta.shuffleReadMetrics.localBytesRead +
-      delta.shuffleReadMetrics.remoteBytesRead
-    shuffleReadRecords += delta.shuffleReadMetrics.recordsRead
-    shuffleWriteBytes += delta.shuffleWriteMetrics.bytesWritten
-    shuffleWriteRecords += delta.shuffleWriteMetrics.recordsWritten
-    memoryBytesSpilled += delta.memoryBytesSpilled
-    diskBytesSpilled += delta.diskBytesSpilled
+private class MetricsTracker(
--- End diff --

ok, but the code becomes uglier because `v1.TaskMetrics` is kind of an annoying type to use.
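The pattern in the `MetricsTracker` diff above, where each completed task contributes a metrics delta that is summed into per-stage totals, can be sketched in a few lines. The names below (`TaskMetricsDelta`, `StageMetrics`) are hypothetical and simplified; the real class tracks many more counters and takes a `v1.TaskMetrics`.

```scala
// Minimal sketch (hypothetical, simplified names) of the delta-accumulation
// pattern: each finished task reports a delta, summed into stage totals.
case class TaskMetricsDelta(runTime: Long, bytesRead: Long, recordsRead: Long)

final class StageMetrics {
  var executorRunTime = 0L
  var inputBytes = 0L
  var inputRecords = 0L

  def update(delta: TaskMetricsDelta): Unit = {
    executorRunTime += delta.runTime
    inputBytes += delta.bytesRead
    inputRecords += delta.recordsRead
  }
}

val stage = new StageMetrics
Seq(TaskMetricsDelta(10, 100, 5), TaskMetricsDelta(20, 50, 2)).foreach(stage.update)
println((stage.executorRunTime, stage.inputBytes, stage.inputRecords)) // (30,150,7)
```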
[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20013 **[Test build #85899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85899/testReport)** for PR 20013 at commit [`3d66505`](https://github.com/apache/spark/commit/3d6650589223cfd93c4ced9fb25246bcd88ca899).
[GitHub] spark pull request #20205: [SPARK-16060][SQL][follow-up] add a wrapper solut...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20205#discussion_r160581341

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java ---
@@ -196,17 +234,26 @@ public void initBatch(
    * by copying from ORC VectorizedRowBatch columns to Spark ColumnarBatch columns.
    */
   private boolean nextBatch() throws IOException {
-    for (WritableColumnVector vector : columnVectors) {
-      vector.reset();
-    }
-    columnarBatch.setNumRows(0);
--- End diff --

Yep. I meant keeping here since we return at line 390 and 240.
[GitHub] spark pull request #20214: [SPARK-23023][SQL] Cast field data to strings in ...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/20214 [SPARK-23023][SQL] Cast field data to strings in showString

## What changes were proposed in this pull request?

The current `Dataset.showString` prints rows through `RowEncoder` deserializers like;
```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------------------------------------------+
|a                                                           |
+------------------------------------------------------------+
|[WrappedArray(1, 2), WrappedArray(3), WrappedArray(4, 5, 6)]|
+------------------------------------------------------------+
```
This result is incorrect because the correct one is;
```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------+
|a                       |
+------------------------+
|[[1, 2], [3], [4, 5, 6]]|
+------------------------+
```
So, this pr fixed code in `showString` to cast field data to strings before printing.

## How was this patch tested?

Added tests in `DataFrameSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark SPARK-23023

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20214.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20214

commit eb56aff74352a360d1d4b1273be23b670f3c958a
Author: Takeshi Yamamuro
Date: 2018-01-06T11:05:54Z

    Cast data to strings in showString
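The root cause described in the PR is that the old path printed the `toString` of the deserialized Scala objects, while the fix renders each field the way a cast to string would. A standalone Scala sketch of that difference, using a hypothetical `render` helper rather than the PR's actual code:

```scala
// Hypothetical standalone illustration of the showString fix: toString of
// deserialized collections leaks collection class names, while rendering
// each field recursively produces the bracketed form the PR aims for.
def render(v: Any): String = v match {
  case s: Seq[_] => s.map(render).mkString("[", ", ", "]")
  case other     => other.toString
}

val data = Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))
println(data.toString)  // old-style: includes collection class names
println(render(data))   // fixed-style: [[1, 2], [3], [4, 5, 6]]
```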
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19885 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85897/
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19885 Merged build finished. Test FAILed.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19885 **[Test build #85897 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85897/testReport)** for PR 19885 at commit [`778e1ef`](https://github.com/apache/spark/commit/778e1ef903d46a46f3a389fc9b5bf8038ac7cb71). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20214 **[Test build #85900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85900/testReport)** for PR 20214 at commit [`eb56aff`](https://github.com/apache/spark/commit/eb56aff74352a360d1d4b1273be23b670f3c958a).
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18853 **[Test build #85846 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85846/testReport)** for PR 18853 at commit [`408e889`](https://github.com/apache/spark/commit/408e889caa8d61b7267f0f391be4af5fde82a0c9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18853 Merged build finished. Test FAILed.
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18853 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85846/
[GitHub] spark issue #20195: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20195 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85842/
[GitHub] spark issue #20195: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20195 Merged build finished. Test PASSed.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r160376096

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala ---
@@ -117,6 +117,7 @@ object DecimalType extends AbstractDataType {
   val MAX_SCALE = 38
   val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18)
   val USER_DEFAULT: DecimalType = DecimalType(10, 0)
+  val MINIMUM_ADJUSTED_SCALE = 6
--- End diff --

how about `spark.sql.decimalOperations.allowTruncate`? Let's leave the mode stuff to the type coercion mode.
[GitHub] spark issue #20195: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20195 **[Test build #85842 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85842/testReport)** for PR 20195 at commit [`f55ace6`](https://github.com/apache/spark/commit/f55ace645b46a429a512eb8e922a7074c4cd8cc0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18853 **[Test build #85849 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85849/testReport)** for PR 18853 at commit [`e763330`](https://github.com/apache/spark/commit/e763330edae88d4dad410214608fb5448d90a989).
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r160376186

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala ---
@@ -117,6 +117,7 @@ object DecimalType extends AbstractDataType {
   val MAX_SCALE = 38
   val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18)
   val USER_DEFAULT: DecimalType = DecimalType(10, 0)
+  val MINIMUM_ADJUSTED_SCALE = 6
--- End diff --

We should make it an internal conf and remove it after some releases.
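For context, `MINIMUM_ADJUSTED_SCALE` bounds how much scale the truncation rule may give up when an arithmetic result overflows the maximum precision. A sketch of the adjustment discussed in this thread, assuming it mirrors the SQL Server rule the PR is modeled on (hypothetical standalone code, not Spark's exact implementation):

```scala
// Sketch of the precision-loss adjustment discussed above. Assumption: this
// follows the SQL Server-style rule; Spark's actual code may differ in detail.
val MaxPrecision = 38
val MinimumAdjustedScale = 6

def adjustPrecisionScale(precision: Int, scale: Int): (Int, Int) =
  if (precision <= MaxPrecision) {
    (precision, scale) // fits as-is, nothing to adjust
  } else {
    // Preserve the integral digits; truncate scale, but never below
    // min(scale, MinimumAdjustedScale).
    val intDigits = precision - scale
    val minScale = math.min(scale, MinimumAdjustedScale)
    val adjustedScale = math.max(MaxPrecision - intDigits, minScale)
    (MaxPrecision, adjustedScale)
  }

// decimal(38, 10) * decimal(9, 0) naively needs precision 48, scale 10;
// the rule truncates it to (38, 6), keeping MinimumAdjustedScale digits.
println(adjustPrecisionScale(48, 10)) // (38,6)
```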
[GitHub] spark issue #20199: [Spark-22967][Hive]Fix VersionSuite's unit tests by chan...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20199 **[Test build #85847 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85847/testReport)** for PR 20199 at commit [`22669d1`](https://github.com/apache/spark/commit/22669d1ff0cb00261fa146d276af237c115a0488). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20189: [SPARK-22975] MetricsReporter should not throw exception...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20189 **[Test build #85850 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85850/testReport)** for PR 20189 at commit [`7242eab`](https://github.com/apache/spark/commit/7242eabe00ce84cb132a4a4f16cb53bed1e6afa7).