[GitHub] spark pull request #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstra...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22203#discussion_r212535154
--- Diff: dev/deps/spark-deps-hadoop-2.6 ---
@@ -98,7 +98,7 @@
 jackson-module-jaxb-annotations-2.6.7.jar
 jackson-module-paranamer-2.7.9.jar
 jackson-module-scala_2.11-2.6.7.1.jar
 jackson-xc-1.9.13.jar
-janino-3.0.8.jar
+janino-3.0.9.jar
--- End diff --
Sorry for being late. I confirmed all of the changes in the changelog. Two issues, https://github.com/janino-compiler/janino/pull/54 and https://github.com/janino-compiler/janino/pull/46, are for Java 8 and later support. The others are bug fixes. LGTM.
---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22199: [SPARK-25073][Yarn] AM and Executor Memory validation me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22199 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95196/
[GitHub] spark issue #22199: [SPARK-25073][Yarn] AM and Executor Memory validation me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22199 Merged build finished. Test PASSed.
[GitHub] spark issue #22199: [SPARK-25073][Yarn] AM and Executor Memory validation me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22199 **[Test build #95196 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95196/testReport)** for PR 22199 at commit [`3f4051c`](https://github.com/apache/spark/commit/3f4051cd0895debe1c4f288890b397d5dfe1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21912 Merged build finished. Test FAILed.
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21912 **[Test build #95197 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95197/testReport)** for PR 21912 at commit [`92428cd`](https://github.com/apache/spark/commit/92428cdad9757df9e48b726109da8c3edfa299e5).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21912 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95197/
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21912 **[Test build #95197 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95197/testReport)** for PR 21912 at commit [`92428cd`](https://github.com/apache/spark/commit/92428cdad9757df9e48b726109da8c3edfa299e5).
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21912 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2519/
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21912 Merged build finished. Test PASSed.
[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212533857
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
--- End diff --
Could you add a test case for the one you did?
[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212533706
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
--- End diff --
We should respect `spark.sql.caseSensitive` in both modes, but also add a legacy SQLConf to enable users to revert back to the previous behavior.
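The case-insensitive resolution rule quoted in the migration note above can be sketched outside of Spark. The following is an illustrative toy in plain Java, not Spark's actual implementation; the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CaseInsensitiveResolver {
    /**
     * Resolve a Hive metastore column name against Parquet field names.
     * Case-sensitive mode requires an exact match; case-insensitive mode
     * matches ignoring case and throws if more than one Parquet field matches.
     */
    static String resolve(String metastoreName, List<String> parquetFields, boolean caseSensitive) {
        if (caseSensitive) {
            return parquetFields.contains(metastoreName) ? metastoreName : null;
        }
        List<String> matches = new ArrayList<>();
        for (String f : parquetFields) {
            if (f.equalsIgnoreCase(metastoreName)) {
                matches.add(f);
            }
        }
        if (matches.size() > 1) {
            // Mirrors the "exception on ambiguity" rule from the migration note.
            throw new RuntimeException("Ambiguous column: " + metastoreName + " matches " + matches);
        }
        return matches.isEmpty() ? null : matches.get(0);
    }

    public static void main(String[] args) {
        List<String> parquet = Arrays.asList("ID", "Name");
        // Exact-match lookup: no field literally named "id", so nothing resolves
        // (this is why pre-2.3.2 reads returned null for such columns).
        System.out.println(resolve("id", parquet, true));
        // Case-insensitive lookup (spark.sql.caseSensitive=false since 2.3.2): resolves to "ID".
        System.out.println(resolve("id", parquet, false));
    }
}
```

A name matching two Parquet fields that differ only in case (e.g. `ID` and `id`) raises the ambiguity error instead of silently picking one.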
[GitHub] spark issue #22199: [SPARK-25073][Yarn] AM and Executor Memory validation me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22199 **[Test build #95196 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95196/testReport)** for PR 22199 at commit [`3f4051c`](https://github.com/apache/spark/commit/3f4051cd0895debe1c4f288890b397d5dfe1).
[GitHub] spark issue #22199: [SPARK-25073][Yarn] AM and Executor Memory validation me...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/22199 ok to test.
[GitHub] spark pull request #22213: [SPARK-25221][DEPLOY] Consistent trailing whitesp...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/22213#discussion_r212530383
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2062,8 +2062,10 @@ private[spark] object Utils extends Logging {
       try {
         val properties = new Properties()
         properties.load(inReader)
-        properties.stringPropertyNames().asScala.map(
-          k => (k, properties.getProperty(k).trim)).toMap
+        properties.stringPropertyNames().asScala
+          .map(k => (k, properties.getProperty(k)))
--- End diff --
@gerashegalov would you please elaborate on the use case here? I saw that you're treating `\n` as a property value; what is the specific usage scenario?
[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22171
> How can this happen? When Spark writes decimal out, the external systems will get decimal values, not string values.

I have the same question.
[GitHub] spark issue #17254: [SPARK-19917][SQL]qualified partition path stored in cat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17254 **[Test build #95195 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95195/testReport)** for PR 17254 at commit [`36a3463`](https://github.com/apache/spark/commit/36a34632dbb000799c35727c00d1542d4bb1ce00).
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20637 @cloud-fan @mgaido91 Do you have any other comments on this?
[GitHub] spark pull request #22187: [SPARK-25178][SQL] Directly ship the StructType o...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22187
[GitHub] spark pull request #22213: [SPARK-25221][DEPLOY] Consistent trailing whitesp...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22213#discussion_r212528202
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2062,8 +2062,10 @@ private[spark] object Utils extends Logging {
       try {
         val properties = new Properties()
         properties.load(inReader)
-        properties.stringPropertyNames().asScala.map(
-          k => (k, properties.getProperty(k).trim)).toMap
+        properties.stringPropertyNames().asScala
+          .map(k => (k, properties.getProperty(k)))
--- End diff --
Isn't this going to break existing apps dependent on trimmed values?
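For context on why the removed `.trim` matters: `java.util.Properties.load` already skips whitespace *before* a value, but it keeps trailing whitespace and honors `\n` escapes, so Spark's extra `.trim` is what erases them. A small standalone stdlib demonstration (not Spark code):

```java
import java.io.StringReader;
import java.util.Properties;

public class PropsTrim {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // In the properties format, a backslash-n escape becomes a literal newline,
        // and trailing spaces before the end of a line are kept in the value.
        props.load(new StringReader("a.key=value \nb.key=\\n"));

        String a = props.getProperty("a.key");   // "value " — trailing space preserved by load
        String b = props.getProperty("b.key");   // a single newline character

        System.out.println("[" + a + "] length=" + a.length());
        System.out.println("b is a newline: " + b.equals("\n"));
        // Applying .trim (as the pre-change Spark code did) erases both.
        System.out.println("[" + a.trim() + "] length=" + a.trim().length());
    }
}
```

So the diff changes observable behavior exactly for values like these, which is what both review comments are probing at.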
[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21931 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95190/
[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22171 I don't have a strong opinion about the display; scientific notation is fine to me. I'm curious about "... but also in dataset write operations. External databases like netezza fails to save the result ..." How can this happen? When Spark writes decimal out, the external systems will get decimal values, not string values.
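On the display question: whether a small decimal renders in scientific notation comes down to which `BigDecimal` string form is used. A minimal stdlib illustration, independent of Spark's internals:

```java
import java.math.BigDecimal;

public class DecimalDisplay {
    public static void main(String[] args) {
        BigDecimal small = new BigDecimal("0.00000057");
        // toString switches to scientific notation once the adjusted exponent
        // drops below -6; toPlainString never does.
        System.out.println(small.toString());       // 5.7E-7
        System.out.println(small.toPlainString());  // 0.00000057
    }
}
```

This is why a value can "look fine" numerically yet surprise a consumer that parses its string rendering.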
[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21931 Merged build finished. Test PASSed.
[GitHub] spark issue #22187: [SPARK-25178][SQL] Directly ship the StructType objects ...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22187 Thanks! merging to master.
[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21931 **[Test build #95190 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95190/testReport)** for PR 21931 at commit [`6abeb06`](https://github.com/apache/spark/commit/6abeb06e52ed51ff9a30acc5801b0794ce778e2c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #22207: [SPARK-25214][SS]Fix the issue that Kafka v2 sour...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/22207#discussion_r212526373
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDontFailOnDataLossSuite.scala ---
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.util.Properties
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.mutable
+import scala.util.Random
+
+import org.scalatest.time.SpanSugar._
+
+import org.apache.spark.SparkContext
+import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter}
+import org.apache.spark.sql.streaming.{StreamTest, Trigger}
+import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession}
+
+/**
+ * This is a basic test trait which will set up a Kafka cluster that keeps only several records in
+ * a topic and ages out records very quickly. This is a helper trait to test
+ * "failonDataLoss=false" case with missing offsets.
+ *
+ * Note: there is a hard-coded 30 seconds delay (kafka.log.LogManager.InitialTaskDelayMs) to clean
+ * up records. Hence each class extending this trait needs to wait at least 30 seconds (or even
+ * longer when running on a slow Jenkins machine) before records start to be removed. To make sure
+ * a test does see missing offsets, you can check the earliest offset in `eventually` and make sure
+ * it's not 0 rather than sleeping a hard-coded duration.
+ */
+trait KafkaMissingOffsetsTest extends SharedSQLContext {
+
+  protected var testUtils: KafkaTestUtils = _
+
+  override def createSparkSession(): TestSparkSession = {
+    // Set maxRetries to 3 to handle NPE from `poll` when deleting a topic
+    new TestSparkSession(new SparkContext("local[2,3]", "test-sql-context", sparkConf))
+  }
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    testUtils = new KafkaTestUtils {
+      override def brokerConfiguration: Properties = {
+        val props = super.brokerConfiguration
+        // Try to make Kafka clean up messages as fast as possible. However, there is a hard-coded
+        // 30 seconds delay (kafka.log.LogManager.InitialTaskDelayMs) so this test should run at
+        // least 30 seconds.
+        props.put("log.cleaner.backoff.ms", "100")
+        // The size of RecordBatch V2 increases to support transactional write.
+        props.put("log.segment.bytes", "70")
+        props.put("log.retention.bytes", "40")
+        props.put("log.retention.check.interval.ms", "100")
+        props.put("delete.retention.ms", "10")
+        props.put("log.flush.scheduler.interval.ms", "10")
+        props
+      }
+    }
+    testUtils.setup()
+  }
+
+  override def afterAll(): Unit = {
+    if (testUtils != null) {
+      testUtils.teardown()
+      testUtils = null
+    }
+    super.afterAll()
+  }
+}
+
+class KafkaDontFailOnDataLossSuite extends KafkaMissingOffsetsTest {
+
+  import testImplicits._
+
+  private val topicId = new AtomicInteger(0)
+
+  private def newTopic(): String = s"failOnDataLoss-${topicId.getAndIncrement()}"
+
+  /**
+   * @param testStreamingQuery whether to test a streaming query or a batch query.
+   * @param writeToTable the function to write the specified [[DataFrame]] to the given table.
+   */
+  private def verifyMissingOffsetsDontCauseDuplicatedRecords(
+      testStreamingQuery: Boolean)(writeToTable: (DataFrame, String) => Unit): Unit = {
+    val topic = newTopic()
+    testUtils.createTopic(topic, partitions = 1)
+    testUtils.sendMessages(topic, (0 until 50).map(_.toString).toArray)
+
+    eventually(timeout(60.seconds)) {
+      assert(
+        testUtils.getEarliestOffsets(Set(topic)).head._2 > 0,
+        "Kafka didn't delete records after 1 minute")
+    }
+
[GitHub] spark pull request #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-251...
Github user bersprockets closed the pull request at: https://github.com/apache/spark/pull/22079
[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/22042#discussion_r212507190
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala ---
@@ -160,6 +160,23 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext {
     s"AddKafkaData(topics = $topics, data = $data, message = $message)"
   }
+  object WithKafkaProducer {
+    def apply(
+        topic: String,
+        producer: KafkaProducer[String, String])(
--- End diff --
Ping on this comment. Maybe you missed this?
[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/22042#discussion_r212522664
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala ---
@@ -161,6 +161,22 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext with Kaf
     s"AddKafkaData(topics = $topics, data = $data, message = $message)"
   }
+  object WithKafkaProducer {
--- End diff --
nit: This is not creating a KafkaProducer, as most `With***` methods do. The point of this is to force synchronization of the consumer. So maybe rename it to `WithOffsetSync { ... }`?
[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/22042#discussion_r212521083
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala ---
@@ -239,56 +335,74 @@ private[kafka010] case class InternalKafkaConsumer(
   }

   /**
-   * Get the record for the given offset if available. Otherwise it will either throw error
-   * (if failOnDataLoss = true), or return the next available offset within [offset, untilOffset),
-   * or null.
+   * Get the fetched record for the given offset if available.
+   *
+   * If the record is invisible (either a transaction message, or an aborted message when the
+   * consumer's `isolation.level` is `read_committed`), it will return a `FetchedRecord` with the
+   * next offset to fetch.
+   *
+   * This method also will try the best to detect data loss. If `failOnDataLoss` is `true`, it will
+   * throw an exception when we detect an unavailable offset. If `failOnDataLoss` is `false`, this
+   * method will return `null` if the next available record is within [offset, untilOffset).
    *
    * @throws OffsetOutOfRangeException if `offset` is out of range
    * @throws TimeoutException if cannot fetch the record in `pollTimeoutMs` milliseconds.
    */
-  private def fetchData(
+  private def fetchRecord(
       offset: Long,
       untilOffset: Long,
       pollTimeoutMs: Long,
-      failOnDataLoss: Boolean): ConsumerRecord[Array[Byte], Array[Byte]] = {
-    if (offset != nextOffsetInFetchedData || !fetchedData.hasNext()) {
-      // This is the first fetch, or the last pre-fetched data has been drained.
-      // Seek to the offset because we may call seekToBeginning or seekToEnd before this.
-      seek(offset)
-      poll(pollTimeoutMs)
-    }
-
-    if (!fetchedData.hasNext()) {
-      // We cannot fetch anything after `poll`. Two possible cases:
-      // - `offset` is out of range so that Kafka returns nothing. Just throw
-      //   `OffsetOutOfRangeException` to let the caller handle it.
-      // - Cannot fetch any data before timeout. TimeoutException will be thrown.
-      val range = getAvailableOffsetRange()
-      if (offset < range.earliest || offset >= range.latest) {
-        throw new OffsetOutOfRangeException(
-          Map(topicPartition -> java.lang.Long.valueOf(offset)).asJava)
+      failOnDataLoss: Boolean): FetchedRecord = {
+    if (offset != fetchedData.nextOffsetInFetchedData) {
+      // This is the first fetch, or the fetched data has been reset.
+      // Fetch records from Kafka and update `fetchedData`.
+      fetchData(offset, pollTimeoutMs)
+    } else if (!fetchedData.hasNext) { // The last pre-fetched data has been drained.
+      if (offset < fetchedData.offsetAfterPoll) {
+        // Offsets in [offset, fetchedData.offsetAfterPoll) are invisible. Return a record to ask
+        // the next call to start from `fetchedData.offsetAfterPoll`.
+        fetchedData.reset()
+        return fetchedRecord.withRecord(null, fetchedData.offsetAfterPoll)
       } else {
-        throw new TimeoutException(
-          s"Cannot fetch record for offset $offset in $pollTimeoutMs milliseconds")
+        // Fetch records from Kafka and update `fetchedData`.
+        fetchData(offset, pollTimeoutMs)
       }
+    }
+
+    if (!fetchedData.hasNext) {
+      // When we reach here, we have already tried to poll from Kafka. As `fetchedData` is still
+      // empty, all messages in [offset, fetchedData.offsetAfterPoll) are invisible. Return a
+      // record to ask the next call to start from `fetchedData.offsetAfterPoll`.
+      assert(offset <= fetchedData.offsetAfterPoll,
+        s"seek to $offset and poll but the offset was reset to ${fetchedData.offsetAfterPoll}")
+      fetchedRecord.withRecord(null, fetchedData.offsetAfterPoll)
    } else {
      val record = fetchedData.next()
-      nextOffsetInFetchedData = record.offset + 1
      // In general, Kafka uses the specified offset as the start point, and tries to fetch the next
      // available offset. Hence we need to handle offset mismatch.
      if (record.offset > offset) {
+        val range = getAvailableOffsetRange()
+        if (range.earliest <= offset) {
+          // `offset` is still valid but the corresponding message is invisible. We should skip it
+          // and jump to `record.offset`. Here we move `fetchedData` back so that the next call of
+          // `fetchRecord` can just return `record` directly.
+          fetchedData.previous()
+          return fetchedRecord.withRecord(null, record.offset)
+        }
        // This
[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/22042#discussion_r212522432
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala ---
@@ -161,6 +161,22 @@ abstract class KafkaSourceTest extends StreamTest with SharedSQLContext with Kaf
     s"AddKafkaData(topics = $topics, data = $data, message = $message)"
   }
+  object WithKafkaProducer {
+    def apply(
+        topic: String,
+        producer: KafkaProducer[String, String])(
+        func: KafkaProducer[String, String] => Unit): AssertOnQuery = {
--- End diff --
nit: AssertOnQuery -> StreamAction
[GitHub] spark pull request #22042: [SPARK-25005][SS]Support non-consecutive offsets ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/22042#discussion_r212504622
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReadSupport.scala ---
@@ -331,6 +331,7 @@ private[kafka010] case class KafkaMicroBatchPartitionReader(
     offsetRange.topicPartition, executorKafkaParams, reuseKafkaConsumer)
   private val rangeToRead = resolveRange(offsetRange)
+
--- End diff --
unnecessary
[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21087 **[Test build #95194 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95194/testReport)** for PR 21087 at commit [`ebd9265`](https://github.com/apache/spark/commit/ebd926530c1d8b2f515a4a233f5963eafc17e460).
[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21087#discussion_r212521169
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -164,9 +165,12 @@ case class BucketSpec(
     numBuckets: Int,
     bucketColumnNames: Seq[String],
     sortColumnNames: Seq[String]) {
-  if (numBuckets <= 0 || numBuckets >= 10) {
+  def conf: SQLConf = SQLConf.get
+
+  if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) {
--- End diff --
We can merge this PR first.
[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21087 retest this please
[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21087#discussion_r212521145
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -164,9 +165,12 @@ case class BucketSpec(
     numBuckets: Int,
     bucketColumnNames: Seq[String],
     sortColumnNames: Seq[String]) {
-  if (numBuckets <= 0 || numBuckets >= 10) {
+  def conf: SQLConf = SQLConf.get
+
+  if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) {
--- End diff --
Could you submit a followup PR to address this message issue?
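The diff above replaces a hard-coded upper bound with a value read from `SQLConf`. A minimal sketch of that kind of configurable bounds check, with a hypothetical constant standing in for `conf.bucketingMaxBuckets` (the actual Spark default is not shown here):

```java
public class BucketSpecCheck {
    // Hypothetical stand-in for the configurable SQLConf.get.bucketingMaxBuckets value.
    static final int MAX_BUCKETS = 100000;

    /** Rejects bucket counts outside (0, MAX_BUCKETS], as the quoted diff does. */
    static void validate(int numBuckets) {
        if (numBuckets <= 0 || numBuckets > MAX_BUCKETS) {
            throw new IllegalArgumentException(
                "Number of buckets should be greater than 0 but less than or equal to "
                + MAX_BUCKETS + " (got " + numBuckets + ")");
        }
    }

    public static void main(String[] args) {
        validate(8);  // fine
        try {
            validate(0);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Including both the limit and the offending value in the message is the sort of wording fix the "followup PR" comment asks for.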
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21732 **[Test build #95193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95193/testReport)** for PR 21732 at commit [`5f95bd0`](https://github.com/apache/spark/commit/5f95bd0cf1bd308c7df55c41caef7a9f19368f5d).
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2518/
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Merged build finished. Test PASSed.
[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21732#discussion_r212520685 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala --- @@ -19,25 +19,85 @@ package org.apache.spark.sql.execution.aggregate import scala.language.existentials -import org.apache.spark.sql.Encoder +import org.apache.spark.sql.{AnalysisException, Encoder} import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.analysis.UnresolvedDeserializer -import org.apache.spark.sql.catalyst.encoders.encoderFor +import org.apache.spark.sql.catalyst.analysis.{GetColumnByOrdinal, UnresolvedDeserializer} +import org.apache.spark.sql.catalyst.encoders.{encoderFor, ExpressionEncoder} import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateFunction, DeclarativeAggregate, TypedImperativeAggregate} import org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection -import org.apache.spark.sql.catalyst.expressions.objects.Invoke +import org.apache.spark.sql.catalyst.expressions.objects.{AssertNotNull, Invoke, NewInstance, WrapOption} import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.sql.types._ import org.apache.spark.util.Utils object TypedAggregateExpression { + + // Checks if given encoder is for `Option[Product]`. + def isOptProductEncoder(encoder: ExpressionEncoder[_]): Boolean = { +// Only Option[Product] is non-flat. +encoder.clsTag.runtimeClass == classOf[Option[_]] && !encoder.flat + } + + /** + * Flattens serializers and deserializer of given encoder. We only flatten encoder + * of `Option[Product]` class. + */ + def flattenOptProductEncoder(encoder: ExpressionEncoder[_]): ExpressionEncoder[_] = { --- End diff -- I will go to add some tests against this. 
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2517/ Test PASSed.
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21732 **[Test build #95192 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95192/testReport)** for PR 21732 at commit [`9fc3f61`](https://github.com/apache/spark/commit/9fc3f6165156051142a8366a32726badaaa16bb7).
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Merged build finished. Test PASSed.
[GitHub] spark pull request #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstra...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22203
[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22079 @bersprockets Could you please close this PR?
[GitHub] spark issue #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstract meth...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22203 Thanks! Merged to master.
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21320
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2516/ Test PASSed.
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21732 Build finished. Test PASSed.
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21732 @cloud-fan I made an attempt to remove the `topLevel` parameter. The approach is to flatten the serializers and deserializer at `TypedAggregateExpression`, so users are not aware of any difference when using an expression encoder.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21320 Thanks! Merged to master. BTW, we can keep thinking about whether there are other, better solutions for nested column pruning. Also cc @dongjoon-hyun, in case you are interested in the work of supporting ORC nested column pruning.
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21732 **[Test build #95191 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95191/testReport)** for PR 21732 at commit [`ed3d5cb`](https://github.com/apache/spark/commit/ed3d5cb697b10af2e2cf4c78ab521d4d0b2f3c9b).
[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22211 cc @jiangxb1987
[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22206 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95187/ Test PASSed.
[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22206 Merged build finished. Test PASSed.
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Merged build finished. Test PASSed.
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22190 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95188/ Test PASSed.
[GitHub] spark issue #22206: SPARK-25213: Add project to v2 scans before python filte...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22206 **[Test build #95187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95187/testReport)** for PR 22206 at commit [`550368e`](https://github.com/apache/spark/commit/550368eaeebdc87f2c89bad7214f2624784eeb04).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #22190: [SPARK-25188][SQL] Add WriteConfig to v2 write API.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22190 **[Test build #95188 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95188/testReport)** for PR 22190 at commit [`847300f`](https://github.com/apache/spark/commit/847300f76391c5e171e3f54b21bf6f2efc177f0e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21273 @koertkuipers do you want to make a PR to make it configurable?
[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...
Github user gerashegalov commented on the issue: https://github.com/apache/spark/pull/22213 @witgo please take a look since you worked on #2379
[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22213 Can one of the admins verify this patch?
[GitHub] spark pull request #22213: [SPARK-25221][DEPLOY] Consistent trailing whitesp...
GitHub user gerashegalov opened a pull request: https://github.com/apache/spark/pull/22213

[SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf values

## What changes were proposed in this pull request?

Stop trimming values of properties loaded from a file.

## How was this patch tested?

Added a unit test demonstrating the issue hit in production.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gerashegalov/spark gera/SPARK-25221

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22213.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22213

commit 874b2c82a83d94c338d8b1ef9c3b37074cc2e4cc
Author: Gera Shegalov
Date: 2018-08-23T19:09:05Z

    [SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf values
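The behavioral difference the PR describes is easy to picture outside Spark. Below is a minimal plain-Python sketch (not Spark's actual property loader; `load_conf` is an illustrative name) of parsing `key=value` lines while preserving trailing whitespace in values, which is what "stop trimming values" amounts to:

```python
def load_conf(lines):
    """Parse key=value conf lines. The key is trimmed, but the value is
    kept byte-for-byte -- including trailing whitespace -- mirroring the
    proposed behavior for properties loaded from a file."""
    props = {}
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.rstrip("\n").partition("=")
        props[key.strip()] = value  # note: no .strip() on the value
    return props

conf = load_conf(["spark.hadoop.textinputformat.record.delimiter=| \n"])
print(repr(conf["spark.hadoop.textinputformat.record.delimiter"]))  # '| '
```

A trimming loader would silently turn the delimiter `"| "` into `"|"`, which is exactly the kind of production issue the unit test in the PR demonstrates.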
[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22211 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95189/ Test PASSed.
[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22211 Merged build finished. Test PASSed.
[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22211 **[Test build #95189 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95189/testReport)** for PR 22211 at commit [`d269015`](https://github.com/apache/spark/commit/d26901516f414d16f8eb38183053d22dc3b056e3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Merged build finished. Test PASSed.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95185/ Test PASSed.
[GitHub] spark issue #11105: [SPARK-12469][CORE] Data Property accumulators for Spark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/11105 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2515/ Test PASSed.
[GitHub] spark issue #11105: [SPARK-12469][CORE] Data Property accumulators for Spark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/11105 Build finished. Test PASSed.
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 I would suggest, at the least, that when the quote character is changed, the empty value should change accordingly: an empty value of ```""``` makes no sense if the quote character is not ```"```. Also, if we could agree on a quote character that means no quotes at all, then I would suggest changing the empty value back to null when that particular quote character is set, because no quoted empty string makes sense if the user is trying to write out unquoted values.
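The ambiguity being pointed out can be demonstrated with Python's stdlib `csv` module (used here purely as an illustration of CSV quoting semantics, not Spark's parser): the byte sequence `""` only reads back as an empty string when the reader's quote character is actually `"`.

```python
import csv
import io

# With the default quote character '"', the field "" parses as an empty string.
default_row = next(csv.reader(io.StringIO('a,"",c')))
print(default_row)  # ['a', '', 'c']

# With a different quotechar, the very same bytes are two literal characters,
# so an emptyValue of "" no longer denotes "empty" at all.
single_row = next(csv.reader(io.StringIO('a,"",c'), quotechar="'"))
print(single_row)  # ['a', '""', 'c']
```

This is why tying the empty-value marker to the configured quote character, as suggested above, avoids producing output that other readers cannot interpret.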
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21320 **[Test build #95185 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95185/testReport)** for PR 21320 at commit [`e6baf68`](https://github.com/apache/spark/commit/e6baf681e06e229d740af120491d1bf0f426af99).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #22212: [SPARK-25220] Seperate kubernetes node selector config b...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22212 Can one of the admins verify this patch?
[GitHub] spark pull request #22201: [SPARK-25209][SQL] Avoid deserializer check in Da...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22201
[GitHub] spark pull request #22212: [SPARK-25220] Seperate kubernetes node selector c...
GitHub user jweaver-personal opened a pull request: https://github.com/apache/spark/pull/22212

[SPARK-25220] Seperate kubernetes node selector config between driver and executors.

Separated the node selector config option between executors and driver. This removes the spark.kubernetes.node.selector config option and separates it into spark.kubernetes.driver.selector and spark.kubernetes.executor.selector to allow separate node selectors on drivers and executors.

My personal use case for this change is that on AWS we have cheap spot instances that can terminate at any moment, which is okay for executors but not the driver. With a single node selector option I am unable to use the spot instances to save costs on the executor nodes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jweaver-personal/spark jweaver-dev

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22212.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22212

commit a6c6371f8d0ba5e7ff55e83f7c8c955bf24096aa
Author: ?
Date: 2018-08-24T01:55:06Z

    KUBERNETES K8S Separated node selector config option between executors and driver. This removes the spark.kubernetes.node.selector config option and separates it into spark.kubernetes.driver.selector and spark.kubernetes.executor.selector to allow separate node selectors on drivers and executors. My personal use case for this change is that on AWS we have cheap spot instances that can terminate at any moment, which is okay for executors but not the driver. With a single node selector option I am unable to use the spot instances to save costs on the executor nodes.
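Under the split this PR proposes, a job could pin the driver to stable capacity while letting executors use spot capacity. A sketch of what a submission might look like — the node label `lifecycle` and its values, the cluster URL, and the app jar path are all hypothetical, and the final config key names are whatever the PR settles on:

```shell
# Hypothetical example: keep the driver off interruptible capacity,
# let executors run on cheap spot instances.
spark-submit \
  --master k8s://https://example-cluster:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.selector.lifecycle=on-demand \
  --conf spark.kubernetes.executor.selector.lifecycle=spot \
  local:///opt/app/example-app.jar
```

With the single `spark.kubernetes.node.selector.*` option that exists today, both pods would be forced onto the same node set, which is exactly the limitation described above.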
[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21931 **[Test build #95190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95190/testReport)** for PR 21931 at commit [`6abeb06`](https://github.com/apache/spark/commit/6abeb06e52ed51ff9a30acc5801b0794ce778e2c).
[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22063#discussion_r212504250

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[DataType]],
+    nullableTypes: Option[Seq[Boolean]] = None) {
--- End diff --

+1 to all your comments. I'm overhauling this whole PR and will force-push with a rebase once it seems to basically work.
[GitHub] spark pull request #20611: [SPARK-23425][SQL]Support wildcard in HDFS path f...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20611
[GitHub] spark issue #20611: [SPARK-23425][SQL]Support wildcard in HDFS path for load...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20611 Merged to master.
[GitHub] spark issue #22195: [SPARK-25205][CORE] Fix typo in spark.network.crypto.key...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22195 cc @jerryshao
[GitHub] spark pull request #22195: [SPARK-25205][CORE] Fix typo in spark.network.cry...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22195
[GitHub] spark issue #22195: [SPARK-25205][CORE] Fix typo in spark.network.crypto.key...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22195 Merged to master and branch-2.3.
[GitHub] spark pull request #22208: [SPARK-25216][SQL] Improve error message when a c...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22208#discussion_r212500744

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -216,8 +216,16 @@ class Dataset[T] private[sql](
   private[sql] def resolve(colName: String): NamedExpression = {
     queryExecution.analyzed.resolveQuoted(colName, sparkSession.sessionState.analyzer.resolver)
       .getOrElse {
-        throw new AnalysisException(
-          s"""Cannot resolve column name "$colName" among (${schema.fieldNames.mkString(", ")})""")
+        if (schema.fieldNames.contains(colName)) {
+          throw new AnalysisException(
+            s"""Cannot resolve column name "$colName" among (${schema.fieldNames.mkString(", ")}).
+               | Try adding backticks to the column name, i.e., `$colName`"""
--- End diff --

I would explain it, for instance, as: if the name parts in the column should be kept as part of its column name, try quoting them with backticks.
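As a rough illustration of why the backticks matter here (a toy resolver, not Spark's actual `resolveQuoted`): a dotted name is normally split into nested-field parts, while a backquoted name is kept as one literal column name.

```python
def split_column_name(name):
    """Toy version of column-name resolution: a name wrapped in
    backticks is one literal part; otherwise dots separate parts."""
    if len(name) >= 2 and name.startswith("`") and name.endswith("`"):
        return [name[1:-1]]
    return name.split(".")

print(split_column_name("a.b"))    # ['a', 'b'] -> field b of column a
print(split_column_name("`a.b`"))  # ['a.b']    -> one column literally named a.b
```

So a DataFrame whose schema contains a column literally named `a.b` can only be resolved by writing `` `a.b` ``, which is what the improved error message steers users toward.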
[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22171 cc @cloud-fan and @gatorsmile, WDYT? Actually, I happened to hit this myself and thought it had better be fixed.
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21977 Seems okay.
[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22063#discussion_r212499322

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ---
@@ -40,7 +40,14 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
     f: AnyRef,
     dataType: DataType,
-    inputTypes: Option[Seq[DataType]]) {
+    inputTypes: Option[Seq[DataType]],
+    nullableTypes: Option[Seq[Boolean]] = None) {
--- End diff --

or `Option[Seq[ScalaReflection.Schema]]`
[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22063#discussion_r212499164

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
@@ -57,8 +59,21 @@ case class ScalaUDF(
     children: Seq[Expression],
     inputTypes: Seq[DataType],
     udfName: Option[String]) = {
-    this(
-      function, dataType, children, inputTypes, udfName, nullable = true, udfDeterministic = true)
+    this(function, dataType, children, inputTypes, udfName,
+      nullable = true, udfDeterministic = true, nullableTypes = Nil)
+  }
+
+  // Constructor from Spark 2.3
--- End diff --

By convention, everything under the catalyst package is private, so compatibility is not a concern here.
[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21977#discussion_r212499194

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -60,14 +61,20 @@ private[spark] object PythonEvalType {
  */
 private[spark] abstract class BasePythonRunner[IN, OUT](
     funcs: Seq[ChainedPythonFunctions],
-    bufferSize: Int,
-    reuseWorker: Boolean,
     evalType: Int,
-    argOffsets: Array[Array[Int]])
+    argOffsets: Array[Array[Int]],
+    conf: SparkConf)
   extends Logging {

   require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")

+  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
+  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
+  // each python worker gets an equal part of the allocation. the worker pool will grow to the
+  // number of concurrent tasks, which is determined by the number of cores in this executor.
+  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+    .map(_ / conf.getInt("spark.executor.cores", 1))
--- End diff --

> I don't think that the DataBricks style guide applies to Apache projects.

I sent an email to the dev mailing list - http://apache-spark-developers-list.1001551.n3.nabble.com/Porting-or-explicitly-linking-project-style-in-Apache-Spark-based-on-https-github-com-databricks-scae-td24790.html

I was thinking 2 indents for continuation lines are more common in the codebase and thought it better to follow this.
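The `memoryMb` expression in the diff above just gives each Python worker an equal slice of the executor's pyspark memory, since the worker pool grows to one worker per executor core. A sketch of that arithmetic (the function name is illustrative, not part of Spark):

```python
def per_worker_memory_mb(pyspark_memory_mb, executor_cores=1):
    """Each python worker gets an equal part of the executor's pyspark
    memory allocation; the worker pool grows to `executor_cores`
    concurrent workers, so divide the allocation evenly among them."""
    return pyspark_memory_mb // executor_cores

print(per_worker_memory_mb(4096, 4))  # 1024
print(per_worker_memory_mb(512))      # 512 (default of one core)
```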
[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...
Github user ankuriitg commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r212498716

--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
@@ -350,11 +350,16 @@ private[spark] class AppStatusListener(
       val e = it.next()
       if (job.stageIds.contains(e.getKey()._1)) {
         val stage = e.getValue()
-        stage.status = v1.StageStatus.SKIPPED
-        job.skippedStages += stage.info.stageId
-        job.skippedTasks += stage.info.numTasks
-        it.remove()
-        update(stage, now)
+        // Only update the stage if it has not finished already
+        if (v1.StageStatus.ACTIVE.equals(stage.status) ||
+            v1.StageStatus.PENDING.equals(stage.status)) {
+          stage.status = v1.StageStatus.SKIPPED
+          job.skippedStages += stage.info.stageId
+          job.skippedTasks += stage.info.numTasks
+          job.activeStages -= 1
+          it.remove()
--- End diff --

I think the assumption here is that we will always receive the onStageCompleted event before the onJobEnd event. If that does not occur for some reason, then any active stages are marked as skipped. I don't know of a scenario where the onStageCompleted event is not received before the onJobEnd event (or not received at all); let me look further into it. Additionally, I will also fix the bug for updating the pool.
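The guard being discussed boils down to a simple rule: at job end, only stages that never finished (active or pending) get re-labelled as skipped, while completed or failed stages keep their final status. A standalone sketch of that rule (the enum mirrors `v1.StageStatus`; the function is illustrative):

```python
from enum import Enum, auto

class StageStatus(Enum):
    ACTIVE = auto()
    PENDING = auto()
    COMPLETE = auto()
    FAILED = auto()
    SKIPPED = auto()

def status_at_job_end(status):
    """Only stages that had not finished when the job ended are
    marked SKIPPED; completed/failed stages keep their final status."""
    if status in (StageStatus.ACTIVE, StageStatus.PENDING):
        return StageStatus.SKIPPED
    return status

print(status_at_job_end(StageStatus.PENDING))   # StageStatus.SKIPPED
print(status_at_job_end(StageStatus.COMPLETE))  # StageStatus.COMPLETE
```

Without the guard, a stage whose onStageCompleted event arrived before onJobEnd would be incorrectly flipped back to SKIPPED, which is the stage-page bug this PR fixes.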
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21320 Seems fine to me too for the similar reasons.
[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22206#discussion_r212497182

--- Diff: python/pyspark/sql/tests.py ---
@@ -6394,6 +6394,17 @@ def test_invalid_args(self):
             df.withColumn('mean_v', mean_udf(df['v']).over(ow))

+class DataSourceV2Tests(ReusedSQLTestCase):
+    def test_pyspark_udf_SPARK_25213(self):
+        from pyspark.sql.functions import udf
+
+        df = self.spark.read.format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2").load()
+        result = df.withColumn('x', udf(lambda x: x, 'int')(df['i']))
--- End diff --

This only tests Project with a scalar PythonUDF? It might be better to also test the Filter case.
[GitHub] spark pull request #22206: SPARK-25213: Add project to v2 scans before pytho...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22206#discussion_r212496291

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -130,10 +133,22 @@ object DataSourceV2Strategy extends Strategy {
       config)

     val filterCondition = postScanFilters.reduceLeftOption(And)
-    val withFilter = filterCondition.map(FilterExec(_, scan)).getOrElse(scan)
+
+    val withFilter = if (filterCondition.exists(hasScalarPythonUDF)) {
+      // add a projection before FilterExec to ensure that the rows are converted to unsafe
+      val filterExpr = filterCondition.get
+      FilterExec(filterExpr, ProjectExec(filterExpr.references.toSeq, scan))
+    } else {
+      filterCondition.map(FilterExec(_, scan)).getOrElse(scan)
+    }

     // always add the projection, which will produce unsafe rows required by some operators
-    ProjectExec(project, withFilter) :: Nil
+    if (project.exists(hasScalarPythonUDF)) {
+      val references = project.map(_.references).reduce(_ ++ _).toSeq
+      ProjectExec(project, ProjectExec(references, withFilter)) :: Nil
--- End diff --

OK. Let's leave it as it is for now.
[GitHub] spark issue #22210: [SPARK-25218][Core]Fix potential resource leaks in Trans...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22210 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95184/ Test PASSed.
[GitHub] spark issue #22210: [SPARK-25218][Core]Fix potential resource leaks in Trans...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22210 Merged build finished. Test PASSed.
[GitHub] spark issue #22210: [SPARK-25218][Core]Fix potential resource leaks in Trans...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22210 **[Test build #95184 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95184/testReport)** for PR 22210 at commit [`6f2248d`](https://github.com/apache/spark/commit/6f2248d45716332ac78e44b1314011806f59deb8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Merged build finished. Test PASSed.