[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20648

> allows the length of tokens to be shorter than its schema, putting nulls (or NA) into missing fields

Actually I also recall this is a valid case for CSV, and I remember that we did this intentionally. How much can we clean up if we want to keep this behavior in CSV? If it's not a lot, maybe we don't need to bother with refactoring.
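For concreteness, a minimal sketch of the CSV behavior under discussion (the data and session setup are illustrative, not from the thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("csv-short-rows").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType),
  StructField("c", StringType)))

// The second row has fewer tokens than the schema: no value for column c.
val lines = Seq("1,x,y", "2,x").toDS()

// In the default PERMISSIVE mode the CSV reader fills the missing trailing
// field with null instead of marking the whole row malformed -- the behavior
// cloud-fan recalls being intentional.
spark.read.schema(schema).csv(lines).show()
// Expected (illustrative) output:
// +---+---+----+
// |  a|  b|   c|
// +---+---+----+
// |  1|  x|   y|
// |  2|  x|null|
// +---+---+----+
```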
[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20572#discussion_r170279504

--- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala ---
@@ -64,6 +69,41 @@ class KafkaRDDSuite extends SparkFunSuite with BeforeAndAfterAll {

   private val preferredHosts = LocationStrategies.PreferConsistent

+  private def compactLogs(topic: String, partition: Int, messages: Array[(String, String)]) {
+    val mockTime = new MockTime()
+    // LogCleaner in 0.10 version of Kafka is still expecting the old TopicAndPartition api
+    val logs = new Pool[TopicAndPartition, Log]()
+    val logDir = kafkaTestUtils.brokerLogDir
+    val dir = new java.io.File(logDir, topic + "-" + partition)
+    dir.mkdirs()
+    val logProps = new ju.Properties()
+    logProps.put(LogConfig.CleanupPolicyProp, LogConfig.Compact)
+    logProps.put(LogConfig.MinCleanableDirtyRatioProp, 0.1f: java.lang.Float)
+    val log = new Log(
+      dir,
+      LogConfig(logProps),
+      0L,
+      mockTime.scheduler,
+      mockTime
+    )
+    messages.foreach { case (k, v) =>
+      val msg = new ByteBufferMessageSet(

--- End diff --

Unindent one level?
[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20572#discussion_r170278317

--- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala ---
@@ -162,17 +162,22 @@ private[kafka010] class KafkaTestUtils extends Logging {
   }

   /** Create a Kafka topic and wait until it is propagated to the whole cluster */
-  def createTopic(topic: String, partitions: Int): Unit = {
-    AdminUtils.createTopic(zkUtils, topic, partitions, 1)
+  def createTopic(topic: String, partitions: Int, config: Properties): Unit = {
+    AdminUtils.createTopic(zkUtils, topic, partitions, 1, config)
     // wait until metadata is propagated
     (0 until partitions).foreach { p =>
       waitUntilMetadataIsPropagated(topic, p)
     }
   }

+  /** Create a Kafka topic and wait until it is propagated to the whole cluster */
+  def createTopic(topic: String, partitions: Int): Unit = {
+    createTopic(topic, partitions, new Properties)
+  }
+
   /** Create a Kafka topic and wait until it is propagated to the whole cluster */
   def createTopic(topic: String): Unit = {
-    createTopic(topic, 1)
+    createTopic(topic, 1, new Properties)

--- End diff --

Nit: `new Properties()`
[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20572#discussion_r170278931

--- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala ---
@@ -71,25 +69,62 @@ class CachedKafkaConsumer[K, V] private(
   }

     if (!buffer.hasNext()) { poll(timeout) }
-    assert(buffer.hasNext(),
+    require(buffer.hasNext(),
       s"Failed to get records for $groupId $topic $partition $offset after polling for $timeout")
     var record = buffer.next()

     if (record.offset != offset) {
       logInfo(s"Buffer miss for $groupId $topic $partition $offset")
       seek(offset)
       poll(timeout)
-      assert(buffer.hasNext(),
+      require(buffer.hasNext(),
         s"Failed to get records for $groupId $topic $partition $offset after polling for $timeout")
       record = buffer.next()
-      assert(record.offset == offset,
-        s"Got wrong record for $groupId $topic $partition even after seeking to offset $offset")
+      require(record.offset == offset,
+        s"Got wrong record for $groupId $topic $partition even after seeking to offset $offset " +
+          s"got offset ${record.offset} instead. If this is a compacted topic, consider enabling " +
+          "spark.streaming.kafka.allowNonConsecutiveOffsets"
+      )
     }

     nextOffset = offset + 1
     record
   }

+  /**
+   * Start a batch on a compacted topic
+   */
+  def compactedStart(offset: Long, timeout: Long): Unit = {
+    logDebug(s"compacted start $groupId $topic $partition starting $offset")
+    // This seek may not be necessary, but it's hard to tell due to gaps in compacted topics
+    if (offset != nextOffset) {
+      logInfo(s"Initial fetch for compacted $groupId $topic $partition $offset")
+      seek(offset)
+      poll(timeout)
+    }
+  }
+
+  /**
+   * Get the next record in the batch from a compacted topic.
+   * Assumes compactedStart has been called first, and ignores gaps.
+   */
+  def compactedNext(timeout: Long): ConsumerRecord[K, V] = {
+    if (!buffer.hasNext()) { poll(timeout) }

--- End diff --

Nit: I'd expand this onto two lines
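The expansion being asked for is purely stylistic; presumably a fragment of the quoted method like:

```scala
if (!buffer.hasNext()) {
  poll(timeout)
}
```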
[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20572#discussion_r170278685

--- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/mocks/MockScheduler.scala ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka010.mocks
+
+import java.util.concurrent.TimeUnit
+
+import scala.collection.mutable.PriorityQueue
+
+import kafka.utils.{Scheduler, Time}
+
+/**
+ * A mock scheduler that executes tasks synchronously using a mock time instance.
+ * Tasks are executed synchronously when the time is advanced.
+ * This class is meant to be used in conjunction with MockTime.
+ *
+ * Example usage
+ *
+ * val time = new MockTime
+ * time.scheduler.schedule("a task", println("hello world: " + time.milliseconds), delay = 1000)
+ * time.sleep(1001) // this should cause our scheduled task to fire
+ *
+ * Incrementing the time to the exact next execution time of a task will result in that task
+ * executing (it as if execution itself takes no time).
+ */
+private[kafka010] class MockScheduler(val time: Time) extends Scheduler {
+
+  /* a priority queue of tasks ordered by next execution time */
+  var tasks = new PriorityQueue[MockTask]()
+
+  def isStarted: Boolean = true
+
+  def startup(): Unit = {}
+
+  def shutdown(): Unit = synchronized {
+    tasks.foreach(_.fun())
+    tasks.clear()
+  }
+
+  /**
+   * Check for any tasks that need to execute. Since this is a mock scheduler this check only
+   * occurs when this method is called and the execution happens synchronously in the calling
+   * thread. If you are using the scheduler associated with a MockTime instance this call
+   * will be triggered automatically.
+   */
+  def tick() {
+    this synchronized {
+      val now = time.milliseconds
+      while(!tasks.isEmpty && tasks.head.nextExecution <= now) {
+        /* pop and execute the task with the lowest next execution time */
+        val curr = tasks.dequeue
+        curr.fun()
+        /* if the task is periodic, reschedule it and re-enqueue */
+        if(curr.periodic) {
+          curr.nextExecution += curr.period
+          this.tasks += curr
+        }
+      }
+    }
+  }
+
+  def schedule(
+      name: String,
+      fun: () => Unit,
+      delay: Long = 0,
+      period: Long = -1,
+      unit: TimeUnit = TimeUnit.MILLISECONDS) {
+    this synchronized {

--- End diff --

I think I'd still write such methods as:

```
def foo(): T = synchronized {
  ...
}
```

This is how the rest of the code base does it (and other Scala code I've seen).
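Applied to the `schedule` method in this diff, the suggested convention would presumably look like the following sketch (the method body is elided in the quoted diff, so only the signature change is shown):

```scala
def schedule(
    name: String,
    fun: () => Unit,
    delay: Long = 0,
    period: Long = -1,
    unit: TimeUnit = TimeUnit.MILLISECONDS): Unit = synchronized {
  // ... original scheduling body, unchanged ...
}
```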
[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20572#discussion_r170279950

--- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala ---
@@ -64,6 +69,41 @@ class KafkaRDDSuite extends SparkFunSuite with BeforeAndAfterAll {

   private val preferredHosts = LocationStrategies.PreferConsistent

+  private def compactLogs(topic: String, partition: Int, messages: Array[(String, String)]) {
+    val mockTime = new MockTime()
+    // LogCleaner in 0.10 version of Kafka is still expecting the old TopicAndPartition api
+    val logs = new Pool[TopicAndPartition, Log]()
+    val logDir = kafkaTestUtils.brokerLogDir
+    val dir = new java.io.File(logDir, topic + "-" + partition)
+    dir.mkdirs()
+    val logProps = new ju.Properties()
+    logProps.put(LogConfig.CleanupPolicyProp, LogConfig.Compact)
+    logProps.put(LogConfig.MinCleanableDirtyRatioProp, 0.1f: java.lang.Float)

--- End diff --

Do you have to 'cast' this to a Java Float object to get it to compile? `java.lang.Float.valueOf(0.1f)` works too I guess, but equally weird. OK if it's required.
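Both forms compile: `Properties.put` takes `java.lang.Object`, so the Scala `Float` gets boxed either way, and the type ascription just makes the boxed type explicit. A standalone illustration (the key string is only an example):

```scala
import java.util.Properties

val props = new Properties()
// Type ascription boxes the Scala Float literal to java.lang.Float:
props.put("min.cleanable.dirty.ratio", 0.1f: java.lang.Float)
// The explicit alternative srowen mentions:
props.put("min.cleanable.dirty.ratio", java.lang.Float.valueOf(0.1f))
```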
[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20572#discussion_r170278078

--- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala ---
@@ -64,6 +69,41 @@ class KafkaRDDSuite extends SparkFunSuite with BeforeAndAfterAll {

   private val preferredHosts = LocationStrategies.PreferConsistent

+  private def compactLogs(topic: String, partition: Int, messages: Array[(String, String)]) {
+    val mockTime = new MockTime()
+    // LogCleaner in 0.10 version of Kafka is still expecting the old TopicAndPartition api
+    val logs = new Pool[TopicAndPartition, Log]()
+    val logDir = kafkaTestUtils.brokerLogDir
+    val dir = new java.io.File(logDir, topic + "-" + partition)

--- End diff --

Import `File`, other `java.*` classes? Maybe I'm missing a name conflict.
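The simplification being suggested is presumably just (a fragment; `logDir`, `topic` and `partition` are the locals from the quoted diff):

```scala
import java.io.File

// With the import in place, the construction no longer needs the full package path.
val dir = new File(logDir, topic + "-" + partition)
```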
[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20572#discussion_r170277915

--- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala ---
@@ -172,57 +187,138 @@ private[spark] class KafkaRDD[K, V](

   override def compute(thePart: Partition, context: TaskContext): Iterator[ConsumerRecord[K, V]] = {
     val part = thePart.asInstanceOf[KafkaRDDPartition]
-    assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
+    require(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
     if (part.fromOffset == part.untilOffset) {
       logInfo(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
         s"skipping ${part.topic} ${part.partition}")
       Iterator.empty
     } else {
-      new KafkaRDDIterator(part, context)
+      logInfo(s"Computing topic ${part.topic}, partition ${part.partition} " +
+        s"offsets ${part.fromOffset} -> ${part.untilOffset}")
+      if (compacted) {
+        new CompactedKafkaRDDIterator[K, V](
+          part,
+          context,
+          kafkaParams,
+          useConsumerCache,
+          pollTimeout,
+          cacheInitialCapacity,
+          cacheMaxCapacity,
+          cacheLoadFactor
+        )
+      } else {
+        new KafkaRDDIterator[K, V](
+          part,
+          context,
+          kafkaParams,
+          useConsumerCache,
+          pollTimeout,
+          cacheInitialCapacity,
+          cacheMaxCapacity,
+          cacheLoadFactor
+        )
+      }
     }
   }
+}

-  /**
-   * An iterator that fetches messages directly from Kafka for the offsets in partition.
-   * Uses a cached consumer where possible to take advantage of prefetching
-   */
-  private class KafkaRDDIterator(
-      part: KafkaRDDPartition,
-      context: TaskContext) extends Iterator[ConsumerRecord[K, V]] {
-
-    logInfo(s"Computing topic ${part.topic}, partition ${part.partition} " +
-      s"offsets ${part.fromOffset} -> ${part.untilOffset}")
-
-    val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String]
+/**
+ * An iterator that fetches messages directly from Kafka for the offsets in partition.
+ * Uses a cached consumer where possible to take advantage of prefetching
+ */
+private class KafkaRDDIterator[K, V](
+    part: KafkaRDDPartition,
+    context: TaskContext,
+    kafkaParams: ju.Map[String, Object],
+    useConsumerCache: Boolean,
+    pollTimeout: Long,
+    cacheInitialCapacity: Int,
+    cacheMaxCapacity: Int,
+    cacheLoadFactor: Float
+) extends Iterator[ConsumerRecord[K, V]] {
+
+  val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String]
+
+  context.addTaskCompletionListener{ context => closeIfNeeded() }

--- End diff --

This could be `...(_ => closeIfNeeded())`
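A self-contained illustration of the nit (the `Listener` trait here is a stand-in so the example compiles on its own, not a real Spark API):

```scala
// Minimal stand-in for the TaskContext listener-registration API.
trait Listener {
  def addTaskCompletionListener(f: AnyRef => Unit): Unit
}

def closeIfNeeded(): Unit = println("closing")

def register(context: Listener): Unit = {
  // Before: the lambda parameter shadows the outer `context` and is never used.
  context.addTaskCompletionListener { context => closeIfNeeded() }
  // srowen's suggestion: an underscore makes the unused parameter explicit.
  context.addTaskCompletionListener(_ => closeIfNeeded())
}
```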
[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20572#discussion_r170279150

--- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala ---
@@ -87,47 +89,60 @@ private[spark] class KafkaRDD[K, V](
     }.toArray
   }

-  override def count(): Long = offsetRanges.map(_.count).sum
+  override def count(): Long =
+    if (compacted) {
+      super.count()
+    } else {
+      offsetRanges.map(_.count).sum
+    }

   override def countApprox(
       timeout: Long,
       confidence: Double = 0.95
-  ): PartialResult[BoundedDouble] = {
-    val c = count
-    new PartialResult(new BoundedDouble(c, 1.0, c, c), true)
-  }
+  ): PartialResult[BoundedDouble] =
+    if (compacted) {
+      super.countApprox(timeout, confidence)
+    } else {
+      val c = count
+      new PartialResult(new BoundedDouble(c, 1.0, c, c), true)
+    }

-  override def isEmpty(): Boolean = count == 0L
+  override def isEmpty(): Boolean =
+    if (compacted) {
+      super.isEmpty()
+    } else {
+      count == 0L
+    }

-  override def take(num: Int): Array[ConsumerRecord[K, V]] = {
-    val nonEmptyPartitions = this.partitions
-      .map(_.asInstanceOf[KafkaRDDPartition])
-      .filter(_.count > 0)
+  override def take(num: Int): Array[ConsumerRecord[K, V]] =
+    if (compacted) {
+      super.take(num)
+    } else {
+      val nonEmptyPartitions = this.partitions
+        .map(_.asInstanceOf[KafkaRDDPartition])
+        .filter(_.count > 0)

-    if (num < 1 || nonEmptyPartitions.isEmpty) {
-      return new Array[ConsumerRecord[K, V]](0)
-    }
+      if (num < 1 || nonEmptyPartitions.isEmpty) {

--- End diff --

I guess you could check `num < 1` before the map/filter, but it's trivial. You could write `return Array.empty[ConsumerRecord[K,V]]` too; again trivial. Since this is existing code I could see not touching it as well.
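One way the early `num < 1` check could look (a sketch against the quoted diff; the rest of the method is elided in the quote, so it is elided here too):

```scala
override def take(num: Int): Array[ConsumerRecord[K, V]] =
  if (compacted) {
    super.take(num)
  } else if (num < 1) {
    // Bail out before the map/filter over partitions, as suggested.
    Array.empty[ConsumerRecord[K, V]]
  } else {
    val nonEmptyPartitions = this.partitions
      .map(_.asInstanceOf[KafkaRDDPartition])
      .filter(_.count > 0)
    if (nonEmptyPartitions.isEmpty) {
      Array.empty[ConsumerRecord[K, V]]
    } else {
      // ... existing take logic over nonEmptyPartitions ...
    }
  }
```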
[GitHub] spark pull request #20664: [SPARK-23496][CORE] Locality of coalesced partiti...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20664#discussion_r170279656

--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1129,6 +1129,36 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
     }.collect()
   }

+  test("SPARK-23496: order of input partitions can result in severe skew in coalesce") {

--- End diff --

I see, thanks, sorry, I missed it
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20553

IIUC the `spark.kubernetes.executor.cores` here is just a special case of `spark.executor.cores` for the k8s backend; you still have to handle float values if you're to read the value of `spark.kubernetes.executor.cores`, for instance, in dynamic allocation. If that is the case, I don't see much benefit in bringing in a new conf here.
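For context, the fallback read under discussion would presumably look something like this sketch (illustrative only, not code from the PR; the default value is an assumption):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Prefer the k8s-specific setting, falling back to the generic one.
val requested = conf.getOption("spark.kubernetes.executor.cores")
  .getOrElse(conf.get("spark.executor.cores", "1"))
// On Kubernetes the value may be fractional (e.g. "0.5"), hence toDouble, not toInt.
val cores = requested.toDouble
```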
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20553

also cc @cloud-fan @jerryshao
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20553

How do we plan to support dynamic allocation with k8s? Should we read `spark.executor.cores` or `spark.kubernetes.executor.cores`?
[GitHub] spark pull request #20664: [SPARK-23496][CORE] Locality of coalesced partiti...
Github user ala commented on a diff in the pull request: https://github.com/apache/spark/pull/20664#discussion_r170277224

--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1129,6 +1129,36 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
     }.collect()
   }

+  test("SPARK-23496: order of input partitions can result in severe skew in coalesce") {

--- End diff --

The test is in fact deterministic. The seed is already fixed here: https://github.com/apache/spark/blob/049f243c59737699fee54fdc9d65cbd7c788032a/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L163
[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20648

+1 for disallowing it if that was Wenchen's opinion too. Please go ahead; I'll help double-check anyway.
[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20648

Yup, +1 for starting this by disallowing, but to my knowledge R's read.csv allows the length of tokens to be shorter than its schema, putting nulls (or NA) into missing fields, as a valid case. I was thinking of disallowing the partial results but allowing tokens fewer than the schema as a valid case in CSV. I need to double-check R's read.csv behaviour and the current behaviour, but that was roughly my thought.
[GitHub] spark pull request #20664: [SPARK-23496][CORE] Locality of coalesced partiti...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20664#discussion_r170271696

--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1129,6 +1129,36 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
     }.collect()
   }

+  test("SPARK-23496: order of input partitions can result in severe skew in coalesce") {

--- End diff --

this test looks to me like a good candidate for flakiness, since we are picking random numbers. Can we set the seed in order to avoid this?
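Fixing a seed looks like this in Scala (illustrative; as ala points out elsewhere in this thread, the seed inside `CoalescedRDD` is in fact already fixed):

```scala
import scala.util.Random

// A fixed seed makes the pseudo-random sequence identical on every run,
// removing this source of test flakiness.
val rnd = new Random(42L)
val sampled = Seq.fill(5)(rnd.nextInt(100))
println(sampled) // prints the same five numbers on every run
```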
[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20648

@HyukjinKwon From the documentation of `DataFrameReader.csv`, the behavior of the CSV reader isn't consistent with the document:

```
`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts the malformed string into a field configured by `columnNameOfCorruptRecord`.
```

With respect to the document, I think we may need to disable it for CSV too. What do you think?
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20662

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87630/
Test PASSed.
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20662

Merged build finished. Test PASSed.
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20662

**[Test build #87630 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87630/testReport)** for PR 20662 at commit [`bab27e6`](https://github.com/apache/spark/commit/bab27e6ebe415508101b7f09ba9f3f04801aa818).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20664: [SPARK-23496][CORE] Locality of coalesced partitions can...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20664

**[Test build #87632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87632/testReport)** for PR 20664 at commit [`6d67dfc`](https://github.com/apache/spark/commit/6d67dfc1d4c012492b97873beaac5a7cbfd6f55a).
[GitHub] spark issue #20664: [SPARK-23496][CORE] Locality of coalesced partitions can...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20664

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1019/
Test PASSed.
[GitHub] spark issue #20664: [SPARK-23496][CORE] Locality of coalesced partitions can...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20664

Merged build finished. Test PASSed.
[GitHub] spark issue #20664: [SPARK-23496][CORE] Locality of coalesced partitions can...
Github user ala commented on the issue: https://github.com/apache/spark/pull/20664

@hvanhovell
[GitHub] spark pull request #20664: [SPARK-23496][CORE] Locality of coalesced partiti...
GitHub user ala opened a pull request: https://github.com/apache/spark/pull/20664

[SPARK-23496][CORE] Locality of coalesced partitions can be severely skewed by the order of input partitions

## What changes were proposed in this pull request?

The algorithm in `DefaultPartitionCoalescer.setupGroups` is responsible for picking preferred locations for coalesced partitions. It analyzes the preferred locations of input partitions. It starts by trying to create one partition for each unique location in the input. However, if the requested number of coalesced partitions is higher than the number of unique locations, it has to pick duplicate locations.

Previously, the duplicate locations would be picked by iterating over the input partitions in order, and copying their preferred locations to coalesced partitions. If the input partitions were clustered by location, this could result in severe skew.

With the fix, instead of iterating over the list of input partitions in order, we pick them at random. It's not perfectly balanced, but it's much better.

## How was this patch tested?

Unit test reproducing the behavior was added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ala/spark SPARK-23496

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20664.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20664

commit 6d67dfc1d4c012492b97873beaac5a7cbfd6f55a
Author: Ala Luszczak
Date: 2018-02-23T14:37:19Z

    Fix SPARK-23496.
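The spirit of the change, reduced to a toy sketch (this is not the actual `setupGroups` code; host names and counts are made up):

```scala
import scala.util.Random

// Input partitions clustered by location: first 50 on host-a, next 50 on host-b.
val partitionLocs = Seq.fill(50)("host-a") ++ Seq.fill(50)("host-b")
val numCoalesced = 10

// Old behavior in spirit: walking partitions in input order assigns every
// duplicate-location slot to host-a, producing severe skew.
val inOrder = partitionLocs.take(numCoalesced)
println(inOrder.groupBy(identity).map { case (k, v) => k -> v.size })
// Map(host-a -> 10)

// New behavior in spirit: picking partitions at random spreads the slots
// across locations (a fixed seed keeps the result deterministic for tests).
val rnd = new Random(7L)
val randomPick = rnd.shuffle(partitionLocs).take(numCoalesced)
println(randomPick.groupBy(identity).map { case (k, v) => k -> v.size })
// roughly even split between host-a and host-b
```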
[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20648

Yup, it's unsupported in JSON but CSV supports it. Do you mean to disallow CSV too, or simply clean up the JSON code path?
[GitHub] spark pull request #20649: [SPARK-23462][SQL] improve missing field error me...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20649#discussion_r170265519

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StructTypeSuite.scala ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.types.StructType
+
+class StructTypeSuite extends SparkFunSuite {
+
+  test("SPARK-23462 lookup a single missing field should output existing fields") {

--- End diff --

nit: Would it be better to combine these three tests into one? It could reduce the total test time.
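Combined into one test as suggested, it might look like the following sketch (the expected message matches the quoted tests, i.e. before any comma-spacing change discussed elsewhere in this review):

```scala
test("SPARK-23462 lookups of missing fields should output existing fields") {
  val s = StructType.fromDDL("a INT, b STRING")

  def check(body: => Any): Unit = {
    val e = intercept[IllegalArgumentException](body).getMessage
    assert(e.contains("Available fields: a,b"))
  }

  check(s("c"))            // single missing field
  check(s(Set("a", "c")))  // set containing a missing field
  check(s.fieldIndex("c")) // fieldIndex on a missing field
}
```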
[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20648

I'll close this PR and create another PR to refactor the JSON parser and related code. Thanks @cloud-fan and @HyukjinKwon.
[GitHub] spark pull request #20648: [SPARK-23448][SQL] JSON parser should return part...
Github user viirya closed the pull request at: https://github.com/apache/spark/pull/20648
[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20648

According to offline discussion with @cloud-fan, partial results are not supported at all now. We should refactor the code to make this clear and reduce confusion.
[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20648

How about we start this by disallowing the partial results at all, documenting the behaviour and matching the behaviour to R's `read.csv(...)` in case of CSV (in terms of which case is an error or not)?
[GitHub] spark pull request #20649: [SPARK-23462][SQL] improve missing field error me...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20649#discussion_r170256168

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -297,7 +300,9 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
    */
   def fieldIndex(name: String): Int = {
     nameToIndex.getOrElse(name,
-      throw new IllegalArgumentException(s"""Field "$name" does not exist."""))
+      throw new IllegalArgumentException(
+        s"""Field "$name" does not exist.
+           |Available fields: ${fieldNamesSet.mkString(",")}""".stripMargin))

--- End diff --

ditto
[GitHub] spark pull request #20649: [SPARK-23462][SQL] improve missing field error me...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20649#discussion_r170256000

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -271,7 +271,9 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
    */
   def apply(name: String): StructField = {
     nameToField.getOrElse(name,
-      throw new IllegalArgumentException(s"""Field "$name" does not exist."""))
+      throw new IllegalArgumentException(
+        s"""Field "$name" does not exist.
+           |Available fields: ${fieldNamesSet.mkString(",")}""".stripMargin))

--- End diff --

nit: can we add a space after the comma? I think it can help readability
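The asked-for change is just the separator string (a quick REPL illustration):

```scala
val fields = Seq("a", "b", "c")
fields.mkString(",")   // "a,b,c"
fields.mkString(", ")  // "a, b, c" -- easier to read in an error message
```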
[GitHub] spark pull request #20649: [SPARK-23462][SQL] improve missing field error me...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20649#discussion_r170256107

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -284,7 +286,8 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
     val nonExistFields = names -- fieldNamesSet
     if (nonExistFields.nonEmpty) {
       throw new IllegalArgumentException(
-        s"Field ${nonExistFields.mkString(",")} does not exist.")
+        s"""Fields ${nonExistFields.mkString(",")} does not exist.

--- End diff --

ditto
[GitHub] spark pull request #20649: [SPARK-23462][SQL] improve missing field error me...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20649#discussion_r170256030

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -284,7 +286,8 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
     val nonExistFields = names -- fieldNamesSet
     if (nonExistFields.nonEmpty) {
       throw new IllegalArgumentException(
-        s"Field ${nonExistFields.mkString(",")} does not exist.")
+        s"""Fields ${nonExistFields.mkString(",")} does not exist.
+           |Available fields: ${fieldNamesSet.mkString(",")}""".stripMargin)

--- End diff --

ditto
[GitHub] spark pull request #20649: [SPARK-23462][SQL] improve missing field error me...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20649#discussion_r170256385

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StructTypeSuite.scala ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.types.StructType
+
+class StructTypeSuite extends SparkFunSuite {
+
+  test("SPARK-23462 lookup a single missing field should output existing fields") {
+    val s = StructType.fromDDL("a INT, b STRING")
+    val e = intercept[IllegalArgumentException](s("c")).getMessage
+    assert(e.contains("Available fields: a,b"))
+  }
+
+  test("SPARK-23462 lookup a set of missing fields should output existing fields") {
+    val s = StructType.fromDDL("a INT, b STRING")
+    val e = intercept[IllegalArgumentException](s(Set("a", "c"))).getMessage
+    assert(e.contains("Available fields: a,b"))
+  }
+
+  test("SPARK-23462 lookup fieldIndex for missing field should output existing fields") {
+    val s = StructType.fromDDL("a INT, b STRING")
+    val e = intercept[IllegalArgumentException](s.fieldIndex("c")).getMessage
+    assert(e.contains("Available fields: a,b"))
+  }
+

--- End diff --

nit: this empty line should be removed
[GitHub] spark issue #20653: [SPARK-23459][SQL] Improve the error message when unknow...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20653

LGTM
[GitHub] spark issue #20663: [SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPage in or...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20663

Merged build finished. Test PASSed.
[GitHub] spark issue #20663: [SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPage in or...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20663

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1018/
Test PASSed.
[GitHub] spark issue #20663: [SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPage in or...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20663

**[Test build #87631 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87631/testReport)** for PR 20663 at commit [`d246df2`](https://github.com/apache/spark/commit/d246df283e9a900cf114fcdf0eee2951b1bd3713).
[GitHub] spark issue #20663: [SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPage in or...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20663

cc @vanzin
[GitHub] spark pull request #20663: [SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPag...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20663

[SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPage in order to avoid redundant code

## What changes were proposed in this pull request?

As suggested in #20651, the code is very redundant in `AllStagesPage` and modifying it is copy-and-paste work. We should avoid such a pattern, which is error-prone, and have a cleaner solution which avoids code redundancy.

## How was this patch tested?

existing UTs

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23475_followup

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20663.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20663

commit d246df283e9a900cf114fcdf0eee2951b1bd3713
Author: Marco Gaido
Date: 2018-02-23T11:01:05Z

    [SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPage in order to avoid redundant code
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20662

Merged build finished. Test FAILed.
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20662

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87628/
Test FAILed.
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20662

**[Test build #87628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87628/testReport)** for PR 20662 at commit [`619a371`](https://github.com/apache/spark/commit/619a37132ef8c23e63c66261ac525c6307cec23c).
* This patch **fails SparkR unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20611

Merged build finished. Test FAILed.
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20611

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87629/
Test FAILed.
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20611

**[Test build #87629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87629/testReport)** for PR 20611 at commit [`22c71dd`](https://github.com/apache/spark/commit/22c71dd13dc0718a0d7c8ca68e6d82eb0cd5b172).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20648

I think we do have an intention to return partial results, but there is no strict definition for it, and there seems to be no public document, so it's kind of a new feature. Since this is a non-trivial feature, the first question is: do we want this feature? There is no JIRA ticket requesting it, so I feel it is not urgent. We can refactor the code to make it clearer.
[GitHub] spark pull request #20624: [SPARK-23445] ColumnStat refactoring
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20624#discussion_r170226448

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -387,6 +390,101 @@ case class CatalogStatistics(
   }
 }

+/**
+ * This class of statistics for a column is used in [[CatalogTable]] to interact with metastore.
+ */
+case class CatalogColumnStat(
+    distinctCount: Option[BigInt] = None,
+    min: Option[String] = None,
+    max: Option[String] = None,
+    nullCount: Option[BigInt] = None,
+    avgLen: Option[Long] = None,
+    maxLen: Option[Long] = None,
+    histogram: Option[Histogram] = None) {
+
+  /**
+   * Returns a map from string to string that can be used to serialize the column stats.
+   * The key is the name of the column and name of the field (e.g. "colName.distinctCount"),
+   * and the value is the string representation for the value.
+   * min/max values are stored as Strings. They can be deserialized using
+   * [[ColumnStat.fromExternalString]].

--- End diff --

nit: Shall we move `fromExternalString` from `ColumnStat` to `CatalogColumnStat`?
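For illustration, the map produced by `toMap` for a column `col1` would look roughly as below. The exact key strings behind constants like `KEY_MIN_VALUE` are assumptions here; only the `colName.field` key shape and the `version` entry are taken from the quoted code:

```scala
Map(
  "col1.version" -> "1",
  "col1.distinctCount" -> "100",
  "col1.nullCount" -> "0",
  "col1.avgLen" -> "8",
  "col1.maxLen" -> "8",
  "col1.min" -> "1",   // min/max serialized as strings, per the scaladoc
  "col1.max" -> "100")
```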
[GitHub] spark pull request #20624: [SPARK-23445] ColumnStat refactoring
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20624#discussion_r170226982

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -387,6 +390,101 @@ case class CatalogStatistics(
   }
 }

+/**
+ * This class of statistics for a column is used in [[CatalogTable]] to interact with metastore.
+ */
+case class CatalogColumnStat(
+    distinctCount: Option[BigInt] = None,
+    min: Option[String] = None,
+    max: Option[String] = None,
+    nullCount: Option[BigInt] = None,
+    avgLen: Option[Long] = None,
+    maxLen: Option[Long] = None,
+    histogram: Option[Histogram] = None) {
+
+  /**
+   * Returns a map from string to string that can be used to serialize the column stats.
+   * The key is the name of the column and name of the field (e.g. "colName.distinctCount"),
+   * and the value is the string representation for the value.
+   * min/max values are stored as Strings. They can be deserialized using
+   * [[ColumnStat.fromExternalString]].
+   *
+   * As part of the protocol, the returned map always contains a key called "version".
+   * In the case min/max values are null (None), they won't appear in the map.
+   */
+  def toMap(colName: String): Map[String, String] = {
+    val map = new scala.collection.mutable.HashMap[String, String]
+    map.put(s"${colName}.${CatalogColumnStat.KEY_VERSION}", "1")
+    distinctCount.foreach { v =>
+      map.put(s"${colName}.${CatalogColumnStat.KEY_DISTINCT_COUNT}", v.toString)
+    }
+    nullCount.foreach { v =>
+      map.put(s"${colName}.${CatalogColumnStat.KEY_NULL_COUNT}", v.toString)
+    }
+    avgLen.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_AVG_LEN}", v.toString) }
+    maxLen.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_LEN}", v.toString) }
+    min.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MIN_VALUE}", v) }
+    max.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_VALUE}", v) }
+    histogram.foreach { h =>
+      map.put(s"${colName}.${CatalogColumnStat.KEY_HISTOGRAM}", HistogramSerializer.serialize(h))
+    }
+    map.toMap
+  }
+
+  /** Convert [[CatalogColumnStat]] to [[ColumnStat]]. */
+  def toPlanStat(

--- End diff --

`toPlanStat` is the same as `CatalogStatistics.toPlanStat`. Should we use `toColumnStat`?
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20662

**[Test build #87630 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87630/testReport)** for PR 20662 at commit [`bab27e6`](https://github.com/apache/spark/commit/bab27e6ebe415508101b7f09ba9f3f04801aa818).
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20662

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1017/
Test PASSed.
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20662

Merged build finished. Test PASSed.
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20611

**[Test build #87629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87629/testReport)** for PR 20611 at commit [`22c71dd`](https://github.com/apache/spark/commit/22c71dd13dc0718a0d7c8ca68e6d82eb0cd5b172).
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170208631

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala ---
@@ -77,31 +79,32 @@ class MicroBatchExecution(
     sparkSession.sqlContext.conf.disabledV2StreamingMicroBatchReaders.split(",")

   val _logicalPlan = analyzedPlan.transform {
-    case streamingRelation@StreamingRelation(dataSourceV1, sourceName, output) =>
-      toExecutionRelationMap.getOrElseUpdate(streamingRelation, {
+    case s @ StreamingRelation(dsV1, sourceName, output) =>

--- End diff --

yea I was touching the code here to retain the `DataSourceV2` instance, and fixed the code style.
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170208349

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -23,11 +23,11 @@ import org.apache.spark.sql.execution.SparkPlan

 object DataSourceV2Strategy extends Strategy {
   override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-    case relation: DataSourceV2Relation =>
-      DataSourceV2ScanExec(relation.output, relation.reader) :: Nil
+    case r: DataSourceV2Relation =>

--- End diff --

`relation` is used many times in the following lines; renaming it to `r` shortens the code below.
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20662

**[Test build #87628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87628/testReport)** for PR 20662 at commit [`619a371`](https://github.com/apache/spark/commit/619a37132ef8c23e63c66261ac525c6307cec23c).
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20662

Merged build finished. Test PASSed.
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20662

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1016/
Test PASSed.
[GitHub] spark issue #20651: [SPARK-23475][UI] Show also skipped stages
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20651

Thanks @vanzin, I created the PR for the backport.
[GitHub] spark pull request #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20662

[SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

## What changes were proposed in this pull request?

SPARK-20648 introduced the status `SKIPPED` for the stages. On the UI, previously, skipped stages were shown as `PENDING`; after this change, they are not shown on the UI. The PR introduces a new section in order to show also `SKIPPED` stages in a proper table.

Manual backport to branch-2.3.

## How was this patch tested?

added UT

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23475_2.3

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20662.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20662

commit 619a37132ef8c23e63c66261ac525c6307cec23c
Author: Marco Gaido
Date: 2018-02-22T19:00:12Z

    [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

    SPARK-20648 introduced the status `SKIPPED` for the stages. On the UI, previously, skipped stages were shown as `PENDING`; after this change, they are not shown on the UI. The PR introduce a new section in order to show also `SKIPPED` stages in a proper table.

    manual tests

    Author: Marco Gaido

    Closes #20651 from mgaido91/SPARK-23475.
[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20662

cc @vanzin
[GitHub] spark issue #10942: [SPARK-12850] [SQL] Support Bucket Pruning (Predicate Pu...
Github user lonehacker commented on the issue: https://github.com/apache/spark/pull/10942

Thanks @cloud-fan. I couldn't find a JIRA that tracks this feature; could you help with that? This feature is very important for our use case, so it would be great to get any info on when it will be available.
[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20658

**[Test build #87627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87627/testReport)** for PR 20658 at commit [`d7e03cd`](https://github.com/apache/spark/commit/d7e03cd507b58fda8830b580390f5224ee7c8d65).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20658

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87627/
Test FAILed.
[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20658

Merged build finished. Test FAILed.
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647

Merged build finished. Test PASSed.
[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20658

**[Test build #87627 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87627/testReport)** for PR 20658 at commit [`d7e03cd`](https://github.com/apache/spark/commit/d7e03cd507b58fda8830b580390f5224ee7c8d65).
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87624/
Test PASSed.
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20647

**[Test build #87624 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87624/testReport)** for PR 20647 at commit [`dbee281`](https://github.com/apache/spark/commit/dbee2813accd4c8f5937b28eb9142cc6a50f8c6a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20658

ok to test
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170185948

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala ---
@@ -77,31 +79,32 @@ class MicroBatchExecution(
     sparkSession.sqlContext.conf.disabledV2StreamingMicroBatchReaders.split(",")

   val _logicalPlan = analyzedPlan.transform {
-    case streamingRelation@StreamingRelation(dataSourceV1, sourceName, output) =>
-      toExecutionRelationMap.getOrElseUpdate(streamingRelation, {
+    case s @ StreamingRelation(dsV1, sourceName, output) =>

--- End diff --

If you are touching that specific code then it's fine to fix the style, but in general I tend to agree that it makes the diff harder to read and the commit harder to back-port if you include spurious changes. I've even seen guidelines that specifically prohibit fixing style just to fix style, since it obfuscates the history.