[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...

2018-02-23 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20648
  
> allows the length of tokens to be shorter than its schema, putting nulls
(or NA) into missing fields

Actually I also recall this is a valid case for CSV, and I remember that 
we did this intentionally. How much can we clean up if we want to keep this 
behavior in CSV? If it's not a lot, maybe we don't need to bother with 
refactoring.
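
For reference, the CSV case under discussion looks roughly like this (a minimal sketch, assuming an active `spark` session; the data is hypothetical):

```scala
import spark.implicits._

// Schema has two columns, but the second input line carries only one token.
val df = spark.read
  .schema("a INT, b STRING")
  .csv(Seq("1,x", "2").toDS())

df.show()
// The short row is accepted, with the missing field filled with null:
// +---+----+
// |  a|   b|
// +---+----+
// |  1|   x|
// |  2|null|
// +---+----+
```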


---




[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...

2018-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20572#discussion_r170279504
  
--- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala ---
@@ -64,6 +69,41 @@ class KafkaRDDSuite extends SparkFunSuite with BeforeAndAfterAll {
 
   private val preferredHosts = LocationStrategies.PreferConsistent
 
+  private def compactLogs(topic: String, partition: Int, messages: Array[(String, String)]) {
+    val mockTime = new MockTime()
+    // LogCleaner in 0.10 version of Kafka is still expecting the old TopicAndPartition api
+    val logs = new Pool[TopicAndPartition, Log]()
+    val logDir = kafkaTestUtils.brokerLogDir
+    val dir = new java.io.File(logDir, topic + "-" + partition)
+    dir.mkdirs()
+    val logProps = new ju.Properties()
+    logProps.put(LogConfig.CleanupPolicyProp, LogConfig.Compact)
+    logProps.put(LogConfig.MinCleanableDirtyRatioProp, 0.1f: java.lang.Float)
+    val log = new Log(
+      dir,
+      LogConfig(logProps),
+      0L,
+      mockTime.scheduler,
+      mockTime
+    )
+    messages.foreach { case (k, v) =>
+        val msg = new ByteBufferMessageSet(
--- End diff --

Unindent one level?


---




[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...

2018-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20572#discussion_r170278317
  
--- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala ---
@@ -162,17 +162,22 @@ private[kafka010] class KafkaTestUtils extends Logging {
   }
 
   /** Create a Kafka topic and wait until it is propagated to the whole cluster */
-  def createTopic(topic: String, partitions: Int): Unit = {
-    AdminUtils.createTopic(zkUtils, topic, partitions, 1)
+  def createTopic(topic: String, partitions: Int, config: Properties): Unit = {
+    AdminUtils.createTopic(zkUtils, topic, partitions, 1, config)
     // wait until metadata is propagated
     (0 until partitions).foreach { p =>
       waitUntilMetadataIsPropagated(topic, p)
     }
   }
 
+  /** Create a Kafka topic and wait until it is propagated to the whole cluster */
+  def createTopic(topic: String, partitions: Int): Unit = {
+    createTopic(topic, partitions, new Properties)
+  }
+
   /** Create a Kafka topic and wait until it is propagated to the whole cluster */
   def createTopic(topic: String): Unit = {
-    createTopic(topic, 1)
+    createTopic(topic, 1, new Properties)
--- End diff --

Nit: `new Properties()`


---




[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...

2018-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20572#discussion_r170278931
  
--- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala ---
@@ -71,25 +69,62 @@ class CachedKafkaConsumer[K, V] private(
     }
 
     if (!buffer.hasNext()) { poll(timeout) }
-    assert(buffer.hasNext(),
+    require(buffer.hasNext(),
       s"Failed to get records for $groupId $topic $partition $offset after polling for $timeout")
     var record = buffer.next()
 
     if (record.offset != offset) {
       logInfo(s"Buffer miss for $groupId $topic $partition $offset")
       seek(offset)
       poll(timeout)
-      assert(buffer.hasNext(),
+      require(buffer.hasNext(),
         s"Failed to get records for $groupId $topic $partition $offset after polling for $timeout")
       record = buffer.next()
-      assert(record.offset == offset,
-        s"Got wrong record for $groupId $topic $partition even after seeking to offset $offset")
+      require(record.offset == offset,
+        s"Got wrong record for $groupId $topic $partition even after seeking to offset $offset " +
+          s"got offset ${record.offset} instead. If this is a compacted topic, consider enabling " +
+          "spark.streaming.kafka.allowNonConsecutiveOffsets"
+      )
     }
 
     nextOffset = offset + 1
     record
   }
 
+  /**
+   * Start a batch on a compacted topic
+   */
+  def compactedStart(offset: Long, timeout: Long): Unit = {
+    logDebug(s"compacted start $groupId $topic $partition starting $offset")
+    // This seek may not be necessary, but it's hard to tell due to gaps in compacted topics
+    if (offset != nextOffset) {
+      logInfo(s"Initial fetch for compacted $groupId $topic $partition $offset")
+      seek(offset)
+      poll(timeout)
+    }
+  }
+
+  /**
+   * Get the next record in the batch from a compacted topic.
+   * Assumes compactedStart has been called first, and ignores gaps.
+   */
+  def compactedNext(timeout: Long): ConsumerRecord[K, V] = {
+    if (!buffer.hasNext()) { poll(timeout) }
--- End diff --

Nit: I'd expand this onto two lines
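
That is, something like:

```scala
if (!buffer.hasNext()) {
  poll(timeout)
}
```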


---




[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...

2018-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20572#discussion_r170278685
  
--- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/mocks/MockScheduler.scala ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka010.mocks
+
+import java.util.concurrent.TimeUnit
+
+import scala.collection.mutable.PriorityQueue
+
+import kafka.utils.{Scheduler, Time}
+
+/**
+ * A mock scheduler that executes tasks synchronously using a mock time instance.
+ * Tasks are executed synchronously when the time is advanced.
+ * This class is meant to be used in conjunction with MockTime.
+ *
+ * Example usage
+ * <code>
+ *   val time = new MockTime
+ *   time.scheduler.schedule("a task", println("hello world: " + time.milliseconds), delay = 1000)
+ *   time.sleep(1001) // this should cause our scheduled task to fire
+ * </code>
+ *
+ * Incrementing the time to the exact next execution time of a task will result in that task
+ * executing (it as if execution itself takes no time).
+ */
+private[kafka010] class MockScheduler(val time: Time) extends Scheduler {
+
+  /* a priority queue of tasks ordered by next execution time */
+  var tasks = new PriorityQueue[MockTask]()
+
+  def isStarted: Boolean = true
+
+  def startup(): Unit = {}
+
+  def shutdown(): Unit = synchronized {
+    tasks.foreach(_.fun())
+    tasks.clear()
+  }
+
+  /**
+   * Check for any tasks that need to execute. Since this is a mock scheduler this check only occurs
+   * when this method is called and the execution happens synchronously in the calling thread.
+   * If you are using the scheduler associated with a MockTime instance this call
+   * will be triggered automatically.
+   */
+  def tick() {
+    this synchronized {
+      val now = time.milliseconds
+      while(!tasks.isEmpty && tasks.head.nextExecution <= now) {
+        /* pop and execute the task with the lowest next execution time */
+        val curr = tasks.dequeue
+        curr.fun()
+        /* if the task is periodic, reschedule it and re-enqueue */
+        if(curr.periodic) {
+          curr.nextExecution += curr.period
+          this.tasks += curr
+        }
+      }
+    }
+  }
+
+  def schedule(
+      name: String,
+      fun: () => Unit,
+      delay: Long = 0,
+      period: Long = -1,
+      unit: TimeUnit = TimeUnit.MILLISECONDS) {
+    this synchronized {
--- End diff --

I think I'd still write such methods as:

```
def foo(): T = synchronized {
   ...
}
```

This is how the rest of the code base does it (and other Scala code I've 
seen).
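
Applied to `tick()` from the diff above, that would look like (a sketch; body unchanged):

```scala
def tick(): Unit = synchronized {
  val now = time.milliseconds
  while (!tasks.isEmpty && tasks.head.nextExecution <= now) {
    // pop and execute the task with the lowest next execution time
    val curr = tasks.dequeue
    curr.fun()
    // if the task is periodic, reschedule it and re-enqueue
    if (curr.periodic) {
      curr.nextExecution += curr.period
      this.tasks += curr
    }
  }
}
```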


---




[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...

2018-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20572#discussion_r170279950
  
--- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala ---
@@ -64,6 +69,41 @@ class KafkaRDDSuite extends SparkFunSuite with BeforeAndAfterAll {
 
   private val preferredHosts = LocationStrategies.PreferConsistent
 
+  private def compactLogs(topic: String, partition: Int, messages: Array[(String, String)]) {
+    val mockTime = new MockTime()
+    // LogCleaner in 0.10 version of Kafka is still expecting the old TopicAndPartition api
+    val logs = new Pool[TopicAndPartition, Log]()
+    val logDir = kafkaTestUtils.brokerLogDir
+    val dir = new java.io.File(logDir, topic + "-" + partition)
+    dir.mkdirs()
+    val logProps = new ju.Properties()
+    logProps.put(LogConfig.CleanupPolicyProp, LogConfig.Compact)
+    logProps.put(LogConfig.MinCleanableDirtyRatioProp, 0.1f: java.lang.Float)
--- End diff --

Do you have to 'cast' this to a Java Float object to get it to compile?
`java.lang.Float.valueOf(0.1f)` works too I guess, but equally weird. OK if 
it's required.
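
For context, the ascription is likely needed because `Properties.put` takes `Object` arguments and the Scala compiler will not apply the `Float => java.lang.Float` implicit view when the expected type is just `Object` (a sketch; the key string is immaterial here):

```scala
val logProps = new java.util.Properties()

// Explicit boxing via type ascription...
logProps.put("min.cleanable.dirty.ratio", 0.1f: java.lang.Float)
// ...or via valueOf; both compile, whereas a plain `0.1f` would not.
logProps.put("min.cleanable.dirty.ratio", java.lang.Float.valueOf(0.1f))
```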


---




[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...

2018-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20572#discussion_r170278078
  
--- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala ---
@@ -64,6 +69,41 @@ class KafkaRDDSuite extends SparkFunSuite with BeforeAndAfterAll {
 
   private val preferredHosts = LocationStrategies.PreferConsistent
 
+  private def compactLogs(topic: String, partition: Int, messages: Array[(String, String)]) {
+    val mockTime = new MockTime()
+    // LogCleaner in 0.10 version of Kafka is still expecting the old TopicAndPartition api
+    val logs = new Pool[TopicAndPartition, Log]()
+    val logDir = kafkaTestUtils.brokerLogDir
+    val dir = new java.io.File(logDir, topic + "-" + partition)
--- End diff --

Import `File` and other `java.*` classes? Maybe I'm missing a name conflict.
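
That is, simply:

```scala
import java.io.File
// ...
val dir = new File(logDir, topic + "-" + partition)
```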


---




[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...

2018-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20572#discussion_r170277915
  
--- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala ---
@@ -172,57 +187,138 @@ private[spark] class KafkaRDD[K, V](
 
   override def compute(thePart: Partition, context: TaskContext): Iterator[ConsumerRecord[K, V]] = {
     val part = thePart.asInstanceOf[KafkaRDDPartition]
-    assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
+    require(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
     if (part.fromOffset == part.untilOffset) {
       logInfo(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
         s"skipping ${part.topic} ${part.partition}")
       Iterator.empty
     } else {
-      new KafkaRDDIterator(part, context)
+      logInfo(s"Computing topic ${part.topic}, partition ${part.partition} " +
+        s"offsets ${part.fromOffset} -> ${part.untilOffset}")
+      if (compacted) {
+        new CompactedKafkaRDDIterator[K, V](
+          part,
+          context,
+          kafkaParams,
+          useConsumerCache,
+          pollTimeout,
+          cacheInitialCapacity,
+          cacheMaxCapacity,
+          cacheLoadFactor
+        )
+      } else {
+        new KafkaRDDIterator[K, V](
+          part,
+          context,
+          kafkaParams,
+          useConsumerCache,
+          pollTimeout,
+          cacheInitialCapacity,
+          cacheMaxCapacity,
+          cacheLoadFactor
+        )
+      }
     }
   }
+}
 
-  /**
-   * An iterator that fetches messages directly from Kafka for the offsets in partition.
-   * Uses a cached consumer where possible to take advantage of prefetching
-   */
-  private class KafkaRDDIterator(
-      part: KafkaRDDPartition,
-      context: TaskContext) extends Iterator[ConsumerRecord[K, V]] {
-
-    logInfo(s"Computing topic ${part.topic}, partition ${part.partition} " +
-      s"offsets ${part.fromOffset} -> ${part.untilOffset}")
-
-    val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String]
+/**
+ * An iterator that fetches messages directly from Kafka for the offsets in partition.
+ * Uses a cached consumer where possible to take advantage of prefetching
+ */
+private class KafkaRDDIterator[K, V](
+  part: KafkaRDDPartition,
+  context: TaskContext,
+  kafkaParams: ju.Map[String, Object],
+  useConsumerCache: Boolean,
+  pollTimeout: Long,
+  cacheInitialCapacity: Int,
+  cacheMaxCapacity: Int,
+  cacheLoadFactor: Float
+) extends Iterator[ConsumerRecord[K, V]] {
+
+  val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String]
+
+  context.addTaskCompletionListener{ context => closeIfNeeded() }
--- End diff --

This could be `...(_ => closeIfNeeded())`
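
That is:

```scala
context.addTaskCompletionListener(_ => closeIfNeeded())
```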


---




[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...

2018-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20572#discussion_r170279150
  
--- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala ---
@@ -87,47 +89,60 @@ private[spark] class KafkaRDD[K, V](
     }.toArray
   }
 
-  override def count(): Long = offsetRanges.map(_.count).sum
+  override def count(): Long =
+    if (compacted) {
+      super.count()
+    } else {
+      offsetRanges.map(_.count).sum
+    }
 
   override def countApprox(
       timeout: Long,
       confidence: Double = 0.95
-  ): PartialResult[BoundedDouble] = {
-    val c = count
-    new PartialResult(new BoundedDouble(c, 1.0, c, c), true)
-  }
+  ): PartialResult[BoundedDouble] =
+    if (compacted) {
+      super.countApprox(timeout, confidence)
+    } else {
+      val c = count
+      new PartialResult(new BoundedDouble(c, 1.0, c, c), true)
+    }
 
-  override def isEmpty(): Boolean = count == 0L
+  override def isEmpty(): Boolean =
+    if (compacted) {
+      super.isEmpty()
+    } else {
+      count == 0L
+    }
 
-  override def take(num: Int): Array[ConsumerRecord[K, V]] = {
-    val nonEmptyPartitions = this.partitions
-      .map(_.asInstanceOf[KafkaRDDPartition])
-      .filter(_.count > 0)
+  override def take(num: Int): Array[ConsumerRecord[K, V]] =
+    if (compacted) {
+      super.take(num)
+    } else {
+      val nonEmptyPartitions = this.partitions
+        .map(_.asInstanceOf[KafkaRDDPartition])
+        .filter(_.count > 0)
 
-    if (num < 1 || nonEmptyPartitions.isEmpty) {
-      return new Array[ConsumerRecord[K, V]](0)
-    }
+      if (num < 1 || nonEmptyPartitions.isEmpty) {
--- End diff --

I guess you could check `num < 1` before the map/filter, but it's trivial.
You could write `return Array.empty[ConsumerRecord[K,V]]` too; again 
trivial.
Since this is existing code I could see not touching it as well.
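
For what it's worth, the first suggestion would look roughly like this (a sketch; the elided tail stands for the existing logic, so it is not compilable as-is):

```scala
override def take(num: Int): Array[ConsumerRecord[K, V]] =
  if (compacted) {
    super.take(num)
  } else if (num < 1) {
    Array.empty[ConsumerRecord[K, V]]
  } else {
    val nonEmptyPartitions = this.partitions
      .map(_.asInstanceOf[KafkaRDDPartition])
      .filter(_.count > 0)
    // ... existing logic over nonEmptyPartitions ...
  }
```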


---




[GitHub] spark pull request #20664: [SPARK-23496][CORE] Locality of coalesced partiti...

2018-02-23 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20664#discussion_r170279656
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1129,6 +1129,36 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
     }.collect()
   }
 
+  test("SPARK-23496: order of input partitions can result in severe skew in coalesce") {
--- End diff --

I see, thanks, sorry, I missed it


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-23 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/20553
  
IIUC, `spark.kubernetes.executor.cores` here is just a special case of 
`spark.executor.cores` for the k8s backend: you would still have to handle 
float values when reading `spark.kubernetes.executor.cores`, for instance 
in dynamic allocation. If that is the case, I don't see much benefit in 
bringing in a new conf here.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-23 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/20553
  
also cc @cloud-fan @jerryshao 


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-23 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/20553
  
How do we plan to support dynamic allocation with k8s? Should we read 
`spark.executor.cores` or `spark.kubernetes.executor.cores` ?


---




[GitHub] spark pull request #20664: [SPARK-23496][CORE] Locality of coalesced partiti...

2018-02-23 Thread ala
Github user ala commented on a diff in the pull request:

https://github.com/apache/spark/pull/20664#discussion_r170277224
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1129,6 +1129,36 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
     }.collect()
   }
 
+  test("SPARK-23496: order of input partitions can result in severe skew in coalesce") {
--- End diff --

The test is in fact deterministic. The seed is already fixed here: 
https://github.com/apache/spark/blob/049f243c59737699fee54fdc9d65cbd7c788032a/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L163
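
From memory, the referenced line pins the seed roughly like this (approximate; see the link for the exact code):

```scala
val rnd = new scala.util.Random(7919) // keep this class deterministic
```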


---




[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...

2018-02-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20648
  
+1 for disallowing it anyway if it was Wenchen's opinion too. Please go 
ahead. Will help double check anyway.


---




[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...

2018-02-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20648
  
Yup, +1 for starting this by disallowing it, but to my knowledge R's 
read.csv allows the length of tokens to be shorter than its schema, putting 
nulls (or NA) into missing fields, as a valid case.

I was thinking of disallowing the partial results but allowing tokens 
shorter than the schema as a valid case in CSV.

I need to double check R's read.csv behaviour and the current behaviour, but 
that was roughly my thought.


---




[GitHub] spark pull request #20664: [SPARK-23496][CORE] Locality of coalesced partiti...

2018-02-23 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20664#discussion_r170271696
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1129,6 +1129,36 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
     }.collect()
   }
 
+  test("SPARK-23496: order of input partitions can result in severe skew in coalesce") {
--- End diff --

this test looks to me like a good candidate for flakiness, since we are 
picking random numbers. Can we set the seed in order to avoid this?


---




[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...

2018-02-23 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20648
  
@HyukjinKwon Judging from the documentation of `DataFrameReader.csv`, the 
behavior of the CSV reader isn't consistent with what is documented:

```
`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts
the malformed string into a field configured by `columnNameOfCorruptRecord`.
```

With respect to the document, I think we may need to disable it for CSV 
too. What do you think?
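
For illustration, the documented `PERMISSIVE` behavior would look like this (a minimal sketch, assuming an active `spark` session and hypothetical data):

```scala
import spark.implicits._

// Per the quoted doc, the malformed second line should yield a = null, b = null,
// with the raw text preserved in _corrupt_record.
val df = spark.read
  .schema("a INT, b STRING, _corrupt_record STRING")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv(Seq("1,x", "not-an-int,y").toDS())
```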



---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20662
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87630/
Test PASSed.


---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20662
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20662
  
**[Test build #87630 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87630/testReport)** for PR 20662 at commit [`bab27e6`](https://github.com/apache/spark/commit/bab27e6ebe415508101b7f09ba9f3f04801aa818).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20664: [SPARK-23496][CORE] Locality of coalesced partitions can...

2018-02-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20664
  
**[Test build #87632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87632/testReport)** for PR 20664 at commit [`6d67dfc`](https://github.com/apache/spark/commit/6d67dfc1d4c012492b97873beaac5a7cbfd6f55a).


---




[GitHub] spark issue #20664: [SPARK-23496][CORE] Locality of coalesced partitions can...

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20664
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1019/
Test PASSed.


---




[GitHub] spark issue #20664: [SPARK-23496][CORE] Locality of coalesced partitions can...

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20664
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20664: [SPARK-23496][CORE] Locality of coalesced partitions can...

2018-02-23 Thread ala
Github user ala commented on the issue:

https://github.com/apache/spark/pull/20664
  
@hvanhovell 


---




[GitHub] spark pull request #20664: [SPARK-23496][CORE] Locality of coalesced partiti...

2018-02-23 Thread ala
GitHub user ala opened a pull request:

https://github.com/apache/spark/pull/20664

[SPARK-23496][CORE] Locality of coalesced partitions can be severely skewed 
by the order of input partitions

## What changes were proposed in this pull request?

The algorithm in `DefaultPartitionCoalescer.setupGroups` is responsible for 
picking preferred locations for coalesced partitions. It analyzes the preferred 
locations of input partitions. It starts by trying to create one partition for 
each unique location in the input. However, if the requested number of 
coalesced partitions is higher than the number of unique locations, it has to 
pick duplicate locations.

Previously, the duplicate locations would be picked by iterating over the 
input partitions in order, and copying their preferred locations to coalesced 
partitions. If the input partitions were clustered by location, this could 
result in severe skew.

With the fix, instead of iterating over the list of input partitions in 
order, we pick them at random. It's not perfectly balanced, but it's much 
better.
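
Roughly, the change amounts to the following (illustrative only; the names are not from the actual patch):

```scala
// Before: duplicate preferred locations were copied from the input partitions
// in input order, so inputs clustered by location produced clustered outputs.
// After: draw candidates from a random permutation of the input partitions.
val rnd = new scala.util.Random(seed) // fixed seed keeps coalescing deterministic
val candidates = rnd.shuffle(inputPartitions.indices.toList)
```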

## How was this patch tested?

Unit test reproducing the behavior was added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ala/spark SPARK-23496

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20664.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20664


commit 6d67dfc1d4c012492b97873beaac5a7cbfd6f55a
Author: Ala Luszczak 
Date:   2018-02-23T14:37:19Z

Fix SPARK-23496.




---




[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...

2018-02-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20648
  
Yup, it's unsupported in JSON but CSV supports it. Do you mean to disallow 
it in CSV too, or simply to clean up the JSON code path?


---




[GitHub] spark pull request #20649: [SPARK-23462][SQL] improve missing field error me...

2018-02-23 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20649#discussion_r170265519
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StructTypeSuite.scala ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.types.StructType
+
+class StructTypeSuite extends SparkFunSuite {
+
+  test("SPARK-23462 lookup a single missing field should output existing fields") {
--- End diff --

nit: Would it be better to combine these three tests into one?  It could 
reduce the total test time.
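
For example, a combined version could look like this (a sketch keeping the original assertions):

```scala
test("SPARK-23462 lookup of missing fields should output existing fields") {
  val s = StructType.fromDDL("a INT, b STRING")
  val e1 = intercept[IllegalArgumentException](s("c")).getMessage
  assert(e1.contains("Available fields: a,b"))
  val e2 = intercept[IllegalArgumentException](s(Set("a", "c"))).getMessage
  assert(e2.contains("Available fields: a,b"))
  val e3 = intercept[IllegalArgumentException](s.fieldIndex("c")).getMessage
  assert(e3.contains("Available fields: a,b"))
}
```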


---




[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...

2018-02-23 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20648
  
I'll close this PR and create another PR to refactor JSON parser and 
related codes. Thanks @cloud-fan and @HyukjinKwon.


---




[GitHub] spark pull request #20648: [SPARK-23448][SQL] JSON parser should return part...

2018-02-23 Thread viirya
Github user viirya closed the pull request at:

https://github.com/apache/spark/pull/20648


---




[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...

2018-02-23 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20648
  
According to an offline discussion with @cloud-fan, partial results are not 
supported at all now. We should refactor the code to make this clear and 
reduce confusion.


---




[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...

2018-02-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20648
  
How about we start by disallowing partial results altogether, documenting 
the behaviour, and matching the CSV behaviour to R's `read.csv(...)` (in 
terms of which cases are errors and which are not)?


---




[GitHub] spark pull request #20649: [SPARK-23462][SQL] improve missing field error me...

2018-02-23 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20649#discussion_r170256168
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -297,7 +300,9 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
     */
   def fieldIndex(name: String): Int = {
     nameToIndex.getOrElse(name,
-      throw new IllegalArgumentException(s"""Field "$name" does not exist."""))
+      throw new IllegalArgumentException(
+        s"""Field "$name" does not exist.
+           |Available fields: ${fieldNamesSet.mkString(",")}""".stripMargin))
--- End diff --

ditto


---




[GitHub] spark pull request #20649: [SPARK-23462][SQL] improve missing field error me...

2018-02-23 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20649#discussion_r170256000
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -271,7 +271,9 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
     */
   def apply(name: String): StructField = {
     nameToField.getOrElse(name,
-      throw new IllegalArgumentException(s"""Field "$name" does not exist."""))
+      throw new IllegalArgumentException(
+        s"""Field "$name" does not exist.
+           |Available fields: ${fieldNamesSet.mkString(",")}""".stripMargin))
--- End diff --

nit: can we add a space after the comma? I think it can help readability
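
That is:

```scala
s"""Field "$name" does not exist.
   |Available fields: ${fieldNamesSet.mkString(", ")}""".stripMargin
```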


---




[GitHub] spark pull request #20649: [SPARK-23462][SQL] improve missing field error me...

2018-02-23 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20649#discussion_r170256107
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -284,7 +286,8 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
     val nonExistFields = names -- fieldNamesSet
     if (nonExistFields.nonEmpty) {
       throw new IllegalArgumentException(
-        s"Field ${nonExistFields.mkString(",")} does not exist.")
+        s"""Fields ${nonExistFields.mkString(",")} does not exist.
--- End diff --

ditto


---




[GitHub] spark pull request #20649: [SPARK-23462][SQL] improve missing field error me...

2018-02-23 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20649#discussion_r170256030
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -284,7 +286,8 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
     val nonExistFields = names -- fieldNamesSet
     if (nonExistFields.nonEmpty) {
       throw new IllegalArgumentException(
-        s"Field ${nonExistFields.mkString(",")} does not exist.")
+        s"""Fields ${nonExistFields.mkString(",")} does not exist.
+           |Available fields: ${fieldNamesSet.mkString(",")}""".stripMargin)
--- End diff --

ditto


---




[GitHub] spark pull request #20649: [SPARK-23462][SQL] improve missing field error me...

2018-02-23 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20649#discussion_r170256385
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StructTypeSuite.scala ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.types.StructType
+
+class StructTypeSuite extends SparkFunSuite {
+
+  test("SPARK-23462 lookup a single missing field should output existing fields") {
+    val s = StructType.fromDDL("a INT, b STRING")
+    val e = intercept[IllegalArgumentException](s("c")).getMessage
+    assert(e.contains("Available fields: a,b"))
+  }
+
+  test("SPARK-23462 lookup a set of missing fields should output existing fields") {
+    val s = StructType.fromDDL("a INT, b STRING")
+    val e = intercept[IllegalArgumentException](s(Set("a", "c"))).getMessage
+    assert(e.contains("Available fields: a,b"))
+  }
+
+  test("SPARK-23462 lookup fieldIndex for missing field should output existing fields") {
+    val s = StructType.fromDDL("a INT, b STRING")
+    val e = intercept[IllegalArgumentException](s.fieldIndex("c")).getMessage
+    assert(e.contains("Available fields: a,b"))
+  }
+
--- End diff --

nit: this empty line should be removed


---




[GitHub] spark issue #20653: [SPARK-23459][SQL] Improve the error message when unknow...

2018-02-23 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/20653
  
LGTM


---




[GitHub] spark issue #20663: [SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPage in or...

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20663
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20663: [SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPage in or...

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20663
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1018/
Test PASSed.


---




[GitHub] spark issue #20663: [SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPage in or...

2018-02-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20663
  
**[Test build #87631 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87631/testReport)** for PR 20663 at commit [`d246df2`](https://github.com/apache/spark/commit/d246df283e9a900cf114fcdf0eee2951b1bd3713).


---




[GitHub] spark issue #20663: [SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPage in or...

2018-02-23 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/20663
  
cc @vanzin 


---




[GitHub] spark pull request #20663: [SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPag...

2018-02-23 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/20663

[SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPage in order to avoid 
redundant code

## What changes were proposed in this pull request?

As suggested in #20651, the code in `AllStagesPage` is very redundant, and 
modifying it is copy-and-paste work. We should avoid such a pattern, which is 
error prone, and adopt a cleaner solution that avoids code redundancy.

## How was this patch tested?

existing UTs


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-23475_followup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20663.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20663


commit d246df283e9a900cf114fcdf0eee2951b1bd3713
Author: Marco Gaido 
Date:   2018-02-23T11:01:05Z

[SPARK-23475][UI][FOLLOWUP] Refactor AllStagesPage in order to avoid 
redundant code




---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20662
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20662
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87628/
Test FAILed.


---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20662
  
**[Test build #87628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87628/testReport)** for PR 20662 at commit [`619a371`](https://github.com/apache/spark/commit/619a37132ef8c23e63c66261ac525c6307cec23c).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20611
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20611
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87629/
Test FAILed.


---




[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...

2018-02-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20611
  
**[Test build #87629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87629/testReport)** for PR 20611 at commit [`22c71dd`](https://github.com/apache/spark/commit/22c71dd13dc0718a0d7c8ca68e6d82eb0cd5b172).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...

2018-02-23 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20648
  
I think we do have an intention to return partial results, but there is no 
strict definition for it, and there seems to be no public documentation, so 
it's kind of a new feature.

Since this is a non-trivial feature, the first question is: do we want it at 
all? There is no JIRA ticket requesting this feature, so I feel it is not 
urgent. We can refactor the code to make it clearer.


---




[GitHub] spark pull request #20624: [SPARK-23445] ColumnStat refactoring

2018-02-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20624#discussion_r170226448
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -387,6 +390,101 @@ case class CatalogStatistics(
   }
 }
 
+/**
+ * This class of statistics for a column is used in [[CatalogTable]] to interact with metastore.
+ */
+case class CatalogColumnStat(
+  distinctCount: Option[BigInt] = None,
+  min: Option[String] = None,
+  max: Option[String] = None,
+  nullCount: Option[BigInt] = None,
+  avgLen: Option[Long] = None,
+  maxLen: Option[Long] = None,
+  histogram: Option[Histogram] = None) {
+
+  /**
+   * Returns a map from string to string that can be used to serialize the column stats.
+   * The key is the name of the column and name of the field (e.g. "colName.distinctCount"),
+   * and the value is the string representation for the value.
+   * min/max values are stored as Strings. They can be deserialized using
+   * [[ColumnStat.fromExternalString]].
--- End diff --

nit: Shall we move `fromExternalString` from `ColumnStat` to 
`CatalogColumnStat`?


---




[GitHub] spark pull request #20624: [SPARK-23445] ColumnStat refactoring

2018-02-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20624#discussion_r170226982
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -387,6 +390,101 @@ case class CatalogStatistics(
   }
 }
 
+/**
+ * This class of statistics for a column is used in [[CatalogTable]] to interact with metastore.
+ */
+case class CatalogColumnStat(
+  distinctCount: Option[BigInt] = None,
+  min: Option[String] = None,
+  max: Option[String] = None,
+  nullCount: Option[BigInt] = None,
+  avgLen: Option[Long] = None,
+  maxLen: Option[Long] = None,
+  histogram: Option[Histogram] = None) {
+
+  /**
+   * Returns a map from string to string that can be used to serialize the column stats.
+   * The key is the name of the column and name of the field (e.g. "colName.distinctCount"),
+   * and the value is the string representation for the value.
+   * min/max values are stored as Strings. They can be deserialized using
+   * [[ColumnStat.fromExternalString]].
+   *
+   * As part of the protocol, the returned map always contains a key called "version".
+   * In the case min/max values are null (None), they won't appear in the map.
+   */
+  def toMap(colName: String): Map[String, String] = {
+    val map = new scala.collection.mutable.HashMap[String, String]
+    map.put(s"${colName}.${CatalogColumnStat.KEY_VERSION}", "1")
+    distinctCount.foreach { v =>
+      map.put(s"${colName}.${CatalogColumnStat.KEY_DISTINCT_COUNT}", v.toString)
+    }
+    nullCount.foreach { v =>
+      map.put(s"${colName}.${CatalogColumnStat.KEY_NULL_COUNT}", v.toString)
+    }
+    avgLen.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_AVG_LEN}", v.toString) }
+    maxLen.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_LEN}", v.toString) }
+    min.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MIN_VALUE}", v) }
+    max.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_VALUE}", v) }
+    histogram.foreach { h =>
+      map.put(s"${colName}.${CatalogColumnStat.KEY_HISTOGRAM}", HistogramSerializer.serialize(h))
+    }
+    map.toMap
+  }
+
+  /** Convert [[CatalogColumnStat]] to [[ColumnStat]]. */
+  def toPlanStat(
--- End diff --

`toPlanStat` is the same as `CatalogStatistics.toPlanStat`. Should we use 
`toColumnStat`?
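
For reference, the serialized form would be used roughly like this (a sketch; the exact key strings depend on the `KEY_*` constants, and the values are hypothetical):

```scala
val stat = CatalogColumnStat(
  distinctCount = Some(BigInt(100)),
  min = Some("0"),
  max = Some("99"),
  nullCount = Some(BigInt(0)),
  avgLen = Some(4L),
  maxLen = Some(4L))

// Keys are prefixed with the column name, e.g. "age.version" -> "1",
// "age.distinctCount" -> "100", ... (assuming those constant values).
val serialized: Map[String, String] = stat.toMap("age")
```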


---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20662
  
**[Test build #87630 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87630/testReport)** for PR 20662 at commit [`bab27e6`](https://github.com/apache/spark/commit/bab27e6ebe415508101b7f09ba9f3f04801aa818).


---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20662
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1017/
Test PASSed.


---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20662
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...

2018-02-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20611
  
**[Test build #87629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87629/testReport)** for PR 20611 at commit [`22c71dd`](https://github.com/apache/spark/commit/22c71dd13dc0718a0d7c8ca68e6d82eb0cd5b172).


---




[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170208631
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala ---
@@ -77,31 +79,32 @@ class MicroBatchExecution(
       sparkSession.sqlContext.conf.disabledV2StreamingMicroBatchReaders.split(",")
 
     val _logicalPlan = analyzedPlan.transform {
-      case streamingRelation@StreamingRelation(dataSourceV1, sourceName, output) =>
-        toExecutionRelationMap.getOrElseUpdate(streamingRelation, {
+      case s @ StreamingRelation(dsV1, sourceName, output) =>
--- End diff --

yea I was touching the code here to retain the `DataSourceV2` instance, and 
fixed the code style.


---




[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170208349
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -23,11 +23,11 @@ import org.apache.spark.sql.execution.SparkPlan
 
 object DataSourceV2Strategy extends Strategy {
   override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-    case relation: DataSourceV2Relation =>
-      DataSourceV2ScanExec(relation.output, relation.reader) :: Nil
+    case r: DataSourceV2Relation =>
--- End diff --

`relation` is used many times in the following lines; renaming it to `r` 
shortens the code below.


---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20662
  
**[Test build #87628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87628/testReport)** for PR 20662 at commit [`619a371`](https://github.com/apache/spark/commit/619a37132ef8c23e63c66261ac525c6307cec23c).


---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20662
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20662
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1016/
Test PASSed.


---




[GitHub] spark issue #20651: [SPARK-23475][UI] Show also skipped stages

2018-02-23 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/20651
  
Thanks @vanzin, I created the PR for the backport.


---




[GitHub] spark pull request #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped...

2018-02-23 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/20662

[SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

## What changes were proposed in this pull request?

SPARK-20648 introduced the status `SKIPPED` for stages. On the UI, skipped 
stages were previously shown as `PENDING`; after that change, they are not 
shown on the UI at all.

The PR introduces a new section in order to also show `SKIPPED` stages in a 
proper table.

Manual backport to branch-2.3.


## How was this patch tested?

added UT


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-23475_2.3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20662.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20662


commit 619a37132ef8c23e63c66261ac525c6307cec23c
Author: Marco Gaido 
Date:   2018-02-22T19:00:12Z

[SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

SPARK-20648 introduced the status `SKIPPED` for stages. On the UI, skipped 
stages were previously shown as `PENDING`; after that change, they are not 
shown on the UI at all.

The PR introduces a new section in order to also show `SKIPPED` stages in a 
proper table.

manual tests

Author: Marco Gaido 

Closes #20651 from mgaido91/SPARK-23475.




---




[GitHub] spark issue #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages

2018-02-23 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/20662
  
cc @vanzin 


---




[GitHub] spark issue #10942: [SPARK-12850] [SQL] Support Bucket Pruning (Predicate Pu...

2018-02-23 Thread lonehacker
Github user lonehacker commented on the issue:

https://github.com/apache/spark/pull/10942
  
Thanks @cloud-fan. I couldn't find a JIRA that tracks this feature; could 
you help with that? This feature is very important for our use case, so it 
would be great to get any info on when it will be available.


---




[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

2018-02-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20658
  
**[Test build #87627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87627/testReport)** for PR 20658 at commit [`d7e03cd`](https://github.com/apache/spark/commit/d7e03cd507b58fda8830b580390f5224ee7c8d65).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20658
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87627/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20658
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20647
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

2018-02-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20658
  
**[Test build #87627 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87627/testReport)** for PR 20658 at commit [`d7e03cd`](https://github.com/apache/spark/commit/d7e03cd507b58fda8830b580390f5224ee7c8d65).


---




[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20647
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87624/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20647
  
**[Test build #87624 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87624/testReport)** for PR 20647 at commit [`dbee281`](https://github.com/apache/spark/commit/dbee2813accd4c8f5937b28eb9142cc6a50f8c6a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

2018-02-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20658
  
ok to test


---




[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-23 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170185948
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala ---
@@ -77,31 +79,32 @@ class MicroBatchExecution(
       sparkSession.sqlContext.conf.disabledV2StreamingMicroBatchReaders.split(",")
 
     val _logicalPlan = analyzedPlan.transform {
-      case streamingRelation@StreamingRelation(dataSourceV1, sourceName, output) =>
-        toExecutionRelationMap.getOrElseUpdate(streamingRelation, {
+      case s @ StreamingRelation(dsV1, sourceName, output) =>
--- End diff --

If you are touching that specific code then it's fine to fix the style, but 
in general I tend to agree that including spurious changes makes the diff 
harder to read and the commit harder to backport.

I've even seen guidelines that specifically prohibit fixing style just to 
fix style since it obfuscates the history.


---



