[GitHub] spark pull request #21982: [SPARK-23911][SQL] Add aggregate function.

2018-08-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21982


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21982: [SPARK-23911][SQL] Add aggregate function.

2018-08-04 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/21982
  
Thanks! merging to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21102
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21102
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94221/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21102
  
**[Test build #94221 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94221/testReport)**
 for PR 21102 at commit 
[`6fba1ee`](https://github.com/apache/spark/commit/6fba1ee8c3525a6f34bf5580737d067a8f0d976d).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class ArrayIntersect(left: Expression, right: Expression) extends 
ArraySetLike`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21977
  
**[Test build #94229 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94229/testReport)**
 for PR 21977 at commit 
[`0c0ff92`](https://github.com/apache/spark/commit/0c0ff92693945f3c5ae63f60af94b88281be0c32).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21977
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94229/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21977
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21977
  
**[Test build #94229 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94229/testReport)**
 for PR 21977 at commit 
[`0c0ff92`](https://github.com/apache/spark/commit/0c0ff92693945f3c5ae63f60af94b88281be0c32).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21977
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1812/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21977
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-04 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
@mallman: I've rebased on top of your changes and pushed. I'm seeing the 
following:

Given the following schema:
```
root
 |-- id: integer (nullable = true)
 |-- name: struct (nullable = true)
 ||-- first: string (nullable = true)
 ||-- middle: string (nullable = true)
 ||-- last: string (nullable = true)
 |-- address: string (nullable = true)
 |-- pets: integer (nullable = true)
 |-- friends: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- first: string (nullable = true)
 |||-- middle: string (nullable = true)
 |||-- last: string (nullable = true)
 |-- relatives: map (nullable = true)
 ||-- key: string
 ||-- value: struct (valueContainsNull = true)
 |||-- first: string (nullable = true)
 |||-- middle: string (nullable = true)
 |||-- last: string (nullable = true)
 |-- p: integer (nullable = true)
```

The query: `select name.middle, address from temp` throws:
```
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read 
value at 0 in block -1 in file 
file:/private/var/folders/ss/cw601dzn59b2nygs8k1bs78x75lhr0/T/spark-cab140ca-cbba-4dc1-9fe5-6ae739dab70a/contacts/p=2/part-0-91d2abf5-625f-4080-b34c-e373b89c9895-c000.snappy.parquet
  at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
  at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
  at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
  ... 20 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
  at java.util.ArrayList.rangeCheck(ArrayList.java:657)
  at java.util.ArrayList.get(ArrayList.java:433)
  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
  at 
org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:97)
  at 
org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:92)
  at 
org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:278)
  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
  at 
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
  at 
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
  at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
  at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
  ... 25 more
```

No root cause yet, but I noticed this while working with the unit tests.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21889
  
**[Test build #94228 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94228/testReport)**
 for PR 21889 at commit 
[`92901da`](https://github.com/apache/spark/commit/92901da3785ce94db501a4c3d9be6316cfbf29a9).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20838
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94219/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20838
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20838
  
**[Test build #94219 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94219/testReport)**
 for PR 20838 at commit 
[`e41a8cc`](https://github.com/apache/spark/commit/e41a8cc303eb428fe99bacae3b9c87be57b03125).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16677
  
**[Test build #94227 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94227/testReport)**
 for PR 16677 at commit 
[`69513d1`](https://github.com/apache/spark/commit/69513d166ee56587d7b039b5d9645299785dcb77).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1811/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16677
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21917: [SPARK-24720][STREAMING-KAFKA] add option to alig...

2018-08-04 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/21917#discussion_r207721681
  
--- Diff: 
external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/OffsetWithRecordScannerSuite.scala
 ---
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka010
+
+import java.{util => ju}
+
+import scala.collection.JavaConverters._
+
+import org.apache.kafka.clients.consumer.{Consumer, ConsumerRecord, 
ConsumerRecords}
+import org.apache.kafka.common.TopicPartition
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.internal.Logging
+
+class OffsetWithRecordScannerSuite
+  extends SparkFunSuite
+with Logging {
+
+  class OffsetWithRecordScannerMock[K, V](records: 
List[Option[ConsumerRecord[K, V]]])
+extends OffsetWithRecordScanner[K, V](
+  Map[String, Object]("isolation.level" -> "read_committed").asJava, 
1, 1, 0.75F, true) {
+var i = -1
+override protected def getNext(c: KafkaDataConsumer[K, V]): 
Option[ConsumerRecord[K, V]] = {
+  i = i + 1
+  records(i)
+}
+
+  }
+
+  val emptyConsumerRecords = new ConsumerRecords[String, 
String](ju.Collections.emptyMap())
+  val tp = new TopicPartition("topic", 0)
+
+  test("Rewinder construction should fail if isolation level isn set to 
read_committed") {
+intercept[IllegalStateException] {
+  new OffsetWithRecordScanner[String, String](
+Map[String, Object]("isolation.level" -> 
"read_uncommitted").asJava, 1, 1, 0.75F, true)
+}
+  }
+
+  test("Rewinder construction shouldn't fail if isolation level isn't 
set") {
+  assert(new OffsetWithRecordScanner[String, String](
+Map[String, Object]().asJava, 1, 1, 0.75F, true) != null)
+  }
+
+  test("Rewinder construction should fail if isolation level isn't set to 
committed") {
+intercept[IllegalStateException] {
+  new OffsetWithRecordScanner[String, String](
+Map[String, Object]("isolation.level" -> 
"read_uncommitted").asJava, 1, 1, 0.75F, true)
+}
+  }
+
+  test("Rewind should return the proper count.") {
+var scanner = new OffsetWithRecordScannerMock[String, String](
+  records(Some(0), Some(1), Some(2), Some(3)))
+val (offset, size) = scanner.iterateUntilLastOrEmpty(0, 0, null, 2)
+assert(offset === 2)
+assert(size === 2)
+  }
+
+  test("Rewind should return the proper count with gap") {
+var scanner = new OffsetWithRecordScannerMock[String, String](
+  records(Some(0), Some(1), Some(3), Some(4), Some(5)))
+val (offset, size) = scanner.iterateUntilLastOrEmpty(0, 0, null, 3)
+assert(offset === 4)
+assert(size === 3)
+  }
+
+  test("Rewind should return the proper count for the end of the 
iterator") {
+var scanner = new OffsetWithRecordScannerMock[String, String](
+  records(Some(0), Some(1), Some(2), None))
+val (offset, size) = scanner.iterateUntilLastOrEmpty(0, 0, null, 3)
+assert(offset === 3)
+assert(size === 3)
+  }
+
+  test("Rewind should return the proper count missing data") {
+var scanner = new OffsetWithRecordScannerMock[String, String](
+  records(Some(0), None))
+val (offset, size) = scanner.iterateUntilLastOrEmpty(0, 0, null, 2)
+assert(offset === 1)
+assert(size === 1)
+  }
+
+  test("Rewind should return the proper count without data") {
+var scanner = new OffsetWithRecordScannerMock[String, String](
+  records(None))
+val (offset, size) = scanner.iterateUntilLastOrEmpty(0, 0, null, 2)
+assert(offset === 0)
+assert(size === 0)
+  }
+
+  private def records(offsets: Option[Long]*) = {
+offsets.map(o => o.map(new ConsumerRecord("topic", 0, _, "k", 
"v"))).toList
+  }
+}
--- End diff --

Th

[GitHub] spark pull request #21917: [SPARK-24720][STREAMING-KAFKA] add option to alig...

2018-08-04 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/21917#discussion_r207721657
  
--- Diff: 
external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/OffsetRange.scala
 ---
@@ -90,21 +90,23 @@ final class OffsetRange private(
 val topic: String,
 val partition: Int,
 val fromOffset: Long,
-val untilOffset: Long) extends Serializable {
+val untilOffset: Long,
+val recordNumber: Long) extends Serializable {
--- End diff --

Does mima actually complain about binary compatibility if you just make 
recordNumber count?  It's just an accessor either way...

If so, and you have to do this, I'd name this recordCount consistently 
throughout.  Number could refer to a lot of things that aren't counts.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94220/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16677
  
**[Test build #94220 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94220/testReport)**
 for PR 16677 at commit 
[`69513d1`](https://github.com/apache/spark/commit/69513d166ee56587d7b039b5d9645299785dcb77).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21917: [SPARK-24720][STREAMING-KAFKA] add option to alig...

2018-08-04 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/21917#discussion_r207721492
  
--- Diff: 
external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala
 ---
@@ -191,6 +211,11 @@ private[kafka010] class InternalKafkaConsumer[K, V](
 buffer.previous()
   }
 
+  def seekAndPoll(offset: Long, timeout: Long): ConsumerRecords[K, V] = {
--- End diff --

Is this used anywhere?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21917: [SPARK-24720][STREAMING-KAFKA] add option to alig...

2018-08-04 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/21917#discussion_r207721482
  
--- Diff: 
external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala
 ---
@@ -183,6 +187,22 @@ private[kafka010] class InternalKafkaConsumer[K, V](
 record
   }
 
+  /**
+   * Similar to compactedStart but will return None if poll doesn't
--- End diff --

Did you mean compactedNext?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21917: [SPARK-24720][STREAMING-KAFKA] add option to alig...

2018-08-04 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/21917#discussion_r207721435
  
--- Diff: 
external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala
 ---
@@ -223,17 +240,46 @@ private[spark] class DirectKafkaInputDStream[K, V](
 }.getOrElse(offsets)
   }
 
-  override def compute(validTime: Time): Option[KafkaRDD[K, V]] = {
-val untilOffsets = clamp(latestOffsets())
-val offsetRanges = untilOffsets.map { case (tp, uo) =>
-  val fo = currentOffsets(tp)
-  OffsetRange(tp.topic, tp.partition, fo, uo)
+  /**
+   * Return the offset range. For non consecutive offset the last offset 
must have record.
+   * If offsets have missing data (transaction marker or abort), increases 
the
+   * range until we get the requested number of record or no more records.
+   * Because we have to iterate over all the records in this case,
+   * we also return the total number of records.
+   * @param offsets the target range we would like if offset were continue
+   * @return (totalNumberOfRecords, updated offset)
+   */
+  private def alignRanges(offsets: Map[TopicPartition, Long]): 
Iterable[OffsetRange] = {
+if (nonConsecutive) {
+  val localRw = rewinder()
+  val localOffsets = currentOffsets
+  context.sparkContext.parallelize(offsets.toList).mapPartitions(tpos 
=> {
--- End diff --

Because this isn't a kafka rdd, it isn't going to take advantage of 
preferred locations, which means it's going to create cached consumers on 
different executors.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-04 Thread koeninger
Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/21917
  
> How do you know that offset 4 isn't just lost because poll failed?

By failed, you mean returned an empty collection after timing out, even 
though records should be available?  You don't.  You also don't know that it 
isn't just lost because kafka skipped a message.  AFAIK from the information 
you have from a kafka consumer, once you start allowing gaps in offsets, you 
don't know.

I understand your point, but even under your proposal you have no guarantee 
that the poll won't work in your first pass during RDD construction, and then 
fail on the executor during computation, right?

> The issue with your proposal is that SeekToEnd gives you the last offset 
which might not be the last record.

Have you tested comparing the results of consumer.endOffsets for consumers 
with different isolation levels?

Your proposal might end up being the best approach anyway, just because of 
the unfortunate effect of StreamInputInfo and count, but I want to make sure we 
think this through.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22000: [SPARK-25025][SQL] Remove the default value of isAll in ...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22000
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1810/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22000: [SPARK-25025][SQL] Remove the default value of isAll in ...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22000
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22000: [SPARK-25025][SQL] Remove the default value of isAll in ...

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22000
  
**[Test build #94226 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94226/testReport)**
 for PR 22000 at commit 
[`5765098`](https://github.com/apache/spark/commit/57650985eca7de23c6c8cf2812d673702c8e1f3d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22000: [SPARK-25025][SQL] Remove the default value of is...

2018-08-04 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22000

[SPARK-25025][SQL] Remove the default value of isAll in INTERSECT/EXCEPT

## What changes were proposed in this pull request?

Having the default value of isAll in the logical plan nodes 
INTERSECT/EXCEPT could introduce bugs when the callers are not aware of it. 
This PR removes the default value and makes caller explicitly specify them.

## How was this patch tested?
This is a refactoring changes. Existing tests test the functionality 
already.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-25025

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22000.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22000


commit 57650985eca7de23c6c8cf2812d673702c8e1f3d
Author: Dilip Biswal 
Date:   2018-08-04T21:32:57Z

[SPARK-25025] Remove the default value of isAll in INTERSECT/EXCEPT




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-04 Thread QuentinAmbard
Github user QuentinAmbard commented on the issue:

https://github.com/apache/spark/pull/21917
  
If you are doing it in advance you'll change the range, so for example you 
read until 3 and don't get any extra results. Maybe it's because of a 
transaction offset, maybe another issue, it's ok in both cases.
The big difference is that the next batch will restart from offset 3 and 
poll from this value. If seek to 3 and poll get you another result (for example 
6) then everything is fine  it's not a data loss it's just a gap.
The issue with your proposal is that SeekToEnd gives you the last offset 
which might not be the last record.
So in your example if last offset is 5 and after a few poll the last record 
you get is 3 what do you do, continue and execute the next batch from 5? How do 
you know that offset 4 isn't just lost because poll failed?
The only way to know that would be to get a record with an offset higher 
than 5. In this case you know it's just a gap. 
But if the message you are reading is the last of the topic you won't have 
records higher than 3, do you can't tell if it's a poll failure or an empty 
offset because of the transaction commit


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21898
  
**[Test build #94225 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94225/testReport)**
 for PR 21898 at commit 
[`53aa316`](https://github.com/apache/spark/commit/53aa316bb10344fdec3ed4378f9386a3400fa8cb).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1809/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-04 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/21898
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-04 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/21898
  
It might be caused by jenkins workload. The barrier test also failed due to 
timeout: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94214/testReport/org.apache.spark.scheduler/BarrierTaskContextSuite/throw_exception_if_barrier___call_doesn_t_happen_on_every_task/.
 This means we should increase the timeout in our test to handle heavy workload 
scenarios.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-04 Thread koeninger
Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/21917
  
If the last offset in the range as calculated by the driver is 5, and on 
the executor all you can poll up to after a repeated attempt is 3, and the user 
already told you to allowNonConsecutiveOffsets... then you're done, no error.

Why does it matter if you do this logic when you're reading all the 
messages in advance and counting, or when you're actually computing? 

To put it another way, this PR is a lot of code change and refactoring, why 
not just change the logic of e.g. how CompactedKafkaRDDIterator interacts with 
compactedNext?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21320
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1808/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21320
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21320
  
**[Test build #94224 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94224/testReport)**
 for PR 21320 at commit 
[`a87b589`](https://github.com/apache/spark/commit/a87b589c70af69d25424b03b52e29165ab4e0b5f).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-08-04 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/21889#discussion_r207719862
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -0,0 +1,205 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.execution.FileSchemaPruningTest
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ParquetSchemaPruningSuite
+extends QueryTest
+with ParquetTest
+with FileSchemaPruningTest
+with SharedSQLContext {
+  case class FullName(first: String, middle: String, last: String)
+  case class Contact(
+id: Int,
+name: FullName,
+address: String,
+pets: Int,
+friends: Array[FullName] = Array(),
+relatives: Map[String, FullName] = Map())
+
+  val janeDoe = FullName("Jane", "X.", "Doe")
+  val johnDoe = FullName("John", "Y.", "Doe")
+  val susanSmith = FullName("Susan", "Z.", "Smith")
+
+  val contacts =
+Contact(0, janeDoe, "123 Main Street", 1, friends = Array(susanSmith),
+  relatives = Map("brother" -> johnDoe)) ::
+Contact(1, johnDoe, "321 Wall Street", 3, relatives = Map("sister" -> 
janeDoe)) :: Nil
+
+  case class Name(first: String, last: String)
+  case class BriefContact(id: Int, name: Name, address: String)
+
+  val briefContacts =
+BriefContact(2, Name("Janet", "Jones"), "567 Maple Drive") ::
+BriefContact(3, Name("Jim", "Jones"), "6242 Ash Street") :: Nil
+
+  case class ContactWithDataPartitionColumn(
+id: Int,
+name: FullName,
+address: String,
+pets: Int,
+friends: Array[FullName] = Array(),
+relatives: Map[String, FullName] = Map(),
+p: Int)
+
+  case class BriefContactWithDataPartitionColumn(id: Int, name: Name, 
address: String, p: Int)
+
+  val contactsWithDataPartitionColumn =
+contacts.map { case Contact(id, name, address, pets, friends, 
relatives) =>
+  ContactWithDataPartitionColumn(id, name, address, pets, friends, 
relatives, 1) }
+  val briefContactsWithDataPartitionColumn =
+briefContacts.map { case BriefContact(id, name, address) =>
+  BriefContactWithDataPartitionColumn(id, name, address, 2) }
+
+  testSchemaPruning("select a single complex field") {
+val query = sql("select name.middle from contacts order by id")
+checkScanSchemata(query, "struct>")
+checkAnswer(query, Row("X.") :: Row("Y.") :: Row(null) :: Row(null) :: 
Nil)
+  }
+
+  testSchemaPruning("select a single complex field and its parent struct") 
{
+val query = sql("select name.middle, name from contacts order by id")
+checkScanSchemata(query, 
"struct>")
+checkAnswer(query,
+  Row("X.", Row("Jane", "X.", "Doe")) ::
+  Row("Y.", Row("John", "Y.", "Doe")) ::
+  Row(null, Row("Janet", null, "Jones")) ::
+  Row(null, Row("Jim", null, "Jones")) ::
+  Nil)
+  }
+
+  testSchemaPruning("select a single complex field array and its parent 
struct array") {
+val query = sql("select friends.middle, friends from contacts where 
p=1 order by id")
+checkScanSchemata(query,
+  
"struct>>")
+checkAnswer(query,
+  Row(Array("Z."), Array(Row("Susan", "Z.", "Smith"))) ::
+  Row(Array.empty[String], Array.empty[Row]) ::
+  Nil)
+  }
+
+  testSchemaPruning("select a single complex field from a map entry and 
its parent map entry") {
+val query =
+  sql("select relatives[\"brother\"].middle, relatives[\"brother\"] 
from contacts where p=1 " +
+"order by id")
+checkScanSchemata(query,
+  
"struct>>")
+checkAnswer(query,
+  Row(

[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-04 Thread QuentinAmbard
Github user QuentinAmbard commented on the issue:

https://github.com/apache/spark/pull/21917
  
I'm not sure to understand your point. The cause of the gap doesn't matter, 
we just want to stop on an existing offset to be able to poll it. It can be 
because of a transaction marker, a transaction abort or even just a temporary 
poll failure it's not relevant in this case.
The driver is smart enough to be able to restart from any Offset, even in 
the middle of a transaction (abort or not)
The issue with gap at the end is that you can't know if it's a gap or if 
the poll failed. 
For example SeekToEnd gives you 5 but the last record you get is 3 and 
there is no way to know if 4 is missing or just an offset gap.
How could we fix that in a different way?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21999
  
**[Test build #94223 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94223/testReport)**
 for PR 21999 at commit 
[`8de1465`](https://github.com/apache/spark/commit/8de14652b838ea053f430d17129c73c85cb2e0cb).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21999
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94222/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21999
  
**[Test build #94222 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94222/testReport)**
 for PR 21999 at commit 
[`5b568c6`](https://github.com/apache/spark/commit/5b568c67951f6f620cd0d549fdbd0c25f819fe43).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public abstract class AbstractLauncher> `
  * `case class ArrayFilter(`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21999
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21999
  
**[Test build #94222 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94222/testReport)**
 for PR 21999 at commit 
[`5b568c6`](https://github.com/apache/spark/commit/5b568c67951f6f620cd0d549fdbd0c25f819fe43).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21999
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21999
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21999: [WIP][SQL] Flattening nested structures

2018-08-04 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/21999

[WIP][SQL] Flattening nested structures

## What changes were proposed in this pull request?

In the PR, I propose new unary expression `StructFlatten` for flattening 
nested structures. For example, a dataset with the schema:

```
root
 |-- st: struct (nullable = false)
 ||-- col1: long (nullable = false)
 ||-- col2: struct (nullable = false)
 |||-- col3: long (nullable = false)
```
by applying `struct_flatten(st)` it will be transformed to:

```
root
 |-- structflatten(st): struct (nullable = false)
 ||-- col1: long (nullable = false)
 ||-- col2_col3: long (nullable = false)
```

## How was this patch tested?

Added new tests to `CollectionExpressionsSuite` to check flattening of 2-3 
nested structures and negative tests to be sure that `struct_flatten` doesn't 
affect other types.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 struct_flatten

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21999.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21999


commit 5603918ae963f78aafb2d1f4f2bd9d566870495b
Author: Maxim Gekk 
Date:   2018-08-04T13:38:08Z

Initial implementation

commit 0be0d059b8bf571068226c515888a64093468cff
Author: Maxim Gekk 
Date:   2018-08-04T16:07:45Z

Making the depth and delimiter as parameters

commit 5666ec372a4b79f6161120584abc0c312b111bfb
Author: Maxim Gekk 
Date:   2018-08-04T18:04:23Z

Test for depth = 0

commit cd88a2125ba6932ba1fdceca1a24d57124a23afa
Author: Maxim Gekk 
Date:   2018-08-04T18:21:19Z

Test for depth = 1

commit b0da02d37ac6db38f63bac95dc295ac37fe4a692
Author: Maxim Gekk 
Date:   2018-08-04T18:30:18Z

Renaming st to struct

commit ec361791b83d71f29823157a2c2b49162ddb5901
Author: Maxim Gekk 
Date:   2018-08-04T19:24:37Z

Negative tests

commit ced63d7f093c168e2bc9457b6c08b87bfe6c0751
Author: Maxim Gekk 
Date:   2018-08-04T20:10:00Z

Register struct_flatten

commit 5b568c67951f6f620cd0d549fdbd0c25f819fe43
Author: Maxim Gekk 
Date:   2018-08-04T20:42:00Z

Merge remote-tracking branch 'origin/master' into struct_flatten

# Conflicts:
#   
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-08-04 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/21889#discussion_r207718734
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -0,0 +1,205 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.execution.FileSchemaPruningTest
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ParquetSchemaPruningSuite
+extends QueryTest
+with ParquetTest
+with FileSchemaPruningTest
+with SharedSQLContext {
+  case class FullName(first: String, middle: String, last: String)
+  case class Contact(
+id: Int,
+name: FullName,
+address: String,
+pets: Int,
+friends: Array[FullName] = Array(),
+relatives: Map[String, FullName] = Map())
+
+  val janeDoe = FullName("Jane", "X.", "Doe")
+  val johnDoe = FullName("John", "Y.", "Doe")
+  val susanSmith = FullName("Susan", "Z.", "Smith")
+
+  val contacts =
+Contact(0, janeDoe, "123 Main Street", 1, friends = Array(susanSmith),
+  relatives = Map("brother" -> johnDoe)) ::
+Contact(1, johnDoe, "321 Wall Street", 3, relatives = Map("sister" -> 
janeDoe)) :: Nil
+
+  case class Name(first: String, last: String)
+  case class BriefContact(id: Int, name: Name, address: String)
+
+  val briefContacts =
+BriefContact(2, Name("Janet", "Jones"), "567 Maple Drive") ::
+BriefContact(3, Name("Jim", "Jones"), "6242 Ash Street") :: Nil
+
+  case class ContactWithDataPartitionColumn(
+id: Int,
+name: FullName,
+address: String,
+pets: Int,
+friends: Array[FullName] = Array(),
+relatives: Map[String, FullName] = Map(),
+p: Int)
+
+  case class BriefContactWithDataPartitionColumn(id: Int, name: Name, 
address: String, p: Int)
+
+  val contactsWithDataPartitionColumn =
+contacts.map { case Contact(id, name, address, pets, friends, 
relatives) =>
+  ContactWithDataPartitionColumn(id, name, address, pets, friends, 
relatives, 1) }
+  val briefContactsWithDataPartitionColumn =
+briefContacts.map { case BriefContact(id, name, address) =>
+  BriefContactWithDataPartitionColumn(id, name, address, 2) }
+
+  testSchemaPruning("select a single complex field") {
+val query = sql("select name.middle from contacts order by id")
+checkScanSchemata(query, "struct>")
+checkAnswer(query, Row("X.") :: Row("Y.") :: Row(null) :: Row(null) :: 
Nil)
+  }
+
+  testSchemaPruning("select a single complex field and its parent struct") 
{
+val query = sql("select name.middle, name from contacts order by id")
+checkScanSchemata(query, 
"struct>")
+checkAnswer(query,
+  Row("X.", Row("Jane", "X.", "Doe")) ::
+  Row("Y.", Row("John", "Y.", "Doe")) ::
+  Row(null, Row("Janet", null, "Jones")) ::
+  Row(null, Row("Jim", null, "Jones")) ::
+  Nil)
+  }
+
+  testSchemaPruning("select a single complex field array and its parent 
struct array") {
+val query = sql("select friends.middle, friends from contacts where 
p=1 order by id")
+checkScanSchemata(query,
+  
"struct>>")
+checkAnswer(query,
+  Row(Array("Z."), Array(Row("Susan", "Z.", "Smith"))) ::
+  Row(Array.empty[String], Array.empty[Row]) ::
+  Nil)
+  }
+
+  testSchemaPruning("select a single complex field from a map entry and 
its parent map entry") {
+val query =
+  sql("select relatives[\"brother\"].middle, relatives[\"brother\"] 
from contacts where p=1 " +
+"order by id")
+checkScanSchemata(query,
+  
"struct>>")
+checkAnswer(query,
+  Row(

[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-08-04 Thread ajacques
Github user ajacques commented on a diff in the pull request:

https://github.com/apache/spark/pull/21889#discussion_r207718713
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -0,0 +1,205 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.execution.FileSchemaPruningTest
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ParquetSchemaPruningSuite
+extends QueryTest
+with ParquetTest
+with FileSchemaPruningTest
+with SharedSQLContext {
+  case class FullName(first: String, middle: String, last: String)
+  case class Contact(
+id: Int,
+name: FullName,
+address: String,
+pets: Int,
+friends: Array[FullName] = Array(),
+relatives: Map[String, FullName] = Map())
+
+  val janeDoe = FullName("Jane", "X.", "Doe")
+  val johnDoe = FullName("John", "Y.", "Doe")
+  val susanSmith = FullName("Susan", "Z.", "Smith")
+
+  val contacts =
+Contact(0, janeDoe, "123 Main Street", 1, friends = Array(susanSmith),
+  relatives = Map("brother" -> johnDoe)) ::
+Contact(1, johnDoe, "321 Wall Street", 3, relatives = Map("sister" -> 
janeDoe)) :: Nil
+
+  case class Name(first: String, last: String)
+  case class BriefContact(id: Int, name: Name, address: String)
+
+  val briefContacts =
+BriefContact(2, Name("Janet", "Jones"), "567 Maple Drive") ::
+BriefContact(3, Name("Jim", "Jones"), "6242 Ash Street") :: Nil
+
+  case class ContactWithDataPartitionColumn(
+id: Int,
+name: FullName,
+address: String,
+pets: Int,
+friends: Array[FullName] = Array(),
+relatives: Map[String, FullName] = Map(),
+p: Int)
+
+  case class BriefContactWithDataPartitionColumn(id: Int, name: Name, 
address: String, p: Int)
+
+  val contactsWithDataPartitionColumn =
+contacts.map { case Contact(id, name, address, pets, friends, 
relatives) =>
+  ContactWithDataPartitionColumn(id, name, address, pets, friends, 
relatives, 1) }
+  val briefContactsWithDataPartitionColumn =
+briefContacts.map { case BriefContact(id, name, address) =>
+  BriefContactWithDataPartitionColumn(id, name, address, 2) }
+
+  testSchemaPruning("select a single complex field") {
+val query = sql("select name.middle from contacts order by id")
+checkScanSchemata(query, "struct>")
+checkAnswer(query, Row("X.") :: Row("Y.") :: Row(null) :: Row(null) :: 
Nil)
+  }
+
+  testSchemaPruning("select a single complex field and its parent struct") 
{
+val query = sql("select name.middle, name from contacts order by id")
+checkScanSchemata(query, 
"struct>")
+checkAnswer(query,
+  Row("X.", Row("Jane", "X.", "Doe")) ::
+  Row("Y.", Row("John", "Y.", "Doe")) ::
+  Row(null, Row("Janet", null, "Jones")) ::
+  Row(null, Row("Jim", null, "Jones")) ::
+  Nil)
+  }
+
+  testSchemaPruning("select a single complex field array and its parent 
struct array") {
+val query = sql("select friends.middle, friends from contacts where 
p=1 order by id")
+checkScanSchemata(query,
+  
"struct>>")
+checkAnswer(query,
+  Row(Array("Z."), Array(Row("Susan", "Z.", "Smith"))) ::
+  Row(Array.empty[String], Array.empty[Row]) ::
+  Nil)
+  }
+
+  testSchemaPruning("select a single complex field from a map entry and 
its parent map entry") {
+val query =
+  sql("select relatives[\"brother\"].middle, relatives[\"brother\"] 
from contacts where p=1 " +
+"order by id")
+checkScanSchemata(query,
+  
"struct>>")
+checkAnswer(query,
+  Row

[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21102
  
**[Test build #94221 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94221/testReport)**
 for PR 21102 at commit 
[`6fba1ee`](https://github.com/apache/spark/commit/6fba1ee8c3525a6f34bf5580737d067a8f0d976d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21102
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function

2018-08-04 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21102
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21102
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1807/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...

2018-08-04 Thread koeninger
Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/21917
  
Still playing devil's advocate here, I don't think stopping at 3 in your 
example actually tells you anything about the cause of the gaps in the sequence 
at 4.  I'm not sure you can know that the gap is because of a transaction 
marker, without a modified kafka consumer library.

If the actual problem is that when allowNonConsecutiveOffsets is set we 
need to allow gaps even at the end of an offset range... why not just fix that 
directly?

Master is updated to kafka 2.0 at this point, so we should be able to write 
a test for your original jira example of a partition consisting of 1 message 
followed by 1 transaction commit.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21987: [SPARK-25015][BUILD] Update Hadoop 2.7 to 2.7.7

2018-08-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21987


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21987: [SPARK-25015][BUILD] Update Hadoop 2.7 to 2.7.7

2018-08-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21987
  
Merged to master/2.3


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-04 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207718090
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 ---
@@ -169,25 +181,50 @@ package object expressions  {
 })
   }
 
-  // Find matches for the given name assuming that the 1st part is a 
qualifier (i.e. table name,
-  // alias, or subquery alias) and the 2nd part is the actual name. 
This returns a tuple of
+  // Find matches for the given name assuming that the 1st two parts 
are qualifier
+  // (i.e. database name and table name) and the 3rd part is the 
actual column name.
+  //
+  // For example, consider an example where "db1" is the database 
name, "a" is the table name
+  // and "b" is the column name and "c" is the struct field name.
+  // If the name parts is db1.a.b.c, then Attribute will match
--- End diff --

In this case, if a.b.c fails to resolve as db.table.column, then we check 
if there is a table  and column that matches a.b and then see if c is a nested 
field name and if it exists, it will resolve to the nested field. 

Tests with struct nested fields are 
[here](https://github.com/skambha/spark/blob/90cd6d33f59fdae16a3a386ed14cefb3f28d35a8/sql/core/src/test/resources/sql-tests/inputs/columnresolution.sql#L65-L73)




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21987: [SPARK-25015][BUILD] Update Hadoop 2.7 to 2.7.7

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21987
  
**[Test build #4234 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4234/testReport)**
 for PR 21987 at commit 
[`654b918`](https://github.com/apache/spark/commit/654b918fc536de67c88e1bc4d618d27aa4fb76f1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21997: [SPARK-24987][SS] - Fix Kafka consumer leak when ...

2018-08-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21997


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-04 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207717569
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
 ---
@@ -316,8 +345,8 @@ case class UnresolvedRegex(regexPattern: String, table: 
Option[String], caseSens
   // If there is no table specified, use all input attributes that 
match expr
   case None => input.output.filter(_.name.matches(pattern))
   // If there is a table, pick out attributes that are part of this 
table that match expr
-  case Some(t) => input.output.filter(_.qualifier.exists(resolver(_, 
t)))
-.filter(_.name.matches(pattern))
+  case Some(t) => input.output.filter(a => resolver(a.qualifier.last, 
t)).
--- End diff --

Sure. Will add a defensive check. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94214/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-04 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207717536
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala
 ---
@@ -536,12 +536,13 @@ abstract class SessionCatalogSuite extends 
AnalysisTest {
   assert(metadata.viewText.isDefined)
   val view = View(desc = metadata, output = 
metadata.schema.toAttributes,
 child = CatalystSqlParser.parsePlan(metadata.viewText.get))
+  val alias = AliasIdentifier("view1", Some("db3"))
--- End diff --

Sure, let me do that if this style is preferred. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21997: [SPARK-24987][SS] - Fix Kafka consumer leak when no new ...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21997
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21997: [SPARK-24987][SS] - Fix Kafka consumer leak when no new ...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21997
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94218/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21997: [SPARK-24987][SS] - Fix Kafka consumer leak when no new ...

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21997
  
**[Test build #94218 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94218/testReport)**
 for PR 21997 at commit 
[`7558d42`](https://github.com/apache/spark/commit/7558d422ae24daf9d3cffc43b5ef3d975c4c9d3a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class ArrayFilter(`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21898
  
**[Test build #94214 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94214/testReport)**
 for PR 21898 at commit 
[`53aa316`](https://github.com/apache/spark/commit/53aa316bb10344fdec3ed4378f9386a3400fa8cb).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21980: [SPARK-25010][SQL] Rand/Randn should produce different v...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21980
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21980: [SPARK-25010][SQL] Rand/Randn should produce different v...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21980
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94215/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1806/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21980: [SPARK-25010][SQL] Rand/Randn should produce different v...

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21980
  
**[Test build #94215 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94215/testReport)**
 for PR 21980 at commit 
[`39db5aa`](https://github.com/apache/spark/commit/39db5aa65221a401179a47ca58a9f32762ee1509).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class Uuid(randomSeed: Option[Long] = None) extends 
LeafExpression with Stateful`
  * `trait ExpressionWithRandomSeed `
  * `case class Rand(child: Expression) extends RDG with 
ExpressionWithRandomSeed `
  * `case class Randn(child: Expression) extends RDG with 
ExpressionWithRandomSeed `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20838
  
**[Test build #94219 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94219/testReport)**
 for PR 20838 at commit 
[`e41a8cc`](https://github.com/apache/spark/commit/e41a8cc303eb428fe99bacae3b9c87be57b03125).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16677
  
**[Test build #94220 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94220/testReport)**
 for PR 16677 at commit 
[`69513d1`](https://github.com/apache/spark/commit/69513d166ee56587d7b039b5d9645299785dcb77).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-04 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/20838
  
jenkins, retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16677
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-04 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/20838
  
looks to me everything passes


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94216/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21996: [SPARK-24888][CORE] spark-submit --master spark:/...

2018-08-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21996#discussion_r207717051
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -98,17 +98,24 @@ private[spark] class SparkSubmit extends Logging {
* Kill an existing submission using the REST protocol. Standalone and 
Mesos cluster mode only.
*/
   private def kill(args: SparkSubmitArguments): Unit = {
-new RestSubmissionClient(args.master)
-  .killSubmission(args.submissionToKill)
+createRestSubmissionClient(args).killSubmission(args.submissionToKill)
   }
 
   /**
* Request the status of an existing submission using the REST protocol.
* Standalone and Mesos cluster mode only.
*/
   private def requestStatus(args: SparkSubmitArguments): Unit = {
-new RestSubmissionClient(args.master)
-  .requestSubmissionStatus(args.submissionToRequestStatusFor)
+
createRestSubmissionClient(args).requestSubmissionStatus(args.submissionToRequestStatusFor)
+  }
+
+  /**
+   * Creates RestSubmissionClient with overridden logInfo()
+   */
+  private def createRestSubmissionClient(args: SparkSubmitArguments): 
RestSubmissionClient = {
+new RestSubmissionClient(args.master) {
+  override protected def logInfo(msg: => String): Unit = 
printMessage(msg)
--- End diff --

this is not necessarily always the case - user can config log level easily?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16677
  
**[Test build #94216 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94216/testReport)**
 for PR 16677 at commit 
[`69513d1`](https://github.com/apache/spark/commit/69513d166ee56587d7b039b5d9645299785dcb77).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21996: [SPARK-24888][CORE] spark-submit --master spark://host:p...

2018-08-04 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/21996
  
I think we generally describe the change in PR title. what user see you can 
put as JIRA title.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-04 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/21977
  
build error
```
[error] 
/home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala:88:
 method sparkContext in class SparkPlan cannot be accessed in 
org.apache.spark.sql.execution.SparkPlan
[error]  Access to protected method sparkContext not permitted because
[error]  prefix type org.apache.spark.sql.execution.SparkPlan does not 
conform to
[error]  class ArrowEvalPythonExec in package python where the access take 
place
[error]   child.sparkContext.conf).compute(batchIter, 
context.partitionId(), context)
[error] ^
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

2018-08-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21977#discussion_r207716854
  
--- Diff: python/pyspark/worker.py ---
@@ -259,6 +260,26 @@ def main(infile, outfile):
  "PYSPARK_DRIVER_PYTHON are correctly set.") %
 ("%d.%d" % sys.version_info[:2], version))
 
+# set up memory limits
+memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', 
"-1"))
+total_memory = resource.RLIMIT_AS
+try:
+(total_memory_limit, max_total_memory) = 
resource.getrlimit(total_memory)
+msg = "Current mem: {0} of max 
{1}\n".format(total_memory_limit, max_total_memory)
+sys.stderr.write(msg)
--- End diff --

seems like the pattern `print(msg, file=sys.stderr)` is used here


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

2018-08-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21977#discussion_r207716877
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 ---
@@ -133,10 +133,17 @@ private[yarn] class YarnAllocator(
   // Additional memory overhead.
   protected val memoryOverhead: Int = 
sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
 math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, 
MEMORY_OVERHEAD_MIN)).toInt
+  protected val pysparkWorkerMemory: Int = if 
(sparkConf.get(IS_PYTHON_APP)) {
+sparkConf.get(PYSPARK_EXECUTOR_MEMORY).map(_.toInt).getOrElse(0)
--- End diff --

nit: default to -1 to be consistent?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

2018-08-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21977#discussion_r207716893
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 ---
@@ -133,10 +133,17 @@ private[yarn] class YarnAllocator(
   // Additional memory overhead.
   protected val memoryOverhead: Int = 
sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
 math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, 
MEMORY_OVERHEAD_MIN)).toInt
+  protected val pysparkWorkerMemory: Int = if 
(sparkConf.get(IS_PYTHON_APP)) {
+sparkConf.get(PYSPARK_EXECUTOR_MEMORY).map(_.toInt).getOrElse(0)
--- End diff --

or just use 0 in worker.py too


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

2018-08-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21977#discussion_r207716813
  
--- Diff: python/pyspark/worker.py ---
@@ -259,6 +260,26 @@ def main(infile, outfile):
  "PYSPARK_DRIVER_PYTHON are correctly set.") %
 ("%d.%d" % sys.version_info[:2], version))
 
+# set up memory limits
+memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', 
"-1"))
+total_memory = resource.RLIMIT_AS
+try:
+(total_memory_limit, max_total_memory) = 
resource.getrlimit(total_memory)
+msg = "Current mem: {0} of max 
{1}\n".format(total_memory_limit, max_total_memory)
+sys.stderr.write(msg)
+
+if memory_limit_mb > 0 and total_memory_limit == 
resource.RLIM_INFINITY:
+# convert to bytes
+total_memory_limit = memory_limit_mb * 1024 * 1024
+
+msg = "Setting mem to {0} of max 
{1}\n".format(total_memory_limit, max_total_memory)
+sys.stderr.write(msg)
+resource.setrlimit(total_memory, (total_memory_limit, 
total_memory_limit))
+
+except (resource.error, OSError) as e:
+# not all systems support resource limits, so warn instead of 
failing
+sys.stderr.write("WARN: Failed to set memory limit: 
{0}\n".format(e))
--- End diff --

catch ValueError also in the case hard limit can't be set (if it's 
otherwise set)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21998: [SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesce...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21998
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21919: [SPARK-24933][SS] Report numOutputRows in SinkProgress

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21919
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94217/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21919: [SPARK-24933][SS] Report numOutputRows in SinkProgress

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21919
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21919: [SPARK-24933][SS] Report numOutputRows in SinkProgress

2018-08-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21919
  
**[Test build #94217 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94217/testReport)**
 for PR 21919 at commit 
[`fde6053`](https://github.com/apache/spark/commit/fde6053f551ce292c486e2669e2ada50b61cc68b).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait StreamWriterProgressCollector `
  * `class MicroBatchWriter(batchId: Long, writer: StreamWriter) extends 
DataSourceWriter`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21998: [SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesce...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21998
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21998: [SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesce...

2018-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21998
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21998: [SPARK-24940][SQL] Use IntegerLiteral in ResolveC...

2018-08-04 Thread jzhuge
GitHub user jzhuge opened a pull request:

https://github.com/apache/spark/pull/21998

[SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesceHints

## What changes were proposed in this pull request?

Follow up to fix an unmerged review comment.

## How was this patch tested?

Unit test ResolveHintsSuite.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jzhuge/spark SPARK-24940

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21998.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21998


commit 8f0c57804e75b2a74f7573a21f6c7c63f7b85e03
Author: John Zhuge 
Date:   2018-08-04T19:01:46Z

[SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesceHints




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



<    1   2   3   4   >