[GitHub] spark pull request #21982: [SPARK-23911][SQL] Add aggregate function.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21982
[GitHub] spark issue #21982: [SPARK-23911][SQL] Add aggregate function.
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21982 Thanks! merging to master.
[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21102 Merged build finished. Test FAILed.
[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21102 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94221/ Test FAILed.
[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21102 **[Test build #94221 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94221/testReport)** for PR 21102 at commit [`6fba1ee`](https://github.com/apache/spark/commit/6fba1ee8c3525a6f34bf5580737d067a8f0d976d).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class ArrayIntersect(left: Expression, right: Expression) extends ArraySetLike`
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21977 **[Test build #94229 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94229/testReport)** for PR 21977 at commit [`0c0ff92`](https://github.com/apache/spark/commit/0c0ff92693945f3c5ae63f60af94b88281be0c32).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21977 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94229/ Test FAILed.
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21977 Merged build finished. Test FAILed.
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21977 **[Test build #94229 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94229/testReport)** for PR 21977 at commit [`0c0ff92`](https://github.com/apache/spark/commit/0c0ff92693945f3c5ae63f60af94b88281be0c32).
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21977 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1812/ Test PASSed.
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21977 Merged build finished. Test PASSed.
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889

@mallman: I've rebased on top of your changes and pushed. I'm seeing the following. Given the following schema:

```
root
 |-- id: integer (nullable = true)
 |-- name: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- middle: string (nullable = true)
 |    |-- last: string (nullable = true)
 |-- address: string (nullable = true)
 |-- pets: integer (nullable = true)
 |-- friends: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- middle: string (nullable = true)
 |    |    |-- last: string (nullable = true)
 |-- relatives: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- middle: string (nullable = true)
 |    |    |-- last: string (nullable = true)
 |-- p: integer (nullable = true)
```

The query `select name.middle, address from temp` throws:

```
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/private/var/folders/ss/cw601dzn59b2nygs8k1bs78x75lhr0/T/spark-cab140ca-cbba-4dc1-9fe5-6ae739dab70a/contacts/p=2/part-0-91d2abf5-625f-4080-b34c-e373b89c9895-c000.snappy.parquet
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
	at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
	... 20 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
	at java.util.ArrayList.rangeCheck(ArrayList.java:657)
	at java.util.ArrayList.get(ArrayList.java:433)
	at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
	at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
	at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:97)
	at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:92)
	at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:278)
	at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
	at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
	at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
	at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
	... 25 more
```

No root cause yet, but I noticed this while working with the unit tests.
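For anyone wanting to chase this outside the test suite, a minimal standalone sketch of the reproduction (assumptions: the conf key `spark.sql.optimizer.nestedSchemaPruning.enabled` and the local path are placeholders for whatever this patch actually introduces; the data is a trimmed version of the schema above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

case class FullName(first: String, middle: String, last: String)
case class Contact(id: Int, name: FullName, address: String, pets: Int)

object PruningRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pruning-repro").getOrCreate()
    import spark.implicits._

    // Assumed conf key for the feature under review; adjust to the PR's actual flag.
    spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

    // Write a partitioned parquet table with a nested struct column.
    val path = "/tmp/contacts"
    Seq(Contact(0, FullName("Jane", "X.", "Doe"), "123 Main Street", 1))
      .toDF()
      .withColumn("p", lit(2))
      .write.partitionBy("p").mode("overwrite").parquet(path)

    spark.read.parquet(path).createOrReplaceTempView("temp")
    // Selecting a nested field together with a top-level column is the shape
    // of query that produced the ParquetDecodingException above.
    spark.sql("select name.middle, address from temp").show()
  }
}
```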
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21889 **[Test build #94228 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94228/testReport)** for PR 21889 at commit [`92901da`](https://github.com/apache/spark/commit/92901da3785ce94db501a4c3d9be6316cfbf29a9).
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20838 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94219/ Test FAILed.
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20838 Merged build finished. Test FAILed.
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20838 **[Test build #94219 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94219/testReport)** for PR 20838 at commit [`e41a8cc`](https://github.com/apache/spark/commit/e41a8cc303eb428fe99bacae3b9c87be57b03125).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16677 **[Test build #94227 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94227/testReport)** for PR 16677 at commit [`69513d1`](https://github.com/apache/spark/commit/69513d166ee56587d7b039b5d9645299785dcb77).
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Merged build finished. Test PASSed.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1811/ Test PASSed.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16677 retest this please.
[GitHub] spark pull request #21917: [SPARK-24720][STREAMING-KAFKA] add option to alig...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/21917#discussion_r207721681

--- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/OffsetWithRecordScannerSuite.scala ---
```scala
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka010
+
+import java.{util => ju}
+
+import scala.collection.JavaConverters._
+
+import org.apache.kafka.clients.consumer.{Consumer, ConsumerRecord, ConsumerRecords}
+import org.apache.kafka.common.TopicPartition
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.internal.Logging
+
+class OffsetWithRecordScannerSuite
+  extends SparkFunSuite
+  with Logging {
+
+  class OffsetWithRecordScannerMock[K, V](records: List[Option[ConsumerRecord[K, V]]])
+    extends OffsetWithRecordScanner[K, V](
+      Map[String, Object]("isolation.level" -> "read_committed").asJava, 1, 1, 0.75F, true) {
+    var i = -1
+    override protected def getNext(c: KafkaDataConsumer[K, V]): Option[ConsumerRecord[K, V]] = {
+      i = i + 1
+      records(i)
+    }
+  }
+
+  val emptyConsumerRecords = new ConsumerRecords[String, String](ju.Collections.emptyMap())
+  val tp = new TopicPartition("topic", 0)
+
+  test("Rewinder construction should fail if isolation level isn't set to read_committed") {
+    intercept[IllegalStateException] {
+      new OffsetWithRecordScanner[String, String](
+        Map[String, Object]("isolation.level" -> "read_uncommitted").asJava, 1, 1, 0.75F, true)
+    }
+  }
+
+  test("Rewinder construction shouldn't fail if isolation level isn't set") {
+    assert(new OffsetWithRecordScanner[String, String](
+      Map[String, Object]().asJava, 1, 1, 0.75F, true) != null)
+  }
+
+  test("Rewinder construction should fail if isolation level isn't set to committed") {
+    intercept[IllegalStateException] {
+      new OffsetWithRecordScanner[String, String](
+        Map[String, Object]("isolation.level" -> "read_uncommitted").asJava, 1, 1, 0.75F, true)
+    }
+  }
+
+  test("Rewind should return the proper count.") {
+    var scanner = new OffsetWithRecordScannerMock[String, String](
+      records(Some(0), Some(1), Some(2), Some(3)))
+    val (offset, size) = scanner.iterateUntilLastOrEmpty(0, 0, null, 2)
+    assert(offset === 2)
+    assert(size === 2)
+  }
+
+  test("Rewind should return the proper count with gap") {
+    var scanner = new OffsetWithRecordScannerMock[String, String](
+      records(Some(0), Some(1), Some(3), Some(4), Some(5)))
+    val (offset, size) = scanner.iterateUntilLastOrEmpty(0, 0, null, 3)
+    assert(offset === 4)
+    assert(size === 3)
+  }
+
+  test("Rewind should return the proper count for the end of the iterator") {
+    var scanner = new OffsetWithRecordScannerMock[String, String](
+      records(Some(0), Some(1), Some(2), None))
+    val (offset, size) = scanner.iterateUntilLastOrEmpty(0, 0, null, 3)
+    assert(offset === 3)
+    assert(size === 3)
+  }
+
+  test("Rewind should return the proper count missing data") {
+    var scanner = new OffsetWithRecordScannerMock[String, String](
+      records(Some(0), None))
+    val (offset, size) = scanner.iterateUntilLastOrEmpty(0, 0, null, 2)
+    assert(offset === 1)
+    assert(size === 1)
+  }
+
+  test("Rewind should return the proper count without data") {
+    var scanner = new OffsetWithRecordScannerMock[String, String](
+      records(None))
+    val (offset, size) = scanner.iterateUntilLastOrEmpty(0, 0, null, 2)
+    assert(offset === 0)
+    assert(size === 0)
+  }
+
+  private def records(offsets: Option[Long]*) = {
+    offsets.map(o => o.map(new ConsumerRecord("topic", 0, _, "k", "v"))).toList
+  }
+}
```
--- End diff --

Th
[GitHub] spark pull request #21917: [SPARK-24720][STREAMING-KAFKA] add option to alig...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/21917#discussion_r207721657

--- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/OffsetRange.scala ---
```scala
@@ -90,21 +90,23 @@ final class OffsetRange private(
     val topic: String,
     val partition: Int,
     val fromOffset: Long,
-    val untilOffset: Long) extends Serializable {
+    val untilOffset: Long,
+    val recordNumber: Long) extends Serializable {
```
--- End diff --

Does mima actually complain about binary compatibility if you just make recordNumber count? It's just an accessor either way... If so, and you have to do this, I'd name this recordCount consistently throughout. Number could refer to a lot of things that aren't counts.
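For illustration, one shape the suggestion could take while keeping the old constructor linkable (a sketch only, not the PR's actual code; the real OffsetRange also has factory methods and other members):

```scala
// Sketch: a count-carrying OffsetRange that preserves the old four-argument
// constructor signature for binary compatibility.
final class OffsetRange private(
    val topic: String,
    val partition: Int,
    val fromOffset: Long,
    val untilOffset: Long,
    val recordCount: Long) extends Serializable {

  // Old signature retained: with consecutive offsets the count is simply
  // the width of the offset range.
  private def this(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) =
    this(topic, partition, fromOffset, untilOffset, untilOffset - fromOffset)
}
```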
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94220/ Test FAILed.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Merged build finished. Test FAILed.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16677 **[Test build #94220 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94220/testReport)** for PR 16677 at commit [`69513d1`](https://github.com/apache/spark/commit/69513d166ee56587d7b039b5d9645299785dcb77).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #21917: [SPARK-24720][STREAMING-KAFKA] add option to alig...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/21917#discussion_r207721492

--- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala ---
```scala
@@ -191,6 +211,11 @@ private[kafka010] class InternalKafkaConsumer[K, V](
     buffer.previous()
   }

+  def seekAndPoll(offset: Long, timeout: Long): ConsumerRecords[K, V] = {
```
--- End diff --

Is this used anywhere?
[GitHub] spark pull request #21917: [SPARK-24720][STREAMING-KAFKA] add option to alig...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/21917#discussion_r207721482

--- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala ---
```scala
@@ -183,6 +187,22 @@ private[kafka010] class InternalKafkaConsumer[K, V](
     record
   }

+  /**
+   * Similar to compactedStart but will return None if poll doesn't
```
--- End diff --

Did you mean compactedNext?
[GitHub] spark pull request #21917: [SPARK-24720][STREAMING-KAFKA] add option to alig...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/21917#discussion_r207721435

--- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala ---
```scala
@@ -223,17 +240,46 @@ private[spark] class DirectKafkaInputDStream[K, V](
     }.getOrElse(offsets)
   }

-  override def compute(validTime: Time): Option[KafkaRDD[K, V]] = {
-    val untilOffsets = clamp(latestOffsets())
-    val offsetRanges = untilOffsets.map { case (tp, uo) =>
-      val fo = currentOffsets(tp)
-      OffsetRange(tp.topic, tp.partition, fo, uo)
+  /**
+   * Return the offset range. For non-consecutive offsets the last offset must have a record.
+   * If offsets have missing data (transaction marker or abort), increases the
+   * range until we get the requested number of records or no more records.
+   * Because we have to iterate over all the records in this case,
+   * we also return the total number of records.
+   * @param offsets the target range we would like if offsets were contiguous
+   * @return (totalNumberOfRecords, updated offset)
+   */
+  private def alignRanges(offsets: Map[TopicPartition, Long]): Iterable[OffsetRange] = {
+    if (nonConsecutive) {
+      val localRw = rewinder()
+      val localOffsets = currentOffsets
+      context.sparkContext.parallelize(offsets.toList).mapPartitions(tpos => {
```
--- End diff --

Because this isn't a kafka rdd, it isn't going to take advantage of preferred locations, which means it's going to create cached consumers on different executors.
[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...
Github user koeninger commented on the issue: https://github.com/apache/spark/pull/21917

> How do you know that offset 4 isn't just lost because poll failed?

By failed, you mean returned an empty collection after timing out, even though records should be available? You don't. You also don't know that it isn't just lost because kafka skipped a message. AFAIK from the information you have from a kafka consumer, once you start allowing gaps in offsets, you don't know. I understand your point, but even under your proposal you have no guarantee that the poll won't work in your first pass during RDD construction, and then fail on the executor during computation, right?

> The issue with your proposal is that SeekToEnd gives you the last offset which might not be the last record.

Have you tested comparing the results of consumer.endOffsets for consumers with different isolation levels? Your proposal might end up being the best approach anyway, just because of the unfortunate effect of StreamInputInfo and count, but I want to make sure we think this through.
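For reference, the comparison being suggested could be probed with something like this (a sketch; the broker address, topic name, and group ids are placeholders):

```scala
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Report one partition's end offset as seen under a given isolation level.
def endOffsetFor(isolation: String): Long = {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("group.id", s"probe-$isolation")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("isolation.level", isolation)
  val consumer = new KafkaConsumer[String, String](props)
  try {
    val tp = new TopicPartition("topic", 0)
    consumer.endOffsets(java.util.Arrays.asList(tp)).get(tp)
  } finally consumer.close()
}

// For a partition whose log ends in transaction markers, read_committed may
// report a last-stable-offset-based value while read_uncommitted reports the
// log end offset, which is exactly the discrepancy under discussion.
println(endOffsetFor("read_committed"))
println(endOffsetFor("read_uncommitted"))
```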
[GitHub] spark issue #22000: [SPARK-25025][SQL] Remove the default value of isAll in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22000 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1810/ Test PASSed.
[GitHub] spark issue #22000: [SPARK-25025][SQL] Remove the default value of isAll in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22000 Merged build finished. Test PASSed.
[GitHub] spark issue #22000: [SPARK-25025][SQL] Remove the default value of isAll in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22000 **[Test build #94226 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94226/testReport)** for PR 22000 at commit [`5765098`](https://github.com/apache/spark/commit/57650985eca7de23c6c8cf2812d673702c8e1f3d).
[GitHub] spark pull request #22000: [SPARK-25025][SQL] Remove the default value of is...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22000

[SPARK-25025][SQL] Remove the default value of isAll in INTERSECT/EXCEPT

## What changes were proposed in this pull request?

Having a default value for isAll in the logical plan nodes INTERSECT/EXCEPT could introduce bugs when callers are not aware of it. This PR removes the default value and makes callers specify it explicitly.

## How was this patch tested?

This is a refactoring change. Existing tests already cover the functionality.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark SPARK-25025

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22000.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22000

commit 57650985eca7de23c6c8cf2812d673702c8e1f3d
Author: Dilip Biswal
Date: 2018-08-04T21:32:57Z

    [SPARK-25025] Remove the default value of isAll in INTERSECT/EXCEPT
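In outline, the change amounts to the following (a simplified sketch; the real nodes live in org.apache.spark.sql.catalyst.plans.logical and carry far more than shown here):

```scala
// Sketch only: simplified stand-ins for the Catalyst nodes this PR touches.
sealed trait LogicalPlan
case class Relation(name: String) extends LogicalPlan

// Before: a caller that omitted isAll silently got set (DISTINCT) semantics.
//   case class Intersect(left: LogicalPlan, right: LogicalPlan, isAll: Boolean = false)

// After: every construction site must state set vs. bag (ALL) semantics.
case class Intersect(left: LogicalPlan, right: LogicalPlan, isAll: Boolean) extends LogicalPlan
case class Except(left: LogicalPlan, right: LogicalPlan, isAll: Boolean) extends LogicalPlan

// A call site now reads unambiguously:
val intersectDistinct = Intersect(Relation("t1"), Relation("t2"), isAll = false)
```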
[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...
Github user QuentinAmbard commented on the issue: https://github.com/apache/spark/pull/21917

If you do it in advance you'll change the range: for example, you read until 3 and don't get any extra results. Maybe it's because of a transaction offset, maybe another issue; it's ok in both cases. The big difference is that the next batch will restart from offset 3 and poll from this value. If seeking to 3 and polling gets you another result (for example 6), then everything is fine: it's not data loss, just a gap. The issue with your proposal is that seekToEnd gives you the last offset, which might not be the last record. So in your example, if the last offset is 5 and after a few polls the last record you get is 3, what do you do, continue and execute the next batch from 5? How do you know that offset 4 isn't just lost because the poll failed? The only way to know would be to get a record with an offset higher than 5; in that case you know it's just a gap. But if the message you are reading is the last of the topic, you won't have records higher than 3, so you can't tell if it's a poll failure or an empty offset left by the transaction commit.
[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21898 **[Test build #94225 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94225/testReport)** for PR 21898 at commit [`53aa316`](https://github.com/apache/spark/commit/53aa316bb10344fdec3ed4378f9386a3400fa8cb).
[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21898 Merged build finished. Test PASSed.
[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21898 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1809/ Test PASSed.
[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/21898 test this please
[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/21898 It might be caused by jenkins workload. The barrier test also failed due to timeout: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94214/testReport/org.apache.spark.scheduler/BarrierTaskContextSuite/throw_exception_if_barrier___call_doesn_t_happen_on_every_task/. This means we should increase the timeout in our test to handle heavy workload scenarios.
[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...
Github user koeninger commented on the issue: https://github.com/apache/spark/pull/21917 If the last offset in the range as calculated by the driver is 5, and on the executor all you can poll up to after a repeated attempt is 3, and the user already told you to allowNonConsecutiveOffsets... then you're done, no error. Why does it matter if you do this logic when you're reading all the messages in advance and counting, or when you're actually computing? To put it another way, this PR is a lot of code change and refactoring, why not just change the logic of e.g. how CompactedKafkaRDDIterator interacts with compactedNext?
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1808/ Test PASSed.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Merged build finished. Test PASSed.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21320 **[Test build #94224 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94224/testReport)** for PR 21320 at commit [`a87b589`](https://github.com/apache/spark/commit/a87b589c70af69d25424b03b52e29165ab4e0b5f).
[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/21889#discussion_r207719862

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala ---
```scala
@@ -0,0 +1,205 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.execution.FileSchemaPruningTest
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ParquetSchemaPruningSuite
+    extends QueryTest
+    with ParquetTest
+    with FileSchemaPruningTest
+    with SharedSQLContext {
+  case class FullName(first: String, middle: String, last: String)
+  case class Contact(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
+    relatives: Map[String, FullName] = Map())
+
+  val janeDoe = FullName("Jane", "X.", "Doe")
+  val johnDoe = FullName("John", "Y.", "Doe")
+  val susanSmith = FullName("Susan", "Z.", "Smith")
+
+  val contacts =
+    Contact(0, janeDoe, "123 Main Street", 1, friends = Array(susanSmith),
+      relatives = Map("brother" -> johnDoe)) ::
+    Contact(1, johnDoe, "321 Wall Street", 3, relatives = Map("sister" -> janeDoe)) :: Nil
+
+  case class Name(first: String, last: String)
+  case class BriefContact(id: Int, name: Name, address: String)
+
+  val briefContacts =
+    BriefContact(2, Name("Janet", "Jones"), "567 Maple Drive") ::
+    BriefContact(3, Name("Jim", "Jones"), "6242 Ash Street") :: Nil
+
+  case class ContactWithDataPartitionColumn(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
+    relatives: Map[String, FullName] = Map(),
+    p: Int)
+
+  case class BriefContactWithDataPartitionColumn(id: Int, name: Name, address: String, p: Int)
+
+  val contactsWithDataPartitionColumn =
+    contacts.map { case Contact(id, name, address, pets, friends, relatives) =>
+      ContactWithDataPartitionColumn(id, name, address, pets, friends, relatives, 1) }
+  val briefContactsWithDataPartitionColumn =
+    briefContacts.map { case BriefContact(id, name, address) =>
+      BriefContactWithDataPartitionColumn(id, name, address, 2) }
+
+  testSchemaPruning("select a single complex field") {
+    val query = sql("select name.middle from contacts order by id")
+    checkScanSchemata(query, "struct>")
+    checkAnswer(query, Row("X.") :: Row("Y.") :: Row(null) :: Row(null) :: Nil)
+  }
+
+  testSchemaPruning("select a single complex field and its parent struct") {
+    val query = sql("select name.middle, name from contacts order by id")
+    checkScanSchemata(query, "struct>")
+    checkAnswer(query,
+      Row("X.", Row("Jane", "X.", "Doe")) ::
+      Row("Y.", Row("John", "Y.", "Doe")) ::
+      Row(null, Row("Janet", null, "Jones")) ::
+      Row(null, Row("Jim", null, "Jones")) ::
+      Nil)
+  }
+
+  testSchemaPruning("select a single complex field array and its parent struct array") {
+    val query = sql("select friends.middle, friends from contacts where p=1 order by id")
+    checkScanSchemata(query,
+      "struct>>")
+    checkAnswer(query,
+      Row(Array("Z."), Array(Row("Susan", "Z.", "Smith"))) ::
+      Row(Array.empty[String], Array.empty[Row]) ::
+      Nil)
+  }
+
+  testSchemaPruning("select a single complex field from a map entry and its parent map entry") {
+    val query =
+      sql("select relatives[\"brother\"].middle, relatives[\"brother\"] from contacts where p=1 " +
+        "order by id")
+    checkScanSchemata(query,
+      "struct>>")
+    checkAnswer(query,
+      Row(
```
[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...
Github user QuentinAmbard commented on the issue: https://github.com/apache/spark/pull/21917

I'm not sure I understand your point. The cause of the gap doesn't matter; we just want to stop on an existing offset so we can poll it. It can be because of a transaction marker, a transaction abort, or even just a temporary poll failure; that's not relevant in this case. The driver is smart enough to restart from any offset, even in the middle of a transaction (aborted or not). The issue with a gap at the end is that you can't know whether it's a gap or the poll failed. For example, seekToEnd gives you 5 but the last record you get is 3, and there is no way to know if 4 is missing or just an offset gap. How could we fix that in a different way?
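To make the ambiguity concrete, a sketch of the situation being described (consumer construction omitted; the offsets 3 and 5 mirror the example above):

```scala
import java.time.Duration
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

def showGapAtEnd(consumer: KafkaConsumer[String, String]): Unit = {
  val tp = new TopicPartition("topic", 0)
  consumer.assign(java.util.Arrays.asList(tp))

  consumer.seekToEnd(java.util.Arrays.asList(tp))
  val end = consumer.position(tp) // e.g. 5 -- may sit past the last real record

  consumer.seek(tp, 0L)
  var last = -1L
  val it = consumer.poll(Duration.ofSeconds(5)).records(tp).iterator()
  while (it.hasNext) last = it.next().offset() // e.g. stops at 3

  // last == 3 with end == 5: from this information alone you cannot tell a
  // transaction-marker gap at 4 from records the poll simply failed to return.
  println(s"end=$end last=$last")
}
```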
[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21999 **[Test build #94223 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94223/testReport)** for PR 21999 at commit [`8de1465`](https://github.com/apache/spark/commit/8de14652b838ea053f430d17129c73c85cb2e0cb).
[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21999 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94222/ Test FAILed.
[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21999 **[Test build #94222 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94222/testReport)** for PR 21999 at commit [`5b568c6`](https://github.com/apache/spark/commit/5b568c67951f6f620cd0d549fdbd0c25f819fe43).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public abstract class AbstractLauncher> `
  * `case class ArrayFilter(`
[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21999 Merged build finished. Test FAILed.
[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21999 **[Test build #94222 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94222/testReport)** for PR 21999 at commit [`5b568c6`](https://github.com/apache/spark/commit/5b568c67951f6f620cd0d549fdbd0c25f819fe43).
[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21999 Can one of the admins verify this patch?
[GitHub] spark issue #21999: [WIP][SQL] Flattening nested structures
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21999 Can one of the admins verify this patch?
[GitHub] spark pull request #21999: [WIP][SQL] Flattening nested structures
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/21999

[WIP][SQL] Flattening nested structures

## What changes were proposed in this pull request?

In the PR, I propose a new unary expression `StructFlatten` for flattening nested structures. For example, a dataset with the schema:

```
root
 |-- st: struct (nullable = false)
 |    |-- col1: long (nullable = false)
 |    |-- col2: struct (nullable = false)
 |    |    |-- col3: long (nullable = false)
```

by applying `struct_flatten(st)` will be transformed to:

```
root
 |-- structflatten(st): struct (nullable = false)
 |    |-- col1: long (nullable = false)
 |    |-- col2_col3: long (nullable = false)
```

## How was this patch tested?

Added new tests to `CollectionExpressionsSuite` to check flattening of 2-3 nested structures, and negative tests to be sure that `struct_flatten` doesn't affect other types.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 struct_flatten

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21999.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21999

commit 5603918ae963f78aafb2d1f4f2bd9d566870495b
Author: Maxim Gekk
Date: 2018-08-04T13:38:08Z

    Initial implementation

commit 0be0d059b8bf571068226c515888a64093468cff
Author: Maxim Gekk
Date: 2018-08-04T16:07:45Z

    Making the depth and delimiter as parameters

commit 5666ec372a4b79f6161120584abc0c312b111bfb
Author: Maxim Gekk
Date: 2018-08-04T18:04:23Z

    Test for depth = 0

commit cd88a2125ba6932ba1fdceca1a24d57124a23afa
Author: Maxim Gekk
Date: 2018-08-04T18:21:19Z

    Test for depth = 1

commit b0da02d37ac6db38f63bac95dc295ac37fe4a692
Author: Maxim Gekk
Date: 2018-08-04T18:30:18Z

    Renaming st to struct

commit ec361791b83d71f29823157a2c2b49162ddb5901
Author: Maxim Gekk
Date: 2018-08-04T19:24:37Z

    Negative tests

commit ced63d7f093c168e2bc9457b6c08b87bfe6c0751
Author: Maxim Gekk
Date: 2018-08-04T20:10:00Z

    Register struct_flatten

commit 5b568c67951f6f620cd0d549fdbd0c25f819fe43
Author: Maxim Gekk
Date: 2018-08-04T20:42:00Z

    Merge remote-tracking branch 'origin/master' into struct_flatten

    # Conflicts:
    #	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
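Assuming the PR lands as described, usage would look something like this (hypothetical, since the feature is still WIP; given an active SparkSession `spark`, and going through `selectExpr` because a Scala DSL wrapper isn't part of the description above):

```scala
import org.apache.spark.sql.functions.{lit, struct}

// Build a one-row frame with the nested shape from the PR description.
val df = spark.range(1).select(
  struct(lit(1L).as("col1"), struct(lit(2L).as("col3")).as("col2")).as("st"))

// struct_flatten is the SQL name the PR registers.
df.selectExpr("struct_flatten(st)").printSchema()
// root
//  |-- structflatten(st): struct (nullable = false)
//  |    |-- col1: long (nullable = false)
//  |    |-- col2_col3: long (nullable = false)
```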
[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/21889#discussion_r207718734

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala ---
```scala
@@ -0,0 +1,205 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.execution.FileSchemaPruningTest
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ParquetSchemaPruningSuite
+    extends QueryTest
+    with ParquetTest
+    with FileSchemaPruningTest
+    with SharedSQLContext {
+  case class FullName(first: String, middle: String, last: String)
+  case class Contact(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
+    relatives: Map[String, FullName] = Map())
+
+  val janeDoe = FullName("Jane", "X.", "Doe")
+  val johnDoe = FullName("John", "Y.", "Doe")
+  val susanSmith = FullName("Susan", "Z.", "Smith")
+
+  val contacts =
+    Contact(0, janeDoe, "123 Main Street", 1, friends = Array(susanSmith),
+      relatives = Map("brother" -> johnDoe)) ::
+    Contact(1, johnDoe, "321 Wall Street", 3, relatives = Map("sister" -> janeDoe)) :: Nil
+
+  case class Name(first: String, last: String)
+  case class BriefContact(id: Int, name: Name, address: String)
+
+  val briefContacts =
+    BriefContact(2, Name("Janet", "Jones"), "567 Maple Drive") ::
+    BriefContact(3, Name("Jim", "Jones"), "6242 Ash Street") :: Nil
+
+  case class ContactWithDataPartitionColumn(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
+    relatives: Map[String, FullName] = Map(),
+    p: Int)
+
+  case class BriefContactWithDataPartitionColumn(id: Int, name: Name, address: String, p: Int)
+
+  val contactsWithDataPartitionColumn =
+    contacts.map { case Contact(id, name, address, pets, friends, relatives) =>
+      ContactWithDataPartitionColumn(id, name, address, pets, friends, relatives, 1) }
+  val briefContactsWithDataPartitionColumn =
+    briefContacts.map { case BriefContact(id, name, address) =>
+      BriefContactWithDataPartitionColumn(id, name, address, 2) }
+
+  testSchemaPruning("select a single complex field") {
+    val query = sql("select name.middle from contacts order by id")
+    checkScanSchemata(query, "struct>")
+    checkAnswer(query, Row("X.") :: Row("Y.") :: Row(null) :: Row(null) :: Nil)
+  }
+
+  testSchemaPruning("select a single complex field and its parent struct") {
+    val query = sql("select name.middle, name from contacts order by id")
+    checkScanSchemata(query, "struct>")
+    checkAnswer(query,
+      Row("X.", Row("Jane", "X.", "Doe")) ::
+      Row("Y.", Row("John", "Y.", "Doe")) ::
+      Row(null, Row("Janet", null, "Jones")) ::
+      Row(null, Row("Jim", null, "Jones")) ::
+      Nil)
+  }
+
+  testSchemaPruning("select a single complex field array and its parent struct array") {
+    val query = sql("select friends.middle, friends from contacts where p=1 order by id")
+    checkScanSchemata(query,
+      "struct>>")
+    checkAnswer(query,
+      Row(Array("Z."), Array(Row("Susan", "Z.", "Smith"))) ::
+      Row(Array.empty[String], Array.empty[Row]) ::
+      Nil)
+  }
+
+  testSchemaPruning("select a single complex field from a map entry and its parent map entry") {
+    val query =
+      sql("select relatives[\"brother\"].middle, relatives[\"brother\"] from contacts where p=1 " +
+        "order by id")
+    checkScanSchemata(query,
+      "struct>>")
+    checkAnswer(query,
+      Row(
```
[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user ajacques commented on a diff in the pull request: https://github.com/apache/spark/pull/21889#discussion_r207718713

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala ---
```scala
@@ -0,0 +1,205 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.execution.FileSchemaPruningTest
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ParquetSchemaPruningSuite
+    extends QueryTest
+    with ParquetTest
+    with FileSchemaPruningTest
+    with SharedSQLContext {
+  case class FullName(first: String, middle: String, last: String)
+  case class Contact(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
+    relatives: Map[String, FullName] = Map())
+
+  val janeDoe = FullName("Jane", "X.", "Doe")
+  val johnDoe = FullName("John", "Y.", "Doe")
+  val susanSmith = FullName("Susan", "Z.", "Smith")
+
+  val contacts =
+    Contact(0, janeDoe, "123 Main Street", 1, friends = Array(susanSmith),
+      relatives = Map("brother" -> johnDoe)) ::
+    Contact(1, johnDoe, "321 Wall Street", 3, relatives = Map("sister" -> janeDoe)) :: Nil
+
+  case class Name(first: String, last: String)
+  case class BriefContact(id: Int, name: Name, address: String)
+
+  val briefContacts =
+    BriefContact(2, Name("Janet", "Jones"), "567 Maple Drive") ::
+    BriefContact(3, Name("Jim", "Jones"), "6242 Ash Street") :: Nil
+
+  case class ContactWithDataPartitionColumn(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
+    relatives: Map[String, FullName] = Map(),
+    p: Int)
+
+  case class BriefContactWithDataPartitionColumn(id: Int, name: Name, address: String, p: Int)
+
+  val contactsWithDataPartitionColumn =
+    contacts.map { case Contact(id, name, address, pets, friends, relatives) =>
+      ContactWithDataPartitionColumn(id, name, address, pets, friends, relatives, 1) }
+  val briefContactsWithDataPartitionColumn =
+    briefContacts.map { case BriefContact(id, name, address) =>
+      BriefContactWithDataPartitionColumn(id, name, address, 2) }
+
+  testSchemaPruning("select a single complex field") {
+    val query = sql("select name.middle from contacts order by id")
+    checkScanSchemata(query, "struct>")
+    checkAnswer(query, Row("X.") :: Row("Y.") :: Row(null) :: Row(null) :: Nil)
+  }
+
+  testSchemaPruning("select a single complex field and its parent struct") {
+    val query = sql("select name.middle, name from contacts order by id")
+    checkScanSchemata(query, "struct>")
+    checkAnswer(query,
+      Row("X.", Row("Jane", "X.", "Doe")) ::
+      Row("Y.", Row("John", "Y.", "Doe")) ::
+      Row(null, Row("Janet", null, "Jones")) ::
+      Row(null, Row("Jim", null, "Jones")) ::
+      Nil)
+  }
+
+  testSchemaPruning("select a single complex field array and its parent struct array") {
+    val query = sql("select friends.middle, friends from contacts where p=1 order by id")
+    checkScanSchemata(query,
+      "struct>>")
+    checkAnswer(query,
+      Row(Array("Z."), Array(Row("Susan", "Z.", "Smith"))) ::
+      Row(Array.empty[String], Array.empty[Row]) ::
+      Nil)
+  }
+
+  testSchemaPruning("select a single complex field from a map entry and its parent map entry") {
+    val query =
+      sql("select relatives[\"brother\"].middle, relatives[\"brother\"] from contacts where p=1 " +
+        "order by id")
+    checkScanSchemata(query,
+      "struct>>")
+    checkAnswer(query,
+      Row
```
[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21102 **[Test build #94221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94221/testReport)** for PR 21102 at commit [`6fba1ee`](https://github.com/apache/spark/commit/6fba1ee8c3525a6f34bf5580737d067a8f0d976d).
[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21102 Merged build finished. Test PASSed.
[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21102 retest this please
[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21102 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1807/ Test PASSed.
[GitHub] spark issue #21917: [SPARK-24720][STREAMING-KAFKA] add option to align range...
Github user koeninger commented on the issue: https://github.com/apache/spark/pull/21917 Still playing devil's advocate here, I don't think stopping at 3 in your example actually tells you anything about the cause of the gaps in the sequence at 4. I'm not sure you can know that the gap is because of a transaction marker, without a modified kafka consumer library. If the actual problem is that when allowNonConsecutiveOffsets is set we need to allow gaps even at the end of an offset range... why not just fix that directly? Master is updated to kafka 2.0 at this point, so we should be able to write a test for your original jira example of a partition consisting of 1 message followed by 1 transaction commit.
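Producing the jira example's partition layout is straightforward with the 2.0 client; a sketch (broker address, topic, and transactional.id are placeholders):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("transactional.id", "offset-gap-test")

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()
producer.beginTransaction()
producer.send(new ProducerRecord("topic", "k", "v")) // the record lands at offset 0
producer.commitTransaction()                         // the commit marker occupies offset 1
producer.close()
// A read_committed consumer of this partition now sees an end offset past the
// last record it can ever poll -- the gap-at-the-end case under discussion.
```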
[GitHub] spark pull request #21987: [SPARK-25015][BUILD] Update Hadoop 2.7 to 2.7.7
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21987
[GitHub] spark issue #21987: [SPARK-25015][BUILD] Update Hadoop 2.7 to 2.7.7
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21987 Merged to master/2.3
[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...
Github user skambha commented on a diff in the pull request: https://github.com/apache/spark/pull/17185#discussion_r207718090

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala ---
```scala
@@ -169,25 +181,50 @@ package object expressions {
     })
   }

-  // Find matches for the given name assuming that the 1st part is a qualifier (i.e. table name,
-  // alias, or subquery alias) and the 2nd part is the actual name. This returns a tuple of
+  // Find matches for the given name assuming that the 1st two parts are qualifier
+  // (i.e. database name and table name) and the 3rd part is the actual column name.
+  //
+  // For example, consider an example where "db1" is the database name, "a" is the table name
+  // and "b" is the column name and "c" is the struct field name.
+  // If the name parts is db1.a.b.c, then Attribute will match
```
--- End diff --

In this case, if a.b.c fails to resolve as db.table.column, then we check if there is a table and column that matches a.b and then see if c is a nested field name and if it exists, it will resolve to the nested field. Tests with struct nested fields are [here](https://github.com/skambha/spark/blob/90cd6d33f59fdae16a3a386ed14cefb3f28d35a8/sql/core/src/test/resources/sql-tests/inputs/columnresolution.sql#L65-L73).
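A worked example of that lookup order, using the names from the comment (illustrative only; it assumes this patch is applied, an active SparkSession `spark`, and a table layout invented for the example):

```scala
spark.sql("CREATE DATABASE IF NOT EXISTS db1")
spark.sql("CREATE TABLE IF NOT EXISTS db1.a (b STRUCT<c: INT>) USING parquet")

// For db1.a.b.c, the resolver matches db1.a.b as database.table.column,
// then treats `c` as a nested field of the struct column `b`.
spark.sql("SELECT db1.a.b.c FROM db1.a").printSchema()

// For a.b.c, resolution as db.table.column is tried first; failing that,
// a.b is matched as table.column and `c` again as the nested field.
spark.sql("USE db1")
spark.sql("SELECT a.b.c FROM a").printSchema()
```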
[GitHub] spark issue #21987: [SPARK-25015][BUILD] Update Hadoop 2.7 to 2.7.7
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21987 **[Test build #4234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4234/testReport)** for PR 21987 at commit [`654b918`](https://github.com/apache/spark/commit/654b918fc536de67c88e1bc4d618d27aa4fb76f1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #21997: [SPARK-24987][SS] - Fix Kafka consumer leak when ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21997 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...
Github user skambha commented on a diff in the pull request: https://github.com/apache/spark/pull/17185#discussion_r207717569

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala ---
@@ -316,8 +345,8 @@ case class UnresolvedRegex(regexPattern: String, table: Option[String], caseSens
       // If there is no table specified, use all input attributes that match expr
       case None => input.output.filter(_.name.matches(pattern))
       // If there is a table, pick out attributes that are part of this table that match expr
-      case Some(t) => input.output.filter(_.qualifier.exists(resolver(_, t)))
-        .filter(_.name.matches(pattern))
+      case Some(t) => input.output.filter(a => resolver(a.qualifier.last, t)).
--- End diff --

Sure. Will add a defensive check.
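A minimal, self-contained sketch of the defensive check being discussed, with simplified stand-in types (`Attr` and `Resolver` are illustrative, not Catalyst's): skip attributes whose qualifier is empty instead of calling `.last` on an empty sequence.

```scala
object QualifierCheckSketch {
  type Resolver = (String, String) => Boolean
  case class Attr(name: String, qualifier: Seq[String])

  // Only attributes with a non-empty qualifier are considered, so .last is always safe.
  def matchTable(input: Seq[Attr], t: String, pattern: String, resolver: Resolver): Seq[Attr] =
    input.filter(a => a.qualifier.nonEmpty && resolver(a.qualifier.last, t))
         .filter(_.name.matches(pattern))
}
```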
[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21898 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94214/ Test FAILed.
[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21898 Merged build finished. Test FAILed.
[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...
Github user skambha commented on a diff in the pull request: https://github.com/apache/spark/pull/17185#discussion_r207717536

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala ---
@@ -536,12 +536,13 @@ abstract class SessionCatalogSuite extends AnalysisTest {
       assert(metadata.viewText.isDefined)
       val view = View(desc = metadata, output = metadata.schema.toAttributes,
         child = CatalystSqlParser.parsePlan(metadata.viewText.get))
+      val alias = AliasIdentifier("view1", Some("db3"))
--- End diff --

Sure, let me do that if this style is preferred. Thanks.
[GitHub] spark issue #21997: [SPARK-24987][SS] - Fix Kafka consumer leak when no new ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21997 Merged build finished. Test PASSed.
[GitHub] spark issue #21997: [SPARK-24987][SS] - Fix Kafka consumer leak when no new ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21997 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94218/ Test PASSed.
[GitHub] spark issue #21997: [SPARK-24987][SS] - Fix Kafka consumer leak when no new ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21997 **[Test build #94218 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94218/testReport)** for PR 21997 at commit [`7558d42`](https://github.com/apache/spark/commit/7558d422ae24daf9d3cffc43b5ef3d975c4c9d3a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class ArrayFilter(`
[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21898 **[Test build #94214 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94214/testReport)** for PR 21898 at commit [`53aa316`](https://github.com/apache/spark/commit/53aa316bb10344fdec3ed4378f9386a3400fa8cb).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21980: [SPARK-25010][SQL] Rand/Randn should produce different v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21980 Merged build finished. Test PASSed.
[GitHub] spark issue #21980: [SPARK-25010][SQL] Rand/Randn should produce different v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21980 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94215/ Test PASSed.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Merged build finished. Test PASSed.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1806/ Test PASSed.
[GitHub] spark issue #21980: [SPARK-25010][SQL] Rand/Randn should produce different v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21980 **[Test build #94215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94215/testReport)** for PR 21980 at commit [`39db5aa`](https://github.com/apache/spark/commit/39db5aa65221a401179a47ca58a9f32762ee1509).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class Uuid(randomSeed: Option[Long] = None) extends LeafExpression with Stateful`
   * `trait ExpressionWithRandomSeed `
   * `case class Rand(child: Expression) extends RDG with ExpressionWithRandomSeed `
   * `case class Randn(child: Expression) extends RDG with ExpressionWithRandomSeed `
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20838 **[Test build #94219 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94219/testReport)** for PR 20838 at commit [`e41a8cc`](https://github.com/apache/spark/commit/e41a8cc303eb428fe99bacae3b9c87be57b03125).
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16677 **[Test build #94220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94220/testReport)** for PR 16677 at commit [`69513d1`](https://github.com/apache/spark/commit/69513d166ee56587d7b039b5d9645299785dcb77).
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20838 jenkins, retest this please
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16677 retest this please.
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20838 looks to me like everything passes
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Merged build finished. Test FAILed.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94216/ Test FAILed.
[GitHub] spark pull request #21996: [SPARK-24888][CORE] spark-submit --master spark:/...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21996#discussion_r207717051

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -98,17 +98,24 @@ private[spark] class SparkSubmit extends Logging {
    * Kill an existing submission using the REST protocol. Standalone and Mesos cluster mode only.
    */
   private def kill(args: SparkSubmitArguments): Unit = {
-    new RestSubmissionClient(args.master)
-      .killSubmission(args.submissionToKill)
+    createRestSubmissionClient(args).killSubmission(args.submissionToKill)
   }

   /**
    * Request the status of an existing submission using the REST protocol.
    * Standalone and Mesos cluster mode only.
    */
   private def requestStatus(args: SparkSubmitArguments): Unit = {
-    new RestSubmissionClient(args.master)
-      .requestSubmissionStatus(args.submissionToRequestStatusFor)
+    createRestSubmissionClient(args).requestSubmissionStatus(args.submissionToRequestStatusFor)
   }

+  /**
+   * Creates RestSubmissionClient with overridden logInfo()
+   */
+  private def createRestSubmissionClient(args: SparkSubmitArguments): RestSubmissionClient = {
+    new RestSubmissionClient(args.master) {
+      override protected def logInfo(msg: => String): Unit = printMessage(msg)
--- End diff --

this is not necessarily always the case - can't the user just configure the log level?
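A generic sketch of the technique the quoted diff uses (illustrative names, not Spark's classes): an anonymous subclass overrides a protected logging method, so output is redirected without touching logger configuration.

```scala
object LogRedirectSketch extends App {
  class Client {
    protected def logInfo(msg: => String): Unit = println(s"INFO $msg") // default sink
    def run(): Unit = logInfo("submission accepted")
  }

  // The anonymous subclass swaps the sink; callers of run() are unaffected.
  val client = new Client {
    override protected def logInfo(msg: => String): Unit = Console.err.println(msg)
  }
  client.run() // now writes to stderr instead of the default sink
}
```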
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16677 **[Test build #94216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94216/testReport)** for PR 16677 at commit [`69513d1`](https://github.com/apache/spark/commit/69513d166ee56587d7b039b5d9645299785dcb77).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21996: [SPARK-24888][CORE] spark-submit --master spark://host:p...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21996 I think we generally describe the change in the PR title; what the user sees can go in the JIRA title.
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21977

build error
```
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala:88: method sparkContext in class SparkPlan cannot be accessed in org.apache.spark.sql.execution.SparkPlan
[error]  Access to protected method sparkContext not permitted because
[error]  prefix type org.apache.spark.sql.execution.SparkPlan does not conform to
[error]  class ArrowEvalPythonExec in package python where the access takes place
[error]       child.sparkContext.conf).compute(batchIter, context.partitionId(), context)
[error]                 ^
```
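The error above is Scala's protected-access rule: a subclass may access a protected member of another instance only through a prefix whose type conforms to the accessing class. A minimal reproduction with illustrative class names (not Spark's):

```scala
class Plan { protected def sparkContext: String = "sc" }

class EvalExec(child: Plan) extends Plan {
  // def bad: String = child.sparkContext  // won't compile: prefix type Plan
  //                                       // does not conform to class EvalExec
  def ok(other: EvalExec): String = other.sparkContext // fine: prefix is an EvalExec
}
```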
[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21977#discussion_r207716854

--- Diff: python/pyspark/worker.py ---
@@ -259,6 +260,26 @@ def main(infile, outfile):
                 "PYSPARK_DRIVER_PYTHON are correctly set.") %
                 ("%d.%d" % sys.version_info[:2], version))

+    # set up memory limits
+    memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
+    total_memory = resource.RLIMIT_AS
+    try:
+        (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
+        msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
+        sys.stderr.write(msg)
--- End diff --

seems like the pattern `print(msg, file=sys.stderr)` is used here
[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21977#discussion_r207716877

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -133,10 +133,17 @@ private[yarn] class YarnAllocator(
   // Additional memory overhead.
   protected val memoryOverhead: Int = sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
     math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)).toInt
+  protected val pysparkWorkerMemory: Int = if (sparkConf.get(IS_PYTHON_APP)) {
+    sparkConf.get(PYSPARK_EXECUTOR_MEMORY).map(_.toInt).getOrElse(0)
--- End diff --

nit: default to -1 to be consistent?
[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21977#discussion_r207716893

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -133,10 +133,17 @@ private[yarn] class YarnAllocator(
   // Additional memory overhead.
   protected val memoryOverhead: Int = sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
     math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)).toInt
+  protected val pysparkWorkerMemory: Int = if (sparkConf.get(IS_PYTHON_APP)) {
+    sparkConf.get(PYSPARK_EXECUTOR_MEMORY).map(_.toInt).getOrElse(0)
--- End diff --

or just use 0 in worker.py too
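A hedged sketch of the container sizing the quoted diff implies: the YARN request is presumably the sum of the executor heap, the memory overhead, and the new PySpark worker memory. The values below are illustrative, and the overhead factor/minimum mirror the defaults visible in the diff; the real computation lives in YarnAllocator.

```scala
object ContainerSizingSketch extends App {
  val executorMemoryMb = 4096 // spark.executor.memory (illustrative)
  // Default overhead: max(10% of executor memory, 384 MiB), as in the quoted code.
  val memoryOverheadMb = math.max((0.10 * executorMemoryMb).toInt, 384)
  val pysparkWorkerMemoryMb = 512 // spark.executor.pyspark.memory; 0 when unset
  val containerMemoryMb = executorMemoryMb + memoryOverheadMb + pysparkWorkerMemoryMb
  println(s"requesting $containerMemoryMb MiB per container") // 4096 + 409 + 512 = 5017
}
```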
[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21977#discussion_r207716813

--- Diff: python/pyspark/worker.py ---
@@ -259,6 +260,26 @@ def main(infile, outfile):
                 "PYSPARK_DRIVER_PYTHON are correctly set.") %
                 ("%d.%d" % sys.version_info[:2], version))

+    # set up memory limits
+    memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
+    total_memory = resource.RLIMIT_AS
+    try:
+        (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
+        msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
+        sys.stderr.write(msg)
+
+        if memory_limit_mb > 0 and total_memory_limit == resource.RLIM_INFINITY:
+            # convert to bytes
+            total_memory_limit = memory_limit_mb * 1024 * 1024
+
+            msg = "Setting mem to {0} of max {1}\n".format(total_memory_limit, max_total_memory)
+            sys.stderr.write(msg)
+            resource.setrlimit(total_memory, (total_memory_limit, total_memory_limit))
+
+    except (resource.error, OSError) as e:
+        # not all systems support resource limits, so warn instead of failing
+        sys.stderr.write("WARN: Failed to set memory limit: {0}\n".format(e))
--- End diff --

also catch ValueError, for the case where the hard limit can't be set (if it is otherwise set)
[GitHub] spark issue #21998: [SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesce...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21998 Can one of the admins verify this patch?
[GitHub] spark issue #21919: [SPARK-24933][SS] Report numOutputRows in SinkProgress
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21919 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94217/ Test FAILed.
[GitHub] spark issue #21919: [SPARK-24933][SS] Report numOutputRows in SinkProgress
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21919 Merged build finished. Test FAILed.
[GitHub] spark issue #21919: [SPARK-24933][SS] Report numOutputRows in SinkProgress
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21919 **[Test build #94217 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94217/testReport)** for PR 21919 at commit [`fde6053`](https://github.com/apache/spark/commit/fde6053f551ce292c486e2669e2ada50b61cc68b).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `trait StreamWriterProgressCollector `
   * `class MicroBatchWriter(batchId: Long, writer: StreamWriter) extends DataSourceWriter`
[GitHub] spark pull request #21998: [SPARK-24940][SQL] Use IntegerLiteral in ResolveC...
GitHub user jzhuge opened a pull request: https://github.com/apache/spark/pull/21998

[SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesceHints

## What changes were proposed in this pull request?

Follow up to fix an unmerged review comment.

## How was this patch tested?

Unit test ResolveHintsSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jzhuge/spark SPARK-24940

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21998.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21998

commit 8f0c57804e75b2a74f7573a21f6c7c63f7b85e03
Author: John Zhuge
Date: 2018-08-04T19:01:46Z

    [SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesceHints
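A hedged sketch of the pattern the title refers to: an `IntegerLiteral`-style extractor matches a literal holding an Int, so a hint parameter can be pulled out in one case clause. The types here are simplified stand-ins, not the actual Catalyst classes or the ResolveCoalesceHints code.

```scala
sealed trait Expr
case class Literal(value: Any) extends Expr

// Extractor that succeeds only when the literal's value is an Int.
object IntegerLiteral {
  def unapply(e: Expr): Option[Int] = e match {
    case Literal(i: Int) => Some(i)
    case _ => None
  }
}

object ResolveHintSketch {
  def numPartitions(param: Expr): Int = param match {
    case IntegerLiteral(n) => n // one clause instead of a nested match on Literal
    case other => throw new IllegalArgumentException(s"expected integer literal, got $other")
  }
}
```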