[GitHub] [spark] AmplabJenkins removed a comment on issue #24761: [SPARK-27905] [SQL] Add higher order function 'forall'
AmplabJenkins removed a comment on issue #24761: [SPARK-27905] [SQL] Add higher order function 'forall' URL: https://github.com/apache/spark/pull/24761#issuecomment-518839775 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25309: [SPARK-28577][YARN]Resource capability requested for each executor add offHeapMemorySize
AmplabJenkins commented on issue #25309: [SPARK-28577][YARN]Resource capability requested for each executor add offHeapMemorySize URL: https://github.com/apache/spark/pull/25309#issuecomment-518841025 Can one of the admins verify this patch?
[GitHub] [spark] vanzin commented on a change in pull request #25341: [SPARK-28607][CORE][SHUFFLE] Don't store partition lengths twice.
vanzin commented on a change in pull request #25341: [SPARK-28607][CORE][SHUFFLE] Don't store partition lengths twice. URL: https://github.com/apache/spark/pull/25341#discussion_r311271811 ## File path: core/src/main/java/org/apache/spark/shuffle/api/ShufflePartitionWriter.java ## @@ -85,14 +85,4 @@ default Optional openChannelWrapper() throws IOException { return Optional.empty(); } - - /** - * Returns the number of bytes written either by this writer's output stream opened by - * {@link #openStream()} or the byte channel opened by {@link #openChannelWrapper()}. - * - * This can be different from the number of bytes given by the caller. For example, the - * stream might compress or encrypt the bytes before persisting the data to the backing - * data store. - */ - long getNumBytesWritten(); Review comment: Could you update the metrics after the commit (by adding up all the partition lengths)? Otherwise it doesn't seem horrible to keep this in the API.
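The alternative the reviewer sketches (reporting write metrics once at commit time by summing the per-partition lengths, rather than asking each partition writer for `getNumBytesWritten()`) can be illustrated with a minimal sketch. The class and interface names below are illustrative stand-ins, not Spark's actual shuffle API:

```java
import java.util.Arrays;

// Minimal sketch (assumed names, not the actual Spark classes): instead of
// querying each ShufflePartitionWriter for its byte count, the committer
// sums the per-partition lengths once after commit and reports the total.
public class CommitMetricsSketch {
    interface WriteMetricsReporter {
        void incBytesWritten(long bytes);
    }

    static long commit(long[] partitionLengths, WriteMetricsReporter metrics) {
        // Sum the lengths of every partition produced by the map task.
        long totalBytes = Arrays.stream(partitionLengths).sum();
        metrics.incBytesWritten(totalBytes);
        return totalBytes;
    }

    public static void main(String[] args) {
        long[] lengths = {10L, 0L, 32L};
        long total = commit(lengths,
            bytes -> System.out.println("bytesWritten += " + bytes)); // prints "bytesWritten += 42"
        System.out.println(total); // prints 42
    }
}
```

This matches the review's point: the partition lengths are already collected for the map output tracker, so a separate per-writer byte counter in the public API may be redundant.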
[GitHub] [spark] dongjoon-hyun commented on issue #25366: [SPARK-27918][SQL][TEST][FOLLOW-UP] Open comment about boolean test.
dongjoon-hyun commented on issue #25366: [SPARK-27918][SQL][TEST][FOLLOW-UP] Open comment about boolean test. URL: https://github.com/apache/spark/pull/25366#issuecomment-518842348 @beliefer . Please regenerate the result. That should fix the UT failures.
[GitHub] [spark] dongjoon-hyun commented on issue #25366: [SPARK-27924][SQL][TEST][FOLLOW-UP] Open comment about boolean test
dongjoon-hyun commented on issue #25366: [SPARK-27924][SQL][TEST][FOLLOW-UP] Open comment about boolean test URL: https://github.com/apache/spark/pull/25366#issuecomment-518844007 BTW, @beliefer, this should be a follow-up of your earlier PR ([SPARK-27924][SQL] Support ANSI SQL Boolean-Predicate syntax), so I updated the PR title. I also changed the `How was this patch tested?` section: since you updated test cases that are verified by Jenkins, it should not be 'N/A'.
[GitHub] [spark] ueshin commented on issue #24761: [SPARK-27905] [SQL] Add higher order function 'forall'
ueshin commented on issue #24761: [SPARK-27905] [SQL] Add higher order function 'forall' URL: https://github.com/apache/spark/pull/24761#issuecomment-518850060 Thanks! merging to master.
[GitHub] [spark] vanzin commented on a change in pull request #25369: [SPARK-28638][WebUI] Task summary metrics are wrong when there are running tasks
vanzin commented on a change in pull request #25369: [SPARK-28638][WebUI] Task summary metrics are wrong when there are running tasks URL: https://github.com/apache/spark/pull/25369#discussion_r311282837 ## File path: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala ## @@ -156,7 +156,8 @@ private[spark] class AppStatusStore( // cheaper for disk stores (avoids deserialization). val count = { Utils.tryWithResource( -if (store.isInstanceOf[InMemoryStore]) { +if (store.isInstanceOf[ElementTrackingStore]) { Review comment: I think a new method `private def isLiveUI: Boolean` would make the intent here clearer. It can be this instance check or just checking whether `listener` is defined.
[GitHub] [spark] ueshin closed pull request #24761: [SPARK-27905] [SQL] Add higher order function 'forall'
ueshin closed pull request #24761: [SPARK-27905] [SQL] Add higher order function 'forall' URL: https://github.com/apache/spark/pull/24761
[GitHub] [spark] AmplabJenkins removed a comment on issue #13627: [SPARK-15906][MLlib] Add complementary naive bayes algorithm
AmplabJenkins removed a comment on issue #13627: [SPARK-15906][MLlib] Add complementary naive bayes algorithm URL: https://github.com/apache/spark/pull/13627#issuecomment-406583815 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on issue #13627: [SPARK-15906][MLlib] Add complementary naive bayes algorithm
AmplabJenkins commented on issue #13627: [SPARK-15906][MLlib] Add complementary naive bayes algorithm URL: https://github.com/apache/spark/pull/13627#issuecomment-518854134 Can one of the admins verify this patch?
[GitHub] [spark] huaxingao opened a new pull request #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
huaxingao opened a new pull request #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25371 ## What changes were proposed in this pull request? This PR adds some tests converted from ```pgSQL/join.sql``` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921). Diff comparing to 'join.sql' ```diff diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/join.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-join.sql.out index 0730066719..4e8ab9d71a 100644 --- a/sql/core/src/test/resources/sql-tests/results/pgSQL/join.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-join.sql.out @@ -240,10 +240,10 @@ struct<> -- !query 27 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t) FROM J1_TBL AS tx -- !query 27 schema -struct +struct -- !query 27 output 0 NULLzero 1 4 one @@ -259,10 +259,10 @@ struct -- !query 28 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t) FROM J1_TBL tx -- !query 28 schema -struct +struct -- !query 28 output 0 NULLzero 1 4 one @@ -278,10 +278,10 @@ struct -- !query 29 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(a), udf(b), udf(c) FROM J1_TBL AS t1 (a, b, c) -- !query 29 schema -struct +struct -- !query 29 output 0 NULLzero 1 4 one @@ -297,10 +297,10 @@ struct -- !query 30 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(a), udf(b), udf(c) FROM J1_TBL t1 (a, b, c) -- !query 30 schema -struct +struct -- !query 30 output 0 NULLzero 1 4 one @@ -316,10 +316,10 @@ struct -- !query 31 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(a), udf(b), udf(c), udf(d), udf(e) FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e) -- !query 31 schema -struct +struct -- !query 31 output 0 NULLzero0 NULL 0 NULLzero1 -1 @@ -423,7 +423,7 @@ struct -- !query 32 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, * 
FROM J1_TBL CROSS JOIN J2_TBL -- !query 32 schema struct @@ -530,20 +530,20 @@ struct -- !query 33 -SELECT '' AS `xxx`, i, k, t +SELECT udf('') AS `xxx`, udf(i), udf(k), udf(t) FROM J1_TBL CROSS JOIN J2_TBL -- !query 33 schema struct<> -- !query 33 output org.apache.spark.sql.AnalysisException -Reference 'i' is ambiguous, could be: default.j1_tbl.i, default.j2_tbl.i.; line 1 pos 20 +Reference 'i' is ambiguous, could be: default.j1_tbl.i, default.j2_tbl.i.; line 1 pos 29 -- !query 34 -SELECT '' AS `xxx`, t1.i, k, t +SELECT udf('') AS `xxx`, udf(t1.i), udf(k), udf(t) FROM J1_TBL t1 CROSS JOIN J2_TBL t2 -- !query 34 schema -struct +struct -- !query 34 output 0 -1 zero 0 -3 zero @@ -647,11 +647,11 @@ struct -- !query 35 -SELECT '' AS `xxx`, ii, tt, kk +SELECT udf('') AS `xxx`, udf(ii), udf(tt), udf(kk) FROM (J1_TBL CROSS JOIN J2_TBL) AS tx (ii, jj, tt, ii2, kk) -- !query 35 schema -struct +struct -- !query 35 output 0 zero-1 0 zero-3 @@ -755,10 +755,10 @@ struct -- !query 36 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(j1_tbl.i), udf(j), udf(t), udf(a.i), udf(a.k), udf(b.i), udf(b.k) FROM J1_TBL CROSS JOIN J2_TBL a CROSS JOIN J2_TBL b -- !query 36 schema -struct +struct -- !query 36 output 0 NULLzero0 NULL0 NULL 0 NULLzero0 NULL1 -1 @@ -1654,10 +1654,10 @@ struct -- !query 37 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(k) FROM J1_TBL INNER JOIN J2_TBL USING (i) -- !query 37 schema -struct +struct -- !query 37 output 0 NULLzeroNULL 1 4 one -1 @@ -1669,10 +1669,10 @@ struct -- !query 38 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(k) FROM J1_TBL JOIN J2_TBL USING (i) -- !query 38 schema -struct +struct -- !query 38 output 0 NULLzeroNULL 1 4 one -1 @@ -1684,9 +1684,9 @@ struct -- !query 39 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, * FROM J1_TBL t1 (a, b, c) JOIN J2_TBL t2 (a,
[GitHub] [spark] huaxingao commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
huaxingao commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25371#issuecomment-518855906 @HyukjinKwon @viirya Sorry I somehow messed up the original PR, so I closed it and opened a new one.
[GitHub] [spark] AmplabJenkins commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
AmplabJenkins commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25371#issuecomment-518855909 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
AmplabJenkins commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25371#issuecomment-518855919 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13814/ Test PASSed.
[GitHub] [spark] SparkQA commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
SparkQA commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25371#issuecomment-518856479 **[Test build #108732 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108732/testReport)** for PR 25371 at commit [`b7aad47`](https://github.com/apache/spark/commit/b7aad47c0121501343729c7d73a75df74722267e).
[GitHub] [spark] AmplabJenkins removed a comment on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
AmplabJenkins removed a comment on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25371#issuecomment-518855909 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
AmplabJenkins removed a comment on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25371#issuecomment-518855919 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13814/ Test PASSed.
[GitHub] [spark] vanzin commented on a change in pull request #25304: [SPARK-28570][CORE][SHUFFLE] Make UnsafeShuffleWriter use the new API.
vanzin commented on a change in pull request #25304: [SPARK-28570][CORE][SHUFFLE] Make UnsafeShuffleWriter use the new API. URL: https://github.com/apache/spark/pull/25304#discussion_r311290914 ## File path: core/src/main/java/org/apache/spark/shuffle/api/ShuffleExecutorComponents.java ## @@ -39,17 +40,39 @@ /** * Called once per map task to create a writer that will be responsible for persisting all the * partitioned bytes written by that map task. - * @param shuffleId Unique identifier for the shuffle the map task is a part of + * + * @param shuffleId Unique identifier for the shuffle the map task is a part of * @param mapId Within the shuffle, the identifier of the map task * @param mapTaskAttemptId Identifier of the task attempt. Multiple attempts of the same map task - * with the same (shuffleId, mapId) pair can be distinguished by the - * different values of mapTaskAttemptId. + * with the same (shuffleId, mapId) pair can be distinguished by the + * different values of mapTaskAttemptId. * @param numPartitions The number of partitions that will be written by the map task. Some of -* these partitions may be empty. + * these partitions may be empty. */ ShuffleMapOutputWriter createMapOutputWriter( int shuffleId, int mapId, long mapTaskAttemptId, int numPartitions) throws IOException; + + /** + * An optional extension for creating a map output writer that can optimize the transfer of a + * single partition file, as the entire result of a map task, to the backing store. + * + * Most implementations should return the default {@link Optional#empty()} to indicate that + * they do not support this optimization. This primarily is for backwards-compatibility in + * preserving an optimization in the local disk shuffle storage implementation. + * + * @param shuffleId Unique identifier for the shuffle the map task is a part of + * @param mapId Within the shuffle, the identifier of the map task + * @param mapTaskAttemptId Identifier of the task attempt. 
Multiple attempts of the same map task + * with the same (shuffleId, mapId) pair can be distinguished by the + * different values of mapTaskAttemptId. + */ + default Optional createSingleFileMapOutputWriter( Review comment: This feels like a lot of indirection to implement one method... what about returning a boolean if the transfer is supported, or throwing `UnsupportedOperationException` (although that's a bit slower)? (If the transfer is supported but fails you'd still throw an `IOException`.)
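The two API shapes under discussion can be contrasted with a short sketch. All interface and method names below are hypothetical stand-ins for the Spark API being reviewed, and the signatures are simplified:

```java
import java.util.Optional;

// Hypothetical sketch contrasting the two API shapes discussed in the review;
// names and signatures are illustrative, not Spark's actual shuffle API.
public class SingleFileWriterApiSketch {
    interface SingleFileMapOutputWriter { }

    // Shape 1 (the PR's proposal): an Optional-returning default method;
    // implementations that don't support the optimization return empty.
    interface ComponentsWithOptional {
        default Optional<SingleFileMapOutputWriter> createSingleFileMapOutputWriter(
                int shuffleId, int mapId) {
            return Optional.empty();
        }
    }

    // Shape 2 (the reviewer's alternative): a capability flag plus a plain
    // factory method that throws when the optimization is unsupported.
    interface ComponentsWithFlag {
        default boolean supportsSingleFileTransfer() {
            return false;
        }

        default SingleFileMapOutputWriter createSingleFileMapOutputWriter(
                int shuffleId, int mapId) {
            throw new UnsupportedOperationException("single-file transfer not supported");
        }
    }

    public static void main(String[] args) {
        ComponentsWithOptional a = new ComponentsWithOptional() { };
        System.out.println(a.createSingleFileMapOutputWriter(0, 0).isPresent()); // prints false

        ComponentsWithFlag b = new ComponentsWithFlag() { };
        System.out.println(b.supportsSingleFileTransfer()); // prints false
    }
}
```

The trade-off the review raises: the Optional shape keeps success and "not supported" in one return value, while the flag shape is less indirection for callers but splits the contract across two methods (and the exception path is slightly slower).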
[GitHub] [spark] dongjoon-hyun commented on issue #25366: [SPARK-27924][SQL][TEST][FOLLOW-UP] Open comment about boolean test
dongjoon-hyun commented on issue #25366: [SPARK-27924][SQL][TEST][FOLLOW-UP] Open comment about boolean test URL: https://github.com/apache/spark/pull/25366#issuecomment-518858585 I opened a PR against your branch, @beliefer . Please review and merge. - https://github.com/beliefer/spark/pull/2
[GitHub] [spark] vanzin commented on a change in pull request #25342: [SPARK-28571][CORE][SHUFFLE] Use the shuffle writer plugin for the SortShuffleWriter
vanzin commented on a change in pull request #25342: [SPARK-28571][CORE][SHUFFLE] Use the shuffle writer plugin for the SortShuffleWriter URL: https://github.com/apache/spark/pull/25342#discussion_r311293073 ## File path: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ## @@ -718,6 +717,85 @@ private[spark] class ExternalSorter[K, V, C]( lengths } + /** + * Write all the data added into this ExternalSorter into a map output writer that pushes bytes + * to some arbitrary backing store. This is called by the SortShuffleWriter. + * + * @return array of lengths, in bytes, of each partition of the file (used by map output tracker) + */ + def writePartitionedMapOutput( + shuffleId: Int, mapId: Int, mapOutputWriter: ShuffleMapOutputWriter): Array[Long] = { Review comment: nit: multi-line arg style
[GitHub] [spark] vanzin commented on a change in pull request #25342: [SPARK-28571][CORE][SHUFFLE] Use the shuffle writer plugin for the SortShuffleWriter
vanzin commented on a change in pull request #25342: [SPARK-28571][CORE][SHUFFLE] Use the shuffle writer plugin for the SortShuffleWriter URL: https://github.com/apache/spark/pull/25342#discussion_r311295878 ## File path: core/src/main/scala/org/apache/spark/util/collection/ShufflePartitionPairsWriter.scala ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection + +import java.io.{Closeable, FilterOutputStream, OutputStream} + +import org.apache.spark.serializer.{SerializationStream, SerializerInstance, SerializerManager} +import org.apache.spark.shuffle.ShuffleWriteMetricsReporter +import org.apache.spark.shuffle.api.ShufflePartitionWriter +import org.apache.spark.storage.BlockId + +/** + * A key-value writer inspired by {@link DiskBlockObjectWriter} that pushes the bytes to an + * arbitrary partition writer instead of writing to local disk through the block manager. 
+ */ +private[spark] class ShufflePartitionPairsWriter( +partitionWriter: ShufflePartitionWriter, +serializerManager: SerializerManager, +serializerInstance: SerializerInstance, +blockId: BlockId, +writeMetrics: ShuffleWriteMetricsReporter) + extends PairsWriter with Closeable { + + private var isOpen = false + private var partitionStream: OutputStream = _ + private var wrappedStream: OutputStream = _ + private var objOut: SerializationStream = _ + private var numRecordsWritten = 0 + private var curNumBytesWritten = 0L + + override def write(key: Any, value: Any): Unit = { +if (!isOpen) { + open() + isOpen = true +} +objOut.writeKey(key) +objOut.writeValue(value) +writeMetrics.incRecordsWritten(1) + } + + private def open(): Unit = { +partitionStream = partitionWriter.openStream +wrappedStream = serializerManager.wrapStream(blockId, partitionStream) +objOut = serializerInstance.serializeStream(wrappedStream) + } + + override def close(): Unit = { +if (isOpen) { Review comment: Minor, but if there's an error in `open()` (e.g. when initializing `wrappedStream`) this will leave the underlying `partitionStream` opened. Maybe this flag isn't needed and you can just check whether the fields are initialized? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
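The resource-leak pattern the reviewer points out (a failure partway through `open()` leaving the innermost stream open because the single `isOpen` flag was never set) can be sketched as follows. This is a hedged Java illustration of the suggested fix, not the Scala code under review; the field and method names are simplified stand-ins:

```java
import java.io.IOException;
import java.io.OutputStream;

// Sketch of the close() pattern suggested in the review (illustrative names):
// rather than guarding close() with one isOpen flag, close each layered
// stream only if it was actually initialized, outermost first, so a failure
// partway through open() still releases the underlying stream.
public class LayeredCloseSketch {
    private OutputStream partitionStream;
    private OutputStream wrappedStream;

    void open(OutputStream underlying, boolean failWrapping) throws IOException {
        partitionStream = underlying;
        if (failWrapping) {
            // Simulates serializerManager.wrapStream() throwing:
            // wrappedStream stays null, but partitionStream is already set.
            throw new IOException("wrapping failed");
        }
        wrappedStream = partitionStream; // stand-in for serializer wrapping
    }

    void close() throws IOException {
        // Null checks replace the boolean flag; close outermost first.
        if (wrappedStream != null) {
            wrappedStream.close(); // also closes the stream it wraps
            wrappedStream = null;
            partitionStream = null;
        } else if (partitionStream != null) {
            // open() failed after the inner stream was created: still release it.
            partitionStream.close();
            partitionStream = null;
        }
    }
}
```

With this shape, `close()` is safe to call after a failed `open()`, which is exactly the case the review comment says the `isOpen` flag misses.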
[GitHub] [spark] vanzin commented on issue #25236: [SPARK-28487][k8s] More responsive dynamic allocation with K8S.
vanzin commented on issue #25236: [SPARK-28487][k8s] More responsive dynamic allocation with K8S. URL: https://github.com/apache/spark/pull/25236#issuecomment-518863671 Let's see if @mccheah has anything to add, otherwise I'll end up pushing before EOW.
[GitHub] [spark] dongjoon-hyun commented on issue #25229: [SPARK-27900][K8s] Add jvm oom flag
dongjoon-hyun commented on issue #25229: [SPARK-27900][K8s] Add jvm oom flag URL: https://github.com/apache/spark/pull/25229#issuecomment-518863944 Hmm, I see. That brings back [my original question](https://github.com/apache/spark/pull/25229#pullrequestreview-267298632). If the users supply `spark.driver.extraJavaOptions`, why do we try to enforce this on the Spark side? > Right now user supplied spark.driver.extraJavaOptions will be ignored because they are part of a properties files For me, the documentation update seems to be enough. What do you think?
[GitHub] [spark] dongjoon-hyun edited a comment on issue #25229: [SPARK-27900][K8s] Add jvm oom flag
dongjoon-hyun edited a comment on issue #25229: [SPARK-27900][K8s] Add jvm oom flag URL: https://github.com/apache/spark/pull/25229#issuecomment-518863944 Hmm, I see. That brings back [my original question](https://github.com/apache/spark/pull/25229#pullrequestreview-267298632). If the users supply `spark.driver.extraJavaOptions`, why do we try to enforce this on the Spark side? > Right now user supplied spark.driver.extraJavaOptions will be ignored because they are part of a properties files For me, the documentation update seems to be enough. What do you think? Did I miss something?
[GitHub] [spark] shivsood commented on a change in pull request #25344: [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect
shivsood commented on a change in pull request #25344: [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect URL: https://github.com/apache/spark/pull/25344#discussion_r311300690 ## File path: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala ## @@ -202,4 +204,25 @@ class MsSqlServerIntegrationSuite extends DockerJDBCIntegrationSuite { df2.write.jdbc(jdbcUrl, "datescopy", new Properties) df3.write.jdbc(jdbcUrl, "stringscopy", new Properties) } + + test("SPARK-28151 Test write table with BYTETYPE") { +val tableSchema = StructType(Seq(StructField("serialNum", ByteType, true))) +val tableData = Seq(Row(10)) +val df1 = spark.createDataFrame( + spark.sparkContext.parallelize(tableData), + tableSchema) + +df1.write + .format("jdbc") + .mode("overwrite") + .option("url", jdbcUrl) + .option("dbtable", "testTable") + .save() +val df2 = spark.read + .format("jdbc") + .option("url", jdbcUrl) + .option("dbtable", "byteTable") + .load() +df2.show() Review comment: Yes, would add a check of row counts.
[GitHub] [spark] shivsood commented on a change in pull request #25344: [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect
shivsood commented on a change in pull request #25344: [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect URL: https://github.com/apache/spark/pull/25344#discussion_r311300562 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ## @@ -550,7 +550,7 @@ object JdbcUtils extends Logging { case ByteType => (stmt: PreparedStatement, row: Row, pos: Int) => -stmt.setInt(pos + 1, row.getByte(pos)) +stmt.setByte(pos + 1, row.getByte(pos)) Review comment: Yes
[GitHub] [spark] vanzin commented on a change in pull request #25135: [SPARK-28367][SS] Use new KafkaConsumer.poll API in Kafka connector
vanzin commented on a change in pull request #25135: [SPARK-28367][SS] Use new KafkaConsumer.poll API in Kafka connector URL: https://github.com/apache/spark/pull/25135#discussion_r311300497

## File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala

```diff
@@ -419,6 +416,21 @@ private[kafka010] class KafkaOffsetReader(
     stopConsumer()
     _consumer = null  // will automatically get reinitialized again
   }
+
+  private def getPartitions(): ju.Set[TopicPartition] = {
+    consumer.poll(jt.Duration.ZERO)
+    var partitions = consumer.assignment()
+    val startTimeMs = System.currentTimeMillis()
```

Review comment: For this kind of logic it's better to use `System.nanoTime()` which is monotonic. Also you can do a little less computation this way:

```
val deadline = System.nanoTime() + someTimeout
while (... && System.nanoTime() < deadline) {
}
```
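The monotonic-deadline pattern suggested above can be sketched outside the JVM as well; here is a minimal Python analogue (using `time.monotonic` in place of `System.nanoTime()`; `poll_once` and the timeout are illustrative stand-ins, not the Kafka consumer API):

```python
import time

def poll_until(poll_once, timeout_s):
    """Call poll_once() until it returns a non-empty result or the
    monotonic deadline passes. Computing the deadline once keeps the
    loop condition to a single clock read per iteration, and a
    monotonic clock is immune to wall-clock adjustments."""
    deadline = time.monotonic() + timeout_s
    result = poll_once()
    while not result and time.monotonic() < deadline:
        result = poll_once()
    return result

# A poll source that only succeeds on the third call.
calls = {"n": 0}
def fake_poll():
    calls["n"] += 1
    return ["tp-0"] if calls["n"] >= 3 else []

print(poll_until(fake_poll, 1.0))
```

The deadline form also sidesteps the overflow-prone `now - start < timeout` subtraction that the quoted `System.currentTimeMillis()` version performs on every iteration.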
[GitHub] [spark] icexelloss commented on a change in pull request #24981: [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs
icexelloss commented on a change in pull request #24981: [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs URL: https://github.com/apache/spark/pull/24981#discussion_r311301959

## File path: python/pyspark/sql/tests/test_pandas_udf_cogrouped_map.py

```diff
@@ -0,0 +1,285 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import datetime
+import unittest
+import sys
+
+from collections import OrderedDict
+from decimal import Decimal
+
+from pyspark.sql import Row
+from pyspark.sql.functions import array, explode, col, lit, udf, sum, pandas_udf, PandasUDFType
+from pyspark.sql.types import *
+from pyspark.testing.sqlutils import ReusedSQLTestCase, have_pandas, have_pyarrow, \
+    pandas_requirement_message, pyarrow_requirement_message
+from pyspark.testing.utils import QuietTest
+
+if have_pandas:
+    import pandas as pd
+    from pandas.util.testing import assert_frame_equal, assert_series_equal
+
+if have_pyarrow:
+    import pyarrow as pa
+
+
+"""
+Tests below use pd.DataFrame.assign that will infer mixed types (unicode/str) for column names
+from kwargs w/ Python 2, so need to set check_column_type=False and avoid this check
+"""
+if sys.version < '3':
+    _check_column_type = False
+else:
+    _check_column_type = True
+
+
+@unittest.skipIf(
+    not have_pandas or not have_pyarrow,
+    pandas_requirement_message or pyarrow_requirement_message)
+class CoGroupedMapPandasUDFTests(ReusedSQLTestCase):
+
+    @property
+    def data1(self):
+        return self.spark.range(10).toDF('id') \
+            .withColumn("ks", array([lit(i) for i in range(20, 30)])) \
+            .withColumn("k", explode(col('ks'))) \
+            .withColumn("v", col('k') * 10) \
+            .drop('ks')
+
+    @property
+    def data2(self):
+        return self.spark.range(10).toDF('id') \
+            .withColumn("ks", array([lit(i) for i in range(20, 30)])) \
+            .withColumn("k", explode(col('ks'))) \
+            .withColumn("v2", col('k') * 100) \
+            .drop('ks')
+
+    def test_simple(self):
+        self._test_merge(self.data1, self.data2)
+
+    def test_left_group_empty(self):
+        left = self.data1.where(col("id") % 2 == 0)
+        self._test_merge(left, self.data2)
+
+    def test_right_group_empty(self):
+        right = self.data2.where(col("id") % 2 == 0)
+        self._test_merge(self.data1, right)
+
+    def test_different_schemas(self):
+        right = self.data2.withColumn('v3', lit('a'))
+        self._test_merge(self.data1, right, 'id long, k int, v int, v2 int, v3 string')
+
+    def test_complex_group_by(self):
+        left = pd.DataFrame.from_dict({
+            'id': [1, 2, 3],
+            'k': [5, 6, 7],
+            'v': [9, 10, 11]
+        })
+
+        right = pd.DataFrame.from_dict({
+            'id': [11, 12, 13],
+            'k': [5, 6, 7],
+            'v2': [90, 100, 110]
+        })
+
+        left_df = self.spark\
```

Review comment: Maybe rename to `left_cdf` (left cogrouped dataframe)?
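The cogrouped-map semantics these tests exercise — group both sides by key, hand each pair of groups to a user function, and tolerate an empty group on either side — can be sketched without Spark using plain dicts (`group_by`, `cogroup_apply`, and `merge` are illustrative names, not the PySpark API):

```python
from collections import defaultdict

def group_by(rows, key):
    """Group a list of row dicts by the value of one column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups

def cogroup_apply(left, right, key, func):
    """For each key present on either side, call func(left_group, right_group).
    A side with no rows for a key contributes an empty list, mirroring the
    test_left_group_empty / test_right_group_empty cases above."""
    lg, rg = group_by(left, key), group_by(right, key)
    out = []
    for k in sorted(set(lg) | set(rg)):
        out.extend(func(lg.get(k, []), rg.get(k, [])))
    return out

left = [{"id": 1, "k": 5, "v": 9}, {"id": 2, "k": 6, "v": 10}]
right = [{"id": 11, "k": 5, "v2": 90}]

def merge(ls, rs):
    # Cross-join the two groups; right-side columns win on collision.
    return [{**l, **r} for l in ls for r in rs]

print(cogroup_apply(left, right, "k", merge))
```

With this `merge`, a key that appears on only one side produces no output rows, which is exactly the degenerate case the empty-group tests pin down.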
[GitHub] [spark] yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder
yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder URL: https://github.com/apache/spark/pull/24983#discussion_r311303183

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala

```diff
@@ -470,3 +397,427 @@ object JoinReorderDPFilters extends PredicateHelper {
  * extended with the set of connected/unconnected plans.
  */
 case class JoinGraphInfo (starJoins: Set[Int], nonStarJoins: Set[Int])
+
+/**
+ * Reorder the joins using a genetic algorithm. The algorithm treat the reorder problem
+ * to a traveling salesmen problem, and use genetic algorithm give an optimized solution.
+ *
+ * The implementation refs the geqo in postgresql, which is contibuted by Darrell Whitley:
+ * https://www.postgresql.org/docs/9.1/geqo-pg-intro.html
+ *
+ * For more info about genetic algorithm and the edge recombination crossover, pls see:
+ * "A Genetic Algorithm Tutorial, Darrell Whitley"
+ * https://link.springer.com/article/10.1007/BF00175354
+ * and "Scheduling Problems and Traveling Salesmen: The Genetic Edge Recombination Operator,
+ * Darrell Whitley et al." https://dl.acm.org/citation.cfm?id=657238
+ * respectively.
+ */
+object JoinReorderGA extends PredicateHelper with Logging {
+
+  def search(
+      conf: SQLConf,
+      items: Seq[LogicalPlan],
+      conditions: Set[Expression],
+      output: Seq[Attribute]): Option[LogicalPlan] = {
+
+    val startTime = System.nanoTime()
+
+    val itemsWithIndex = items.zipWithIndex.map {
+      case (plan, id) => id -> JoinPlan(Set(id), plan, Set.empty, Cost(0, 0))
+    }.toMap
+
+    val topOutputSet = AttributeSet(output)
+
+    val pop = Population(conf, itemsWithIndex, conditions, topOutputSet).evolve
+
+    val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+    logInfo(s"Join reordering finished. Duration: $durationInMs ms, number of items: " +
+      s"${items.length}, number of plans in memo: ${pop.chromos.size}")
+
+    assert(pop.chromos.head.basicPlans.size == items.length)
+    pop.chromos.head.integratedPlan match {
+      case Some(joinPlan) => joinPlan.plan match {
+        case p @ Project(projectList, _: Join) if projectList != output =>
+          assert(topOutputSet == p.outputSet)
+          // Keep the same order of final output attributes.
+          Some(p.copy(projectList = output))
+        case finalPlan if !sameOutput(finalPlan, output) =>
+          Some(Project(output, finalPlan))
+        case finalPlan =>
+          Some(finalPlan)
+      }
+      case _ => None
+    }
+  }
+}
+
+/**
+ * A pair of parent individuals can breed a child with certain crossover process.
+ * With crossover, child can inherit gene from its parents, and these gene snippets
+ * finally compose a new [[Chromosome]].
+ */
+@DeveloperApi
+trait Crossover {
+
+  /**
+   * Generate a new [[Chromosome]] from the given parent [[Chromosome]]s,
+   * with this crossover algorithm.
+   */
+  def newChromo(father: Chromosome, mother: Chromosome) : Chromosome
+}
+
+case class EdgeTable(table: Map[JoinPlan, Seq[JoinPlan]])
```

Review comment: Make it an inner class of `EdgeRecombination`? I don't think it's logically at the same level as the other classes.
[GitHub] [spark] yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder
yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder URL: https://github.com/apache/spark/pull/24983#discussion_r311303978

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala

```diff
@@ -470,3 +397,427 @@ object JoinReorderDPFilters extends PredicateHelper {
  * extended with the set of connected/unconnected plans.
  */
 case class JoinGraphInfo (starJoins: Set[Int], nonStarJoins: Set[Int])
+
+/**
+ * Reorder the joins using a genetic algorithm. The algorithm treat the reorder problem
+ * to a traveling salesmen problem, and use genetic algorithm give an optimized solution.
+ *
+ * The implementation refs the geqo in postgresql, which is contibuted by Darrell Whitley:
+ * https://www.postgresql.org/docs/9.1/geqo-pg-intro.html
+ *
+ * For more info about genetic algorithm and the edge recombination crossover, pls see:
+ * "A Genetic Algorithm Tutorial, Darrell Whitley"
+ * https://link.springer.com/article/10.1007/BF00175354
+ * and "Scheduling Problems and Traveling Salesmen: The Genetic Edge Recombination Operator,
+ * Darrell Whitley et al." https://dl.acm.org/citation.cfm?id=657238
+ * respectively.
+ */
+object JoinReorderGA extends PredicateHelper with Logging {
+
+  def search(
+      conf: SQLConf,
+      items: Seq[LogicalPlan],
+      conditions: Set[Expression],
+      output: Seq[Attribute]): Option[LogicalPlan] = {
+
+    val startTime = System.nanoTime()
+
+    val itemsWithIndex = items.zipWithIndex.map {
+      case (plan, id) => id -> JoinPlan(Set(id), plan, Set.empty, Cost(0, 0))
+    }.toMap
+
+    val topOutputSet = AttributeSet(output)
+
+    val pop = Population(conf, itemsWithIndex, conditions, topOutputSet).evolve
+
+    val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+    logInfo(s"Join reordering finished. Duration: $durationInMs ms, number of items: " +
+      s"${items.length}, number of plans in memo: ${pop.chromos.size}")
+
+    assert(pop.chromos.head.basicPlans.size == items.length)
+    pop.chromos.head.integratedPlan match {
+      case Some(joinPlan) => joinPlan.plan match {
+        case p @ Project(projectList, _: Join) if projectList != output =>
+          assert(topOutputSet == p.outputSet)
+          // Keep the same order of final output attributes.
+          Some(p.copy(projectList = output))
+        case finalPlan if !sameOutput(finalPlan, output) =>
+          Some(Project(output, finalPlan))
+        case finalPlan =>
+          Some(finalPlan)
+      }
+      case _ => None
+    }
+  }
+}
+
+/**
+ * A pair of parent individuals can breed a child with certain crossover process.
+ * With crossover, child can inherit gene from its parents, and these gene snippets
+ * finally compose a new [[Chromosome]].
+ */
+@DeveloperApi
+trait Crossover {
+
+  /**
+   * Generate a new [[Chromosome]] from the given parent [[Chromosome]]s,
+   * with this crossover algorithm.
+   */
+  def newChromo(father: Chromosome, mother: Chromosome) : Chromosome
+}
+
+case class EdgeTable(table: Map[JoinPlan, Seq[JoinPlan]])
+
+/**
+ * This class implements the Genetic Edge Recombination algorithm.
+ * For more information about the Genetic Edge Recombination,
+ * see "Scheduling Problems and Traveling Salesmen: The Genetic Edge
+ * Recombination Operator" by Darrell Whitley et al.
+ * https://dl.acm.org/citation.cfm?id=657238
+ */
+object EdgeRecombination extends Crossover {
+
+  def genEdgeTable(father: Chromosome, mother: Chromosome) : EdgeTable = {
+    val fatherTable = father.basicPlans.map(g => g -> findNeighbours(father.basicPlans, g)).toMap
+    val motherTable = mother.basicPlans.map(g => g -> findNeighbours(mother.basicPlans, g)).toMap
+    EdgeTable(
+      fatherTable.map(entry => entry._1 -> (entry._2 ++ motherTable(entry._1))))
+  }
+
+  def findNeighbours(genes: Seq[JoinPlan], g: JoinPlan) : Seq[JoinPlan] = {
+    val genesIndexed = genes.toIndexedSeq
+    val index = genesIndexed.indexOf(g)
+    val length = genes.size
+    if (index > 0 && index < length - 1) {
+      Seq(genesIndexed(index - 1), genesIndexed(index + 1))
+    } else if (index == 0) {
+      Seq(genesIndexed(1), genesIndexed(length - 1))
+    } else if (index == length - 1) {
+      Seq(genesIndexed(0), genesIndexed(length - 2))
+    } else {
+      Seq()
+    }
+  }
+
+  override def newChromo(father: Chromosome, mother: Chromosome): Chromosome = {
+    var newGenes: Seq[JoinPlan] = Seq()
+    // 1. Generate the edge table.
+    var table = genEdgeTable(father, mother).table
+    // 2. Choose a start point randomly from the heads of father/mother.
+    var current =
+      if (util.Random.nextInt(2) == 0) father.basicPlans.head else mother.basicPlans.head
+    newGenes :+= current
+
+    var stop = false
+    while (!stop) {
+      // 3. Filter out the chosen point from the edge table.
+      table = table.map(
```
[GitHub] [spark] yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder
yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder URL: https://github.com/apache/spark/pull/24983#discussion_r311304704

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala

```diff
@@ -470,3 +397,427 @@ object JoinReorderDPFilters extends PredicateHelper {
  * extended with the set of connected/unconnected plans.
  */
 case class JoinGraphInfo (starJoins: Set[Int], nonStarJoins: Set[Int])
+
+/**
+ * Reorder the joins using a genetic algorithm. The algorithm treat the reorder problem
+ * to a traveling salesmen problem, and use genetic algorithm give an optimized solution.
+ *
+ * The implementation refs the geqo in postgresql, which is contibuted by Darrell Whitley:
+ * https://www.postgresql.org/docs/9.1/geqo-pg-intro.html
+ *
+ * For more info about genetic algorithm and the edge recombination crossover, pls see:
+ * "A Genetic Algorithm Tutorial, Darrell Whitley"
+ * https://link.springer.com/article/10.1007/BF00175354
+ * and "Scheduling Problems and Traveling Salesmen: The Genetic Edge Recombination Operator,
+ * Darrell Whitley et al." https://dl.acm.org/citation.cfm?id=657238
+ * respectively.
+ */
+object JoinReorderGA extends PredicateHelper with Logging {
+
+  def search(
+      conf: SQLConf,
+      items: Seq[LogicalPlan],
+      conditions: Set[Expression],
+      output: Seq[Attribute]): Option[LogicalPlan] = {
+
+    val startTime = System.nanoTime()
+
+    val itemsWithIndex = items.zipWithIndex.map {
+      case (plan, id) => id -> JoinPlan(Set(id), plan, Set.empty, Cost(0, 0))
+    }.toMap
+
+    val topOutputSet = AttributeSet(output)
+
+    val pop = Population(conf, itemsWithIndex, conditions, topOutputSet).evolve
+
+    val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+    logInfo(s"Join reordering finished. Duration: $durationInMs ms, number of items: " +
+      s"${items.length}, number of plans in memo: ${pop.chromos.size}")
+
+    assert(pop.chromos.head.basicPlans.size == items.length)
+    pop.chromos.head.integratedPlan match {
+      case Some(joinPlan) => joinPlan.plan match {
+        case p @ Project(projectList, _: Join) if projectList != output =>
+          assert(topOutputSet == p.outputSet)
+          // Keep the same order of final output attributes.
+          Some(p.copy(projectList = output))
+        case finalPlan if !sameOutput(finalPlan, output) =>
+          Some(Project(output, finalPlan))
+        case finalPlan =>
+          Some(finalPlan)
+      }
+      case _ => None
+    }
+  }
+}
+
+/**
+ * A pair of parent individuals can breed a child with certain crossover process.
+ * With crossover, child can inherit gene from its parents, and these gene snippets
+ * finally compose a new [[Chromosome]].
+ */
+@DeveloperApi
+trait Crossover {
+
+  /**
+   * Generate a new [[Chromosome]] from the given parent [[Chromosome]]s,
+   * with this crossover algorithm.
+   */
+  def newChromo(father: Chromosome, mother: Chromosome) : Chromosome
+}
+
+case class EdgeTable(table: Map[JoinPlan, Seq[JoinPlan]])
```

Review comment: BTW, there's no need to introduce an extra case class. We can simply do `type EdgeTable = Map[JoinPlan, Seq[JoinPlan]]`.
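The edge table at the center of this discussion — each gene mapped to the union of its neighbours in both parent tours, treating each tour as a cycle — is easy to sketch in plain Python (gene labels are strings here rather than `JoinPlan`s; `neighbours` and `gen_edge_table` mirror `findNeighbours`/`genEdgeTable` above only in spirit):

```python
def neighbours(tour, g):
    """Neighbours of g in a circular tour: the elements just before and
    just after it, wrapping around at the ends."""
    i = tour.index(g)
    n = len(tour)
    return [tour[(i - 1) % n], tour[(i + 1) % n]]

def gen_edge_table(father, mother):
    """Concatenate each gene's neighbours from the two parent tours.
    This plain dict is exactly the shape the reviewer suggests exposing
    as a type alias instead of a wrapper case class."""
    return {g: neighbours(father, g) + neighbours(mother, g) for g in father}

father = ["A", "B", "C", "D"]
mother = ["B", "D", "A", "C"]
print(gen_edge_table(father, mother))
```

The modular-index form collapses the three-way `if/else if` in the quoted Scala into one expression, which is one way to see that the original branches are just wrap-around handling.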
[GitHub] [spark] SparkQA commented on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables.
SparkQA commented on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables. URL: https://github.com/apache/spark/pull/25040#issuecomment-518869859 **[Test build #108729 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108729/testReport)** for PR 25040 at commit [`cff78a1`](https://github.com/apache/spark/commit/cff78a16e691917e812b4cd63bf7544a54af4742). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables.
SparkQA removed a comment on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables. URL: https://github.com/apache/spark/pull/25040#issuecomment-518802375 **[Test build #108729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108729/testReport)** for PR 25040 at commit [`cff78a1`](https://github.com/apache/spark/commit/cff78a16e691917e812b4cd63bf7544a54af4742).
[GitHub] [spark] zsxwing commented on a change in pull request #25135: [SPARK-28367][SS] Use new KafkaConsumer.poll API in Kafka connector
zsxwing commented on a change in pull request #25135: [SPARK-28367][SS] Use new KafkaConsumer.poll API in Kafka connector URL: https://github.com/apache/spark/pull/25135#discussion_r311305080

## File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala

```diff
@@ -419,6 +416,21 @@ private[kafka010] class KafkaOffsetReader(
     stopConsumer()
     _consumer = null  // will automatically get reinitialized again
   }
+
+  private def getPartitions(): ju.Set[TopicPartition] = {
+    consumer.poll(jt.Duration.ZERO)
+    var partitions = consumer.assignment()
+    val startTimeMs = System.currentTimeMillis()
+    while (partitions.isEmpty && System.currentTimeMillis() - startTimeMs < pollTimeoutMs) {
+      // Poll to get the latest assigned partitions
+      consumer.poll(jt.Duration.ofMillis(100))
```

Review comment: So using this new API will pull data to driver. Right? The previous `poll(0)` is basically a hack to avoid polling data in driver. Maybe we should ask the Kafka community to add a new API to pull metadata only.
[GitHub] [spark] AmplabJenkins commented on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables.
AmplabJenkins commented on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables. URL: https://github.com/apache/spark/pull/25040#issuecomment-518870398 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108729/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables.
AmplabJenkins commented on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables. URL: https://github.com/apache/spark/pull/25040#issuecomment-518870396 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables.
AmplabJenkins removed a comment on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables. URL: https://github.com/apache/spark/pull/25040#issuecomment-518870396 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables.
AmplabJenkins removed a comment on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables. URL: https://github.com/apache/spark/pull/25040#issuecomment-518870398 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108729/ Test PASSed.
[GitHub] [spark] skonto commented on issue #25229: [SPARK-27900][K8s] Add jvm oom flag
skonto commented on issue #25229: [SPARK-27900][K8s] Add jvm oom flag URL: https://github.com/apache/spark/pull/25229#issuecomment-518871183 @Dooyoung-Hwang that was one of my options. This PR was intended to protect the user from the Spark issues when things don't go well and the driver does not exit on K8s. Some people asked for this as well on the related discussion: https://github.com/apache/spark/pull/24796. Anyway, works for me; what is the best UX?
[GitHub] [spark] skonto edited a comment on issue #25229: [SPARK-27900][K8s] Add jvm oom flag
skonto edited a comment on issue #25229: [SPARK-27900][K8s] Add jvm oom flag URL: https://github.com/apache/spark/pull/25229#issuecomment-518871183 @Dooyoung-Hwang that was one of my options. This PR was intended to protect the user from the Spark issues when things don't go well and the driver does not exit on K8s. Some people asked for this as well on the related discussion: https://github.com/apache/spark/pull/24796. I don't mind adding a note to the docs; what is the best UX?
[GitHub] [spark] yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder
yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder URL: https://github.com/apache/spark/pull/24983#discussion_r311307700 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ## @@ -470,3 +397,427 @@ object JoinReorderDPFilters extends PredicateHelper { * extended with the set of connected/unconnected plans. */

case class JoinGraphInfo (starJoins: Set[Int], nonStarJoins: Set[Int])

/**
 * Reorder the joins using a genetic algorithm. The algorithm treats the reorder problem
 * as a traveling salesman problem, and uses a genetic algorithm to give an optimized solution.
 *
 * The implementation references the GEQO module in PostgreSQL, which was contributed by
 * Darrell Whitley: https://www.postgresql.org/docs/9.1/geqo-pg-intro.html
 *
 * For more info about the genetic algorithm and the edge recombination crossover, please see
 * "A Genetic Algorithm Tutorial", Darrell Whitley,
 * https://link.springer.com/article/10.1007/BF00175354
 * and "Scheduling Problems and Traveling Salesmen: The Genetic Edge Recombination Operator",
 * Darrell Whitley et al., https://dl.acm.org/citation.cfm?id=657238
 * respectively.
 */
object JoinReorderGA extends PredicateHelper with Logging {

  def search(
      conf: SQLConf,
      items: Seq[LogicalPlan],
      conditions: Set[Expression],
      output: Seq[Attribute]): Option[LogicalPlan] = {

    val startTime = System.nanoTime()

    val itemsWithIndex = items.zipWithIndex.map {
      case (plan, id) => id -> JoinPlan(Set(id), plan, Set.empty, Cost(0, 0))
    }.toMap

    val topOutputSet = AttributeSet(output)

    val pop = Population(conf, itemsWithIndex, conditions, topOutputSet).evolve

    val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
    logInfo(s"Join reordering finished. Duration: $durationInMs ms, number of items: " +
      s"${items.length}, number of plans in memo: ${pop.chromos.size}")

    assert(pop.chromos.head.basicPlans.size == items.length)
    pop.chromos.head.integratedPlan match {
      case Some(joinPlan) => joinPlan.plan match {
        case p @ Project(projectList, _: Join) if projectList != output =>
          assert(topOutputSet == p.outputSet)
          // Keep the same order of final output attributes.
          Some(p.copy(projectList = output))
        case finalPlan if !sameOutput(finalPlan, output) =>
          Some(Project(output, finalPlan))
        case finalPlan =>
          Some(finalPlan)
      }
      case _ => None
    }
  }
}

/**
 * A pair of parent individuals can breed a child with a certain crossover process.
 * With crossover, a child can inherit genes from its parents, and these gene snippets
 * finally compose a new [[Chromosome]].
 */
@DeveloperApi
trait Crossover {

  /**
   * Generate a new [[Chromosome]] from the given parent [[Chromosome]]s,
   * with this crossover algorithm.
   */
  def newChromo(father: Chromosome, mother: Chromosome): Chromosome
}

case class EdgeTable(table: Map[JoinPlan, Seq[JoinPlan]])

/**
 * This class implements the Genetic Edge Recombination algorithm.
 * For more information about Genetic Edge Recombination, see "Scheduling Problems and
 * Traveling Salesmen: The Genetic Edge Recombination Operator" by Darrell Whitley et al.,
 * https://dl.acm.org/citation.cfm?id=657238
 */
object EdgeRecombination extends Crossover {

  def genEdgeTable(father: Chromosome, mother: Chromosome): EdgeTable = {
    val fatherTable = father.basicPlans.map(g => g -> findNeighbours(father.basicPlans, g)).toMap
    val motherTable = mother.basicPlans.map(g => g -> findNeighbours(mother.basicPlans, g)).toMap
    EdgeTable(fatherTable.map(entry => entry._1 -> (entry._2 ++ motherTable(entry._1))))
  }

  def findNeighbours(genes: Seq[JoinPlan], g: JoinPlan): Seq[JoinPlan] = {
    val genesIndexed = genes.toIndexedSeq
    val index = genesIndexed.indexOf(g)
    val length = genes.size
    if (index > 0 && index < length - 1) {
      Seq(genesIndexed(index - 1), genesIndexed(index + 1))
    } else if (index == 0) {
      Seq(genesIndexed(1), genesIndexed(length - 1))
    } else if (index == length - 1) {
      Seq(genesIndexed(0), genesIndexed(length - 2))
    } else {
      Seq()
    }
  }

  override def newChromo(father: Chromosome, mother: Chromosome): Chromosome = {
    var newGenes: Seq[JoinPlan] = Seq()
    // 1. Generate the edge table.
    var table = genEdgeTable(father, mother).table
    // 2. Choose a start point randomly from the heads of father/mother.
    var current =
      if (util.Random.nextInt(2) == 0) father.basicPlans.head else mother.basicPlans.head
    newGenes :+= current

    var stop = false
    while (!stop) {
      // 3. Filter out the chosen point from the edge table.
      table = table.map(
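The `findNeighbours`/`genEdgeTable` pair in the review diff above builds the edge table used by edge recombination crossover: each gene's neighbours on both parents' circular tours are collected together. As a hedged illustration only (this is not the PR's code; the function and variable names here are hypothetical), the same lookup can be sketched in Python over plain integers standing in for `JoinPlan`s:

```python
def find_neighbours(genes, g):
    """Return the two tour neighbours of gene g, treating the sequence
    as a circular tour (mirrors the Scala findNeighbours above)."""
    i = genes.index(g)
    n = len(genes)
    if n < 2:
        return []
    # Modular indexing covers the interior, head, and tail cases at once.
    return [genes[(i - 1) % n], genes[(i + 1) % n]]

def gen_edge_table(father, mother):
    """Union each gene's neighbours across the two parent tours --
    the edge table consumed by edge recombination crossover."""
    return {g: find_neighbours(father, g) + find_neighbours(mother, g)
            for g in father}

# Example: two parent tours over genes 0..3.
table = gen_edge_table([0, 1, 2, 3], [3, 1, 0, 2])
# Gene 1 keeps edges to 0 and 2 from the father plus 3 from the mother.
```

The modular-index form is equivalent to the Scala version's three-way branch; the child tour is then grown by repeatedly picking the neighbour with the fewest remaining edges, which is the part the diff truncates.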
[GitHub] [spark] viirya commented on a change in pull request #25360: [SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-by.sql'
viirya commented on a change in pull request #25360: [SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-by.sql' URL: https://github.com/apache/spark/pull/25360#discussion_r311308029 ## File path: sql/core/src/test/resources/sql-tests/inputs/udf/udf-group-by.sql ## @@ -20,29 +20,25 @@ SELECT 'foo', COUNT(udf(a)) FROM testData GROUP BY 1; SELECT 'foo' FROM testData WHERE a = 0 GROUP BY udf(1); -- Aggregate grouped by literals (hash aggregate). -SELECT 'foo', udf(APPROX_COUNT_DISTINCT(udf(a))) FROM testData WHERE a = 0 GROUP BY 1; +SELECT 'foo', udf(APPROX_COUNT_DISTINCT(udf(a))) FROM testData WHERE a = 0 GROUP BY udf(1); -- Aggregate grouped by literals (sort aggregate). -SELECT 'foo', MAX(STRUCT(udf(a))) FROM testData WHERE a = 0 GROUP BY 1; +SELECT 'foo', MAX(STRUCT(udf(a))) FROM testData WHERE a = 0 GROUP BY udf(1); -- Aggregate with complex GroupBy expressions. SELECT udf(a + b), udf(COUNT(b)) FROM testData GROUP BY a + b; SELECT udf(a + 2), udf(COUNT(b)) FROM testData GROUP BY a + 1; - --- [SPARK-28445] Inconsistency between Scala and Python/Panda udfs when groupby with udf() is used --- The following query will make Scala UDF work, but Python and Pandas udfs will fail with an AnalysisException. --- The query should be added after SPARK-28445. --- SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1); +SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1); Review comment: From the diff, I think the original query is `SELECT a + 1 + 1, COUNT(b) FROM testData GROUP BY a + 1`, so why not use `SELECT udf(a + 1) + 1, udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)` here? Then the result should not differ.
[GitHub] [spark] jzhuge opened a new pull request #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined
jzhuge opened a new pull request #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined URL: https://github.com/apache/spark/pull/25372 ## What changes were proposed in this pull request? LookupCatalog.sessionCatalog logs an error message and the exception stack upon any nonfatal exception. When session catalog is not defined, this may alarm the user unnecessarily. It should be enough to give a warning and return None. ## How was this patch tested? Manually run `spark.sessionState.analyzer.sessionCatalog` in Spark shell, expect a warning, not an error and a stack trace. ``` scala> spark.sessionState.analyzer.sessionCatalog 2019-08-06 15:43:23,068 WARN [main] hive.HiveSessionStateBuilder$$anon$1 (Logging.scala:logWarning(69)) - Session catalog is not defined res1: Option[org.apache.spark.sql.catalog.v2.CatalogPlugin] = None ```
[GitHub] [spark] AmplabJenkins commented on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined
AmplabJenkins commented on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined URL: https://github.com/apache/spark/pull/25372#issuecomment-518873586 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13815/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined
AmplabJenkins commented on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined URL: https://github.com/apache/spark/pull/25372#issuecomment-518873584 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined
SparkQA commented on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined URL: https://github.com/apache/spark/pull/25372#issuecomment-518874024 **[Test build #108733 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108733/testReport)** for PR 25372 at commit [`8dd1feb`](https://github.com/apache/spark/commit/8dd1feba8231dc0b73e08935997f4acd8eb957d6).
[GitHub] [spark] AmplabJenkins removed a comment on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined
AmplabJenkins removed a comment on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined URL: https://github.com/apache/spark/pull/25372#issuecomment-518873586 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13815/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined
AmplabJenkins removed a comment on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined URL: https://github.com/apache/spark/pull/25372#issuecomment-518873584 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined
AmplabJenkins commented on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined URL: https://github.com/apache/spark/pull/25372#issuecomment-518875170 Can one of the admins verify this patch?
[GitHub] [spark] SparkQA commented on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
SparkQA commented on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25340#issuecomment-518881881 **[Test build #108731 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108731/testReport)** for PR 25340 at commit [`fd9a8fb`](https://github.com/apache/spark/commit/fd9a8fb3543475558d9e80e8791baa6c908e15b8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
SparkQA removed a comment on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25340#issuecomment-518827893 **[Test build #108731 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108731/testReport)** for PR 25340 at commit [`fd9a8fb`](https://github.com/apache/spark/commit/fd9a8fb3543475558d9e80e8791baa6c908e15b8).
[GitHub] [spark] maropu commented on a change in pull request #25331: [SPARK-27768][SQL] Infinity, -Infinity, NaN should be recognized in a case insensitive manner.
maropu commented on a change in pull request #25331: [SPARK-27768][SQL] Infinity, -Infinity, NaN should be recognized in a case insensitive manner. URL: https://github.com/apache/spark/pull/25331#discussion_r311317299 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala ## @@ -562,8 +593,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String // FloatConverter private[this] def castToFloat(from: DataType): Any => Any = from match { case StringType => - buildCast[UTF8String](_, s => try s.toString.toFloat catch { -case _: NumberFormatException => null + buildCast[UTF8String](_, s => { +val floatStr = s.toString +try floatStr.toFloat catch { + case _: NumberFormatException => Review comment: Could we check the numbers with a simple query? e.g., ``` scala> spark.range(10).selectExpr("CAST(double(id) AS STRING) a").write.save("/tmp/test") scala> spark.read.load("/tmp/test").selectExpr("CAST(a AS DOUBLE)").write.format("noop").save() ``` In another PR, I observed that logic depending on exceptions causes high performance penalties: https://github.com/lz4/lz4-java/pull/143 So, I'm a bit worried that this current logic has the same issue.
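The performance concern in the review above is that throwing and catching an exception for every malformed value is far more expensive than a cheap up-front check. As a hedged sketch only (this is not Spark's `castToFloat`; the pattern and function names are hypothetical), a pre-screen can keep the exceptional path rare while still accepting the case-insensitive special literals that SPARK-27768 targets:

```python
import re

# Cheap syntactic screen for ordinary decimal/scientific notation, so the
# try/except path is only reached for strings that look numeric.
FLOAT_RE = re.compile(r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?')

# Special literals accepted case-insensitively (illustrative set).
SPECIALS = ('nan', 'inf', '-inf', 'infinity', '+infinity', '-infinity')

def cast_to_float(s):
    """Return float(s), or None for unparseable input, without relying on
    an exception for every malformed value."""
    t = s.strip()
    if not FLOAT_RE.fullmatch(t):
        if t.lower() in SPECIALS:
            return float(t)  # Python's float() accepts these case-insensitively
        return None
    try:
        return float(t)  # rarely fails once the screen has passed
    except ValueError:
        return None
```

The trade-off is a small constant regex cost on every value versus an exception cost only on malformed ones; which wins depends on the error rate of the data, which is exactly what the suggested benchmark query would measure.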
[GitHub] [spark] AmplabJenkins commented on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
AmplabJenkins commented on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25340#issuecomment-518882255 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
AmplabJenkins commented on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25340#issuecomment-518882256 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108731/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
AmplabJenkins removed a comment on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25340#issuecomment-518882256 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108731/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
AmplabJenkins removed a comment on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25340#issuecomment-518882255 Merged build finished. Test PASSed.
[GitHub] [spark] maropu commented on a change in pull request #25344: [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect
maropu commented on a change in pull request #25344: [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect URL: https://github.com/apache/spark/pull/25344#discussion_r311319686 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ## @@ -550,7 +550,7 @@ object JdbcUtils extends Logging { case ByteType => (stmt: PreparedStatement, row: Row, pos: Int) => -stmt.setInt(pos + 1, row.getByte(pos)) +stmt.setByte(pos + 1, row.getByte(pos)) Review comment: we don't have any test for this code path now?
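The one-line diff above swaps `stmt.setInt` for `stmt.setByte` when binding a `ByteType` value. The practical difference is range enforcement: a signed byte (the TINYINT range, -128..127) fails loudly on overflow instead of being silently widened to a 32-bit int. As a hedged, illustrative sketch only (not JdbcUtils; the helper names here are hypothetical), Python's `struct` module shows the same distinction:

```python
import struct

def set_byte(value):
    """Pack as a signed byte, analogous to stmt.setByte: values outside
    -128..127 raise struct.error instead of being widened."""
    return struct.pack('b', value)

def set_int(value):
    """Pack as a 32-bit int, analogous to stmt.setInt: any byte value
    fits, so range violations go undetected at bind time."""
    return struct.pack('i', value)
```

With the int setter, an out-of-range value would only surface (if at all) on the database side; the byte setter keeps the check at the JDBC boundary, matching the TINYINT column type the dialect now maps to.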
[GitHub] [spark] nvander1 edited a comment on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API
nvander1 edited a comment on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API URL: https://github.com/apache/spark/pull/24232#issuecomment-517880269 @HyukjinKwon I think it's all good now. Added forall from https://github.com/apache/spark/pull/24761
[GitHub] [spark] AmplabJenkins removed a comment on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API
AmplabJenkins removed a comment on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API URL: https://github.com/apache/spark/pull/24232#issuecomment-518884999 Merged build finished. Test PASSed.
[GitHub] [spark] viirya commented on issue #25215: [SPARK-28445][SQL][PYTHON] Fix error when PythonUDF is used in both group by and aggregate expression
viirya commented on issue #25215: [SPARK-28445][SQL][PYTHON] Fix error when PythonUDF is used in both group by and aggregate expression URL: https://github.com/apache/spark/pull/25215#issuecomment-518884995 @shivusondur @HyukjinKwon The analysis exception from adding a udf to group by is caused by SPARK-28386 and SPARK-26741.
```
== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: cannot resolve '`b`' given input columns: [CAST(udf(cast(b as string)) AS INT), CAST(udf(cast(c as string)) AS STRING)]; line 2 pos 63;
'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
+- Project [CAST(udf(cast(b as string)) AS INT)#x, CAST(udf(cast(c as string)) AS STRING)#x]
   +- Filter (cast(udf(cast(count(1)#xL as string)) as bigint) = cast(1 as bigint))
      +- Aggregate [cast(udf(cast(b#x as string)) as int), cast(udf(cast(c#x as string)) as string)], [cast(udf(cast(b#x as string)) as int) AS CAST(udf(cast(b as string)) AS INT)#x, cast(udf(cast(c#x as string)) as string) AS CAST(udf(cast(c as string)) AS STRING)#x, count(1) AS count(1)#xL]
         +- SubqueryAlias `default`.`test_having`
            +- Relation[a#x,b#x,c#x,d#x] parquet
```
[GitHub] [spark] AmplabJenkins commented on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API
AmplabJenkins commented on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API URL: https://github.com/apache/spark/pull/24232#issuecomment-518884999 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API
AmplabJenkins commented on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API URL: https://github.com/apache/spark/pull/24232#issuecomment-518885010 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13816/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API
AmplabJenkins removed a comment on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API URL: https://github.com/apache/spark/pull/24232#issuecomment-518885010 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13816/ Test PASSed.
[GitHub] [spark] SparkQA commented on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API
SparkQA commented on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API URL: https://github.com/apache/spark/pull/24232#issuecomment-518885405 **[Test build #108734 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108734/testReport)** for PR 24232 at commit [`a8c7ecd`](https://github.com/apache/spark/commit/a8c7ecd27b8d0fcabfd86571eeba801bb5c7e62a).
[GitHub] [spark] maropu commented on issue #25366: [SPARK-27924][SQL][TEST][FOLLOW-UP] Open comment about boolean test
maropu commented on issue #25366: [SPARK-27924][SQL][TEST][FOLLOW-UP] Open comment about boolean test URL: https://github.com/apache/spark/pull/25366#issuecomment-518885803 Does https://github.com/apache/spark/pull/25357 affect the output in this PR? If so, I think it'd be better to merge that PR first to keep the code changes smaller.
[GitHub] [spark] dilipbiswal commented on a change in pull request #25331: [SPARK-27768][SQL] Infinity, -Infinity, NaN should be recognized in a case insensitive manner.
dilipbiswal commented on a change in pull request #25331: [SPARK-27768][SQL] Infinity, -Infinity, NaN should be recognized in a case insensitive manner. URL: https://github.com/apache/spark/pull/25331#discussion_r311322905 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala ## @@ -562,8 +593,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String // FloatConverter private[this] def castToFloat(from: DataType): Any => Any = from match { case StringType => - buildCast[UTF8String](_, s => try s.toString.toFloat catch { -case _: NumberFormatException => null + buildCast[UTF8String](_, s => { +val floatStr = s.toString +try floatStr.toFloat catch { + case _: NumberFormatException => Review comment: @maropu OK, I will give it a try. One thing to note here is that we didn't introduce a try/catch in this PR; it existed before. We just added some extra code in the catch block. However, I will give it a try and get back.
[GitHub] [spark] viirya commented on a change in pull request #23701: [SPARK-26741][SQL] Allow using aggregate expressions in ORDER BY clause
viirya commented on a change in pull request #23701: [SPARK-26741][SQL] Allow using aggregate expressions in ORDER BY clause URL: https://github.com/apache/spark/pull/23701#discussion_r311321824 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ## @@ -1722,22 +1723,28 @@ class Analyzer( case ae: AnalysisException => f } - case sort @ Sort(sortOrder, global, aggregate: Aggregate) if aggregate.resolved => - + case sort @ Sort(sortOrder, global, child) + if child.resolved && relatedAggregate(child).isDefined => +// The Aggregate plan may not be a direct child of the sort when there is a HAVING clause. +// In that case a `Filter` and/or a `Project` can be present between the `Sort` and the +// related `Aggregate`. +val aggregate = relatedAggregate(child).get // Try resolving the ordering as though it is in the aggregate clause. try { // If a sort order is unresolved, containing references not in aggregate, or containing // `AggregateExpression`, we need to push it down to the underlying aggregate operator. val unresolvedSortOrders = sortOrder.filter { s => -!s.resolved || !s.references.subsetOf(aggregate.outputSet) || containsAggregate(s) +!s.resolved || !s.references.subsetOf(child.outputSet) || containsAggregate(s) } - val aliasedOrdering = -unresolvedSortOrders.map(o => Alias(o.child, "aggOrder")()) - val aggregatedOrdering = aggregate.copy(aggregateExpressions = aliasedOrdering) + val namedExpressionsOrdering = +unresolvedSortOrders.map(_.child match { + case a: Attribute => a Review comment: This `Attribute` case is new. Previously we just aliased the aggregate expression; why is the attribute case needed here?
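The plan shape this thread discusses — an `Aggregate` separated from the `Sort` by a `Filter` (from HAVING) and/or a `Project` — can be illustrated with a toy plan tree. These are stand-in classes, not Spark's actual `LogicalPlan` nodes; `relatedAggregate` here only mirrors the lookup the PR introduces.

```scala
// Toy plan nodes standing in for Spark's LogicalPlan (illustration only).
sealed trait Plan
case class Aggregate(name: String) extends Plan
case class Filter(child: Plan) extends Plan      // produced by a HAVING clause
case class Project(child: Plan) extends Plan
case class Relation(name: String) extends Plan

// Walk through any Filter/Project sitting between the Sort and its Aggregate,
// as the rewritten analyzer rule does.
def relatedAggregate(plan: Plan): Option[Aggregate] = plan match {
  case a: Aggregate   => Some(a)
  case Filter(child)  => relatedAggregate(child)
  case Project(child) => relatedAggregate(child)
  case _              => None
}

// SELECT ... GROUP BY ... HAVING ... ORDER BY ... yields roughly
// Sort -> Project -> Filter -> Aggregate, so the Sort's child is:
val sortChild: Plan = Project(Filter(Aggregate("agg")))
```

With this shape, matching only on `Sort(_, _, aggregate: Aggregate)` would miss the aggregate, which is why the rule now searches through the intermediate nodes.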
[GitHub] [spark] yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder
yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder URL: https://github.com/apache/spark/pull/24983#discussion_r311324313 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ## @@ -470,3 +397,427 @@ object JoinReorderDPFilters extends PredicateHelper { * extended with the set of connected/unconnected plans. */ case class JoinGraphInfo (starJoins: Set[Int], nonStarJoins: Set[Int]) + +/** + * Reorder the joins using a genetic algorithm. The algorithm treats the reorder problem + * as a traveling salesman problem, and uses a genetic algorithm to give an optimized solution. + * + * The implementation references geqo in PostgreSQL, which was contributed by Darrell Whitley: + * https://www.postgresql.org/docs/9.1/geqo-pg-intro.html + * + * For more info about genetic algorithms and the edge recombination crossover, please see: + * "A Genetic Algorithm Tutorial, Darrell Whitley" + * https://link.springer.com/article/10.1007/BF00175354 + * and "Scheduling Problems and Traveling Salesmen: The Genetic Edge Recombination Operator, + * Darrell Whitley et al." https://dl.acm.org/citation.cfm?id=657238 + * respectively. + */ +object JoinReorderGA extends PredicateHelper with Logging { + + def search( + conf: SQLConf, + items: Seq[LogicalPlan], + conditions: Set[Expression], + output: Seq[Attribute]): Option[LogicalPlan] = { + +val startTime = System.nanoTime() + +val itemsWithIndex = items.zipWithIndex.map { + case (plan, id) => id -> JoinPlan(Set(id), plan, Set.empty, Cost(0, 0)) +}.toMap + +val topOutputSet = AttributeSet(output) + +val pop = Population(conf, itemsWithIndex, conditions, topOutputSet).evolve + +val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000) +logInfo(s"Join reordering finished. 
Duration: $durationInMs ms, number of items: " + +s"${items.length}, number of plans in memo: ${ pop.chromos.size}") + +assert(pop.chromos.head.basicPlans.size == items.length) +pop.chromos.head.integratedPlan match { + case Some(joinPlan) => joinPlan.plan match { +case p @ Project(projectList, _: Join) if projectList != output => + assert(topOutputSet == p.outputSet) + // Keep the same order of final output attributes. + Some(p.copy(projectList = output)) +case finalPlan if !sameOutput(finalPlan, output) => + Some(Project(output, finalPlan)) +case finalPlan => + Some(finalPlan) + } + case _ => None +} + } +} + +/** + * A pair of parent individuals can breed a child with a certain crossover process. + * With crossover, the child can inherit genes from its parents, and these gene snippets + * finally compose a new [[Chromosome]]. + */ +@DeveloperApi +trait Crossover { + + /** + * Generate a new [[Chromosome]] from the given parent [[Chromosome]]s, + * with this crossover algorithm. + */ + def newChromo(father: Chromosome, mother: Chromosome) : Chromosome +} + +case class EdgeTable(table: Map[JoinPlan, Seq[JoinPlan]]) + +/** + * This class implements the Genetic Edge Recombination algorithm. + * For more information about the Genetic Edge Recombination, + * see "Scheduling Problems and Traveling Salesmen: The Genetic Edge + * Recombination Operator" by Darrell Whitley et al. + * https://dl.acm.org/citation.cfm?id=657238 + */ +object EdgeRecombination extends Crossover { Review comment: Could you give a one-or-two-sentence definition of EdgeRecombination, and a simple example here of how an edge map is constructed and how a new path is generated from two paths? I feel like the example in the paper "The Traveling Salesman and Sequence Scheduling: Quality Solutions Using Genetic Edge Recombination" is fairly straightforward.
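For readers following this review: edge recombination builds, for each item, the set of its neighbors in either parent tour, then grows a child tour by repeatedly moving to the unvisited neighbor whose own edge list is shortest. A minimal standalone sketch of the idea, with integers standing in for join items — this is not the PR's implementation:

```scala
import scala.collection.mutable

object EdgeRecombinationSketch {
  // Edge table: for each node, its neighbors in either parent tour (tours are cyclic).
  def edgeTable(a: Vector[Int], b: Vector[Int]): Map[Int, Set[Int]] = {
    def edges(t: Vector[Int]): Seq[(Int, Int)] = {
      val n = t.length
      t.indices.flatMap(i => Seq(t(i) -> t((i + 1) % n), t(i) -> t((i - 1 + n) % n)))
    }
    (edges(a) ++ edges(b)).groupBy(_._1).map { case (k, v) => k -> v.map(_._2).toSet }
  }

  // Child tour: start at the first node of parent `a`, then repeatedly pick the
  // unvisited neighbor with the fewest remaining edges (Whitley's heuristic).
  def child(a: Vector[Int], b: Vector[Int]): Vector[Int] = {
    val table = edgeTable(a, b).map { case (k, v) => k -> mutable.Set(v.toSeq: _*) }
    val tour = mutable.ArrayBuffer(a.head)
    val unvisited = mutable.Set(a: _*) -= a.head
    while (unvisited.nonEmpty) {
      val cur = tour.last
      table.values.foreach(_ -= cur)                 // cur is used up: drop it everywhere
      val candidates = table(cur).toSeq.filter(unvisited)
      val next =
        if (candidates.nonEmpty) candidates.minBy(table(_).size)
        else unvisited.head                          // dead end: restart from any unvisited node
      tour += next
      unvisited -= next
    }
    tour.toVector
  }
}
```

For parents `Vector(0, 1, 2, 3)` and `Vector(0, 2, 1, 3)`, the child is always a permutation of the four nodes whose consecutive pairs come (where possible) from edges of one of the parents — which is exactly the property that makes the operator suitable for inheriting good join orders.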
[GitHub] [spark] maropu commented on issue #25347: [SPARK-28610][SQL] Allow having a decimal buffer for long sum
maropu commented on issue #25347: [SPARK-28610][SQL] Allow having a decimal buffer for long sum URL: https://github.com/apache/spark/pull/25347#issuecomment-518889291 Yea, I think so. If we add a new flag, IMO it'd be better for Spark to have the same behaviour as the other systems.
[GitHub] [spark] jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs
jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs URL: https://github.com/apache/spark/pull/25330#discussion_r311311567 ## File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2DataFrameSuite.scala ## @@ -141,4 +141,66 @@ class DataSourceV2DataFrameSuite extends QueryTest with SharedSQLContext with Be } } } + + testQuietly("saveAsTable: with defined catalog and table doesn't exist") { Review comment: This and many other tests no longer need `with defined catalog` in the test name, since the v2 session tests are split into a separate file.
[GitHub] [spark] jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs
jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs URL: https://github.com/apache/spark/pull/25330#discussion_r311311184 ## File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2DataFrameSuite.scala ## @@ -19,15 +19,15 @@ package org.apache.spark.sql.sources.v2 import org.scalatest.BeforeAndAfter -import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.{QueryTest, Row} import org.apache.spark.sql.internal.SQLConf.{PARTITION_OVERWRITE_MODE, PartitionOverwriteMode} import org.apache.spark.sql.test.SharedSQLContext class DataSourceV2DataFrameSuite extends QueryTest with SharedSQLContext with BeforeAndAfter { import testImplicits._ before { -spark.conf.set("spark.sql.catalog.testcat", classOf[TestInMemoryTableCatalog].getName) +spark.conf.set(s"spark.sql.catalog.testcat", classOf[TestInMemoryTableCatalog].getName) Review comment: Is the `s` interpolator needed here?
[GitHub] [spark] jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs
jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs URL: https://github.com/apache/spark/pull/25330#discussion_r311324593 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ## @@ -648,8 +648,11 @@ class Analyzer( if catalog.isTemporaryTable(ident) => u // temporary views take precedence over catalog table names - case u @ UnresolvedRelation(CatalogObjectIdentifier(Some(catalogPlugin), ident)) => -loadTable(catalogPlugin, ident).map(DataSourceV2Relation.create).getOrElse(u) + case u @ UnresolvedRelation(CatalogObjectIdentifier(maybeCatalog, ident)) => +maybeCatalog.orElse(sessionCatalog) Review comment: This changes the SELECT path, so more unit tests for table relation are needed, e.g.: ``` test("Relation: basic") { val t1 = "tbl" withTable(t1) { sql(s"CREATE TABLE $t1 USING $v2Format AS SELECT id, data FROM source") checkAnswer(spark.table(t1), spark.table("source")) checkAnswer(sql(s"TABLE $t1"), spark.table("source")) checkAnswer(sql(s"SELECT * FROM $t1"), spark.table("source")) } } ``` IMHO this v2 session catalog change and the related tests in `DataSourceV2SessionDataFrameSuite` can be split into a separate PR with additional unit tests for SQL SELECT and spark.table(). Another PR could update `ResolveInsertInto` to support the v2 session catalog in SQL INSERT INTO and DFW.insertInto.
[GitHub] [spark] jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs
jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs URL: https://github.com/apache/spark/pull/25330#discussion_r311310995 ## File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceDFWriterV2SessionCatalogSuite.scala ## @@ -0,0 +1,146 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2 + +import java.util +import java.util.concurrent.ConcurrentHashMap + +import scala.collection.JavaConverters._ + +import org.scalatest.BeforeAndAfter + +import org.apache.spark.sql.{QueryTest, Row} +import org.apache.spark.sql.catalog.v2.Identifier +import org.apache.spark.sql.catalog.v2.expressions.Transform +import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException +import org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.test.SharedSQLContext +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.util.CaseInsensitiveStringMap + +class DataSourceDFWriterV2SessionCatalogSuite Review comment: How about `DataSourceV2SessionDataFrameSuite`? 
It matches the name `DataSourceV2DataFrameSuite` better.
[GitHub] [spark] maropu commented on a change in pull request #25331: [SPARK-27768][SQL] Infinity, -Infinity, NaN should be recognized in a case insensitive manner.
maropu commented on a change in pull request #25331: [SPARK-27768][SQL] Infinity, -Infinity, NaN should be recognized in a case insensitive manner. URL: https://github.com/apache/spark/pull/25331#discussion_r311325089 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala ## @@ -562,8 +593,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String // FloatConverter private[this] def castToFloat(from: DataType): Any => Any = from match { case StringType => - buildCast[UTF8String](_, s => try s.toString.toFloat catch { -case _: NumberFormatException => null + buildCast[UTF8String](_, s => { +val floatStr = s.toString +try floatStr.toFloat catch { + case _: NumberFormatException => Review comment: Yea, I mean the performance for known strings as @viirya said.
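On the performance point: one way to keep the case-insensitive handling off the hot path is to try the normal parse first and only examine the special literals when it fails, so ordinary numeric strings never pay for lowercasing. A hedged sketch of the idea — this is not the PR's actual `Cast` code; `None` stands in for the `null` that `Cast` returns:

```scala
// Illustrative only: recognize Infinity/-Infinity/NaN case-insensitively.
def castToFloatSketch(s: String): Option[Float] = {
  try Some(s.toFloat)                       // fast path: plain numeric strings
  catch {
    case _: NumberFormatException =>
      s.trim.toLowerCase match {            // slow path: special literals only
        case "inf" | "+inf" | "infinity" | "+infinity" => Some(Float.PositiveInfinity)
        case "-inf" | "-infinity"                      => Some(Float.NegativeInfinity)
        case "nan"                                     => Some(Float.NaN)
        case _                                         => None
      }
  }
}
```

`"Infinity"` and `"NaN"` in their exact Java spellings already succeed on the fast path, so only oddly-cased variants ever reach the `toLowerCase` branch.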
[GitHub] [spark] HyukjinKwon commented on a change in pull request #25360: [SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-by.sql'
HyukjinKwon commented on a change in pull request #25360: [SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-by.sql' URL: https://github.com/apache/spark/pull/25360#discussion_r311327632 ## File path: sql/core/src/test/resources/sql-tests/inputs/udf/udf-group-by.sql ## @@ -20,29 +20,25 @@ SELECT 'foo', COUNT(udf(a)) FROM testData GROUP BY 1; SELECT 'foo' FROM testData WHERE a = 0 GROUP BY udf(1); -- Aggregate grouped by literals (hash aggregate). -SELECT 'foo', udf(APPROX_COUNT_DISTINCT(udf(a))) FROM testData WHERE a = 0 GROUP BY 1; +SELECT 'foo', udf(APPROX_COUNT_DISTINCT(udf(a))) FROM testData WHERE a = 0 GROUP BY udf(1); -- Aggregate grouped by literals (sort aggregate). -SELECT 'foo', MAX(STRUCT(udf(a))) FROM testData WHERE a = 0 GROUP BY 1; +SELECT 'foo', MAX(STRUCT(udf(a))) FROM testData WHERE a = 0 GROUP BY udf(1); -- Aggregate with complex GroupBy expressions. SELECT udf(a + b), udf(COUNT(b)) FROM testData GROUP BY a + b; SELECT udf(a + 2), udf(COUNT(b)) FROM testData GROUP BY a + 1; - --- [SPARK-28445] Inconsistency between Scala and Python/Panda udfs when groupby with udf() is used --- The following query will make Scala UDF work, but Python and Pandas udfs will fail with an AnalysisException. --- The query should be added after SPARK-28445. --- SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1); +SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1); Review comment: Yes, I missed this diff. Could we match?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #25310: [SPARK-28578][INFRA] Improve Github pull request template
HyukjinKwon commented on a change in pull request #25310: [SPARK-28578][INFRA] Improve Github pull request template URL: https://github.com/apache/spark/pull/25310#discussion_r311328281 ## File path: .github/PULL_REQUEST_TEMPLATE ## @@ -1,10 +1,40 @@ -## What changes were proposed in this pull request? + -(Please fill in changes proposed in this fix) +### What changes were proposed in this pull request? + -## How was this patch tested? -(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) -(If this patch involves UI changes, please attach a screenshot; otherwise, remove this) +### Why are the changes needed? + -Please review https://spark.apache.org/contributing.html before opening a pull request. +### Does this PR introduce any user-facing change? Review comment: Basically yea. For instance, if we meet a breaking change after upgrading to a higher Spark version, from time to time we have to look through the commit history. I think it's better to clarify it if that can be identified early.
[GitHub] [spark] SparkQA commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
SparkQA commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming URL: https://github.com/apache/spark/pull/22282#issuecomment-518894131 **[Test build #108735 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108735/testReport)** for PR 22282 at commit [`23dc81c`](https://github.com/apache/spark/commit/23dc81c7f5d479d92746bf19aeff9f69d6a32424).
[GitHub] [spark] zhengruifeng commented on a change in pull request #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations
zhengruifeng commented on a change in pull request #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations URL: https://github.com/apache/spark/pull/25324#discussion_r311330089 ## File path: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ## @@ -100,19 +100,26 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { val outputSchema = transformSchema(dataset.schema) -val hashUDF = udf { terms: Seq[_] => - val numOfFeatures = $(numFeatures) - val isBinary = $(binary) - val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0) - terms.foreach { term => -val i = indexOf(term) -if (isBinary) { +val localNumFeatures = $(numFeatures) + +val hashUDF = if ($(binary)) { Review comment: ok, I misunderstood the comments
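The refactor under review chooses between a binary UDF and a count UDF once per `transform` call instead of branching on `$(binary)` per term. The two modes can be sketched with a plain function — the modulo hash below is a stand-in for illustration, not HashingTF's actual Murmur3-based `indexOf`:

```scala
// Illustration of binary vs. count term frequencies; the hash is a stand-in.
def termFrequencies(terms: Seq[String], numFeatures: Int, binary: Boolean): Map[Int, Double] = {
  val freq = scala.collection.mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
  terms.foreach { term =>
    val i = math.floorMod(term.hashCode, numFeatures)
    // Binary mode records presence (1.0); the default mode accumulates counts.
    freq(i) = if (binary) 1.0 else freq(i) + 1.0
  }
  freq.toMap
}
```

Capturing `$(numFeatures)` into a local (as the diff's `localNumFeatures` does) also keeps the closure from serializing the whole transformer into the UDF.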
[GitHub] [spark] gatorsmile commented on a change in pull request #25328: [SPARK-28595][SQL] explain should not trigger partition listing
gatorsmile commented on a change in pull request #25328: [SPARK-28595][SQL] explain should not trigger partition listing URL: https://github.com/apache/spark/pull/25328#discussion_r311330990 ## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveExplainSuite.scala ## @@ -207,4 +206,14 @@ class HiveExplainSuite extends QueryTest with SQLTestUtils with TestHiveSingleto } } } + + test("SPARK-28595: explain should not trigger partition listing") { +HiveCatalogMetrics.reset() +withTable("t") { + sql("CREATE TABLE t USING json PARTITIONED BY (j) AS SELECT 1 i, 2 j") + assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0) + spark.table("t").explain() + assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0) Review comment: Could you add a test case that returns non-zero when `spark.sql.legacy.bucketedTableScan.outputOrdering` is set to true?
[GitHub] [spark] gatorsmile commented on issue #25328: [SPARK-28595][SQL] explain should not trigger partition listing
gatorsmile commented on issue #25328: [SPARK-28595][SQL] explain should not trigger partition listing URL: https://github.com/apache/spark/pull/25328#issuecomment-518896033 LGTM except one comment.
[GitHub] [spark] AmplabJenkins commented on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations
AmplabJenkins commented on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations URL: https://github.com/apache/spark/pull/25324#issuecomment-518896481 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13817/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations
AmplabJenkins commented on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations URL: https://github.com/apache/spark/pull/25324#issuecomment-518896478 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations
SparkQA commented on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations URL: https://github.com/apache/spark/pull/25324#issuecomment-518896810 **[Test build #108736 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108736/testReport)** for PR 25324 at commit [`0d5ca8c`](https://github.com/apache/spark/commit/0d5ca8cbb992c21d16465088ae8b2bb8bd5cb8f4).
[GitHub] [spark] zhengruifeng commented on a change in pull request #25349: [SPARK-28538][UI][WIP] Document SQL page
zhengruifeng commented on a change in pull request #25349: [SPARK-28538][UI][WIP] Document SQL page URL: https://github.com/apache/spark/pull/25349#discussion_r311332004 ## File path: docs/monitoring.md ## @@ -40,6 +40,100 @@ To view the web UI after the fact, set `spark.eventLog.enabled` to true before s application. This configures Spark to log Spark events that encode the information displayed in the UI to persisted storage. +## Web UI Tabs +The web UI provides an overview of the Spark cluster and is composed of the following tabs: + +### Jobs Tab +The Jobs tab displays a summary page of all jobs in the Spark application and a detailed page +for each job. The summary page shows high-level information, such as the status, duration, and +progress of all jobs and the overall event timeline. When you click on a job on the summary +page, you see the detailed page for that job. The detailed page further shows the event timeline, +DAG visualization, and all stages of the job. + +### Stages Tab +The Stages tab displays a summary page that shows the current state of all stages of all jobs in +the Spark application, and, when you click on a stage, a detailed page for that stage. The details +page shows the event timeline, DAG visualization, and all tasks for the stage. + +### Storage Tab +The Storage tab displays the persisted RDDs, if any, in the application. The summary page shows +the storage levels, sizes and partitions of all RDDs, and the detailed page shows the sizes and +the executors used for all partitions in an RDD. + +### Environment Tab +The Environment tab displays the values for the different environment and configuration variables, +including JVM, Spark, and system properties. + +### Executors Tab +The Executors tab displays summary information about the executors that were created for the +application, including memory and disk usage and task and shuffle information. The Storage Memory +column shows the amount of memory used and reserved for caching data. 
+ +### SQL Tab +If the application executes Spark SQL queries, the SQL tab displays information, such as the duration, +jobs, and physical and logical plans for the queries. Here we include a basic example to illustrate +this tab: +{% highlight scala %} +scala> val df = Seq((1, "andy"), (2, "bob"), (2, "andy")).toDF("count", "name") +df: org.apache.spark.sql.DataFrame = [count: int, name: string] + +scala> df.count +res0: Long = 3 + +scala> df.createGlobalTempView("df") + +scala> spark.sql("select name,sum(count) from global_temp.df group by name").show +++--+ +|name|sum(count)| +++--+ +|andy| 3| +| bob| 2| +++--+ +{% endhighlight %} + + + + + + +Now the above three dataframe/SQL operators are shown in the list. If we click the +'show at \: 24' link of the last query, we will see the DAG of the job. + + + + + + +We can see the detailed information of each stage. The first block 'WholeStageCodegen' +compiles multiple operators ('LocalTableScan' and 'HashAggregate') together into a single Java +function to improve performance, and metrics like number of rows and spill size are listed in +the block. The second block 'Exchange' shows the metrics on the shuffle exchange, including +number of written shuffle records, total data size, etc. + + + +
[GitHub] [spark] HyukjinKwon commented on issue #25215: [SPARK-28445][SQL][PYTHON] Fix error when PythonUDF is used in both group by and aggregate expression
HyukjinKwon commented on issue #25215: [SPARK-28445][SQL][PYTHON] Fix error when PythonUDF is used in both group by and aggregate expression URL: https://github.com/apache/spark/pull/25215#issuecomment-518897609 Thanks, @viirya. @shivusondur Can you comment the test out with those JIRA numbers? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
SparkQA commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming URL: https://github.com/apache/spark/pull/22282#issuecomment-518898872 **[Test build #108735 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108735/testReport)** for PR 22282 at commit [`23dc81c`](https://github.com/apache/spark/commit/23dc81c7f5d479d92746bf19aeff9f69d6a32424). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] maropu commented on a change in pull request #24881: [SPARK-23160][SQL][TEST] Port window.sql
maropu commented on a change in pull request #24881: [SPARK-23160][SQL][TEST] Port window.sql URL: https://github.com/apache/spark/pull/24881#discussion_r311333666 ## File path: sql/core/src/test/resources/sql-tests/inputs/pgSQL/window.sql ## @@ -0,0 +1,1175 @@
+-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+--
+-- Window Functions Testing
+-- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/window.sql
+
+CREATE TEMPORARY VIEW tenk2 AS SELECT * FROM tenk1;
+
+CREATE TABLE empsalary (
+depname string,
+empno integer,
+salary int,
+enroll_date date
+) USING parquet;
+
+INSERT INTO empsalary VALUES
+('develop', 10, 5200, '2007-08-01'),
+('sales', 1, 5000, '2006-10-01'),
+('personnel', 5, 3500, '2007-12-10'),
+('sales', 4, 4800, '2007-08-08'),
+('personnel', 2, 3900, '2006-12-23'),
+('develop', 7, 4200, '2008-01-01'),
+('develop', 9, 4500, '2008-01-01'),
+('sales', 3, 4800, '2007-08-01'),
+('develop', 8, 6000, '2006-10-01'),
+('develop', 11, 5200, '2007-08-15');
+
+SELECT depname, empno, salary, sum(salary) OVER (PARTITION BY depname) FROM empsalary ORDER BY depname, salary;
+
+SELECT depname, empno, salary, rank() OVER (PARTITION BY depname ORDER BY salary) FROM empsalary;
+
+-- with GROUP BY
+SELECT four, ten, SUM(SUM(four)) OVER (PARTITION BY four), AVG(ten) FROM tenk1
+GROUP BY four, ten ORDER BY four, ten;
+
+SELECT depname, empno, salary, sum(salary) OVER w FROM empsalary WINDOW w AS (PARTITION BY depname);
+
+-- [SPARK-28064] Order by does not accept a call to rank()
+-- SELECT depname, empno, salary, rank() OVER w FROM empsalary WINDOW w AS (PARTITION BY depname ORDER BY salary) ORDER BY rank() OVER w;
+
+-- empty window specification
+SELECT COUNT(*) OVER () FROM tenk1 WHERE unique2 < 10;
+
+SELECT COUNT(*) OVER w FROM tenk1 WHERE unique2 < 10 WINDOW w AS ();
+
+-- no window operation
+SELECT four FROM tenk1 WHERE FALSE WINDOW w AS (PARTITION BY ten);
+
+-- cumulative aggregate
+SELECT sum(four) OVER (PARTITION BY ten ORDER BY unique2) AS sum_1, ten, four FROM tenk1 WHERE unique2 < 10;
+
+SELECT row_number() OVER (ORDER BY unique2) FROM tenk1 WHERE unique2 < 10;
+
+SELECT rank() OVER (PARTITION BY four ORDER BY ten) AS rank_1, ten, four FROM tenk1 WHERE unique2 < 10;
+
+SELECT dense_rank() OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 WHERE unique2 < 10;
+
+SELECT percent_rank() OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 WHERE unique2 < 10;
+
+SELECT cume_dist() OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 WHERE unique2 < 10;
+
+SELECT ntile(3) OVER (ORDER BY ten, four), ten, four FROM tenk1 WHERE unique2 < 10;
+
+-- [SPARK-28065] ntile does not accept NULL as input
+-- SELECT ntile(NULL) OVER (ORDER BY ten, four), ten, four FROM tenk1 LIMIT 2;
+
+SELECT lag(ten) OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 WHERE unique2 < 10;
+
+-- [SPARK-28068] `lag` second argument must be a literal in Spark
+-- SELECT lag(ten, four) OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 WHERE unique2 < 10;
+-- SELECT lag(ten, four, 0) OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 WHERE unique2 < 10;
+
+SELECT lead(ten) OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 WHERE unique2 < 10;
+
+SELECT lead(ten * 2, 1) OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 WHERE unique2 < 10;
+
+SELECT lead(ten * 2, 1, -1) OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 WHERE unique2 < 10;
+
+SELECT first(ten) OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 WHERE unique2 < 10;
+
+-- last returns the last row of the frame, which is CURRENT ROW in ORDER BY window.
+SELECT last(four) OVER (ORDER BY ten), ten, four FROM tenk1 WHERE unique2 < 10;
+
+SELECT last(ten) OVER (PARTITION BY four), ten, four FROM
+(SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s
+ORDER BY four, ten;
+
+-- [SPARK-27951] ANSI SQL: NTH_VALUE function
+-- SELECT nth_value(ten, four + 1) OVER (PARTITION BY four), ten, four
+-- FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s;
+
+SELECT ten, two, sum(hundred) AS gsum, sum(sum(hundred)) OVER (PARTITION BY two ORDER BY ten) AS wsum
+FROM tenk1 GROUP BY ten, two;
+
+SELECT count(*) OVER (PARTITION BY four), four FROM (SELECT * FROM tenk1 WHERE two = 1)s WHERE unique2 < 10;
+
+SELECT (count(*) OVER (PARTITION BY four ORDER BY ten) +
+  sum(hundred) OVER (PARTITION BY four ORDER BY ten)) AS cntsum
+  FROM tenk1 WHERE unique2 < 10;
+
+-- opexpr with different windows evaluation.
+SELECT * FROM(
+  SELECT count(*) OVER (PARTITION BY four ORDER BY ten) +
+    sum(hundred) OVER (PARTITION BY two ORDER BY ten) AS total,
+  count(*) OVER (PARTITION BY four ORDER BY ten) AS fourcount,
+  sum(hundred) OVER (PARTITION BY two ORDER BY ten) AS twosum
+FROM tenk1
+)sub WHERE total <> fourcount + twosum;
+
+SELECT avg(four) OVER (PARTITION BY four ORDER
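As an aside to the ported tests above, the semantics that `row_number()`, `rank()`, and `dense_rank()` are expected to produce can be sketched in a few lines of plain Python. This is illustrative only, not Spark's implementation; the sample salaries are the `empsalary` rows for the 'develop' department from the test data above, ordered ascending as in `rank() OVER (PARTITION BY depname ORDER BY salary)`:

```python
def rank_functions(values):
    """Given values already sorted by the ORDER BY key, return a
    (row_number, rank, dense_rank) tuple per value, as in SQL."""
    out = []
    rank = dense = 0
    prev = object()  # sentinel that never equals a real value
    for i, v in enumerate(values, start=1):
        if v != prev:
            rank = i      # rank skips past ties: 1, 2, 3, 3, 5, ...
            dense += 1    # dense_rank leaves no gaps: 1, 2, 3, 3, 4, ...
            prev = v
        out.append((i, rank, dense))
    return out

# 'develop' salaries from empsalary, sorted ascending (note the 5200 tie)
develop_salaries = [4200, 4500, 5200, 5200, 6000]
print(rank_functions(develop_salaries))
# → [(1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 3, 3), (5, 5, 4)]
```

The tied 5200 salaries show the difference: `rank` gives both row 3 and then jumps to 5, while `dense_rank` gives both 3 and continues with 4.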
[GitHub] [spark] SparkQA removed a comment on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
SparkQA removed a comment on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming URL: https://github.com/apache/spark/pull/22282#issuecomment-518894131 **[Test build #108735 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108735/testReport)** for PR 22282 at commit [`23dc81c`](https://github.com/apache/spark/commit/23dc81c7f5d479d92746bf19aeff9f69d6a32424).
[GitHub] [spark] AmplabJenkins commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
AmplabJenkins commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming URL: https://github.com/apache/spark/pull/22282#issuecomment-518898951 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108735/
[GitHub] [spark] SparkQA commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
SparkQA commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25371#issuecomment-518898909 **[Test build #108732 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108732/testReport)** for PR 25371 at commit [`b7aad47`](https://github.com/apache/spark/commit/b7aad47c0121501343729c7d73a75df74722267e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
AmplabJenkins commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming URL: https://github.com/apache/spark/pull/22282#issuecomment-518898945 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
AmplabJenkins removed a comment on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming URL: https://github.com/apache/spark/pull/22282#issuecomment-518898945 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
AmplabJenkins removed a comment on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming URL: https://github.com/apache/spark/pull/22282#issuecomment-518898951 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108735/
[GitHub] [spark] SparkQA removed a comment on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base
SparkQA removed a comment on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base URL: https://github.com/apache/spark/pull/25371#issuecomment-518856479 **[Test build #108732 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108732/testReport)** for PR 25371 at commit [`b7aad47`](https://github.com/apache/spark/commit/b7aad47c0121501343729c7d73a75df74722267e).