[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22870 **[Test build #98184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98184/testReport)** for PR 22870 at commit [`23d31bb`](https://github.com/apache/spark/commit/23d31bb7c55b228d748c8911e2c730db9b59552e).
[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22870 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4581/
[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22870 Merged build finished. Test PASSed.
[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22514#discussion_r228813738

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
@@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
   override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
--- End diff --

I think the table metadata created by data source CTAS and Hive CTAS are different?
[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22870 retest this please
[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22871 Thanks, @dongjoon-hyun and @gatorsmile
[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22870 Merged build finished. Test FAILed.
[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22870 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98180/
[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22870 **[Test build #98180 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98180/testReport)** for PR 22870 at commit [`23d31bb`](https://github.com/apache/spark/commit/23d31bb7c55b228d748c8911e2c730db9b59552e).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22813: [SPARK-25818][CORE] WorkDirCleanup should only remove th...
Github user ouyangxiaochen commented on the issue: https://github.com/apache/spark/pull/22813 @dongjoon-hyun Yes, you can think of it that way. I want to solve this problem on the Spark platform to reduce the risk of misoperations by operations and maintenance engineers. `WorkDirCleanUp` is only responsible for cleaning up the directories that it generates.
[GitHub] spark issue #22813: [SPARK-25818][CORE] WorkDirCleanup should only remove th...
Github user ouyangxiaochen commented on the issue: https://github.com/apache/spark/pull/22813 As far as I know, when a Spark program is submitted to the cluster, a directory is created under `SPARK_WORK_DIR`. The directory name consists of the application name, a timestamp, and a five-digit serial number. `WorkDirCleanUp` should only delete expired application directories. @srowen Could you tell me whether any other types of directories or files are created under `SPARK_WORK_DIR`, in addition to the application directories? Thanks!
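For illustration, a minimal sketch of the narrower cleanup described above. The directory-name pattern is an assumption for illustration only and is not taken from the Spark source:

```scala
import java.io.File

// Hypothetical filter: only treat entries matching an application-directory
// naming scheme (app-<timestamp>-<serial>) as candidates for cleanup.
val appDirPattern = "app-\\d+-\\d+".r

def expiredAppDirs(workDir: File, cutoffMs: Long): Seq[File] =
  Option(workDir.listFiles()).getOrElse(Array.empty[File]).toSeq
    .filter(d => d.isDirectory && appDirPattern.pattern.matcher(d.getName).matches())
    .filter(_.lastModified() < cutoffMs) // keep anything newer than the cutoff
```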
[GitHub] spark pull request #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22871
[GitHub] spark issue #22530: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand's input...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22530 **[Test build #98183 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98183/testReport)** for PR 22530 at commit [`9b1cc1d`](https://github.com/apache/spark/commit/9b1cc1d826cb89f0ed6021ae6c8cddc978c0173e).
[GitHub] spark issue #22861: [SPARK-25663][SPARK-25661][SQL][TEST] Refactor BuiltInDa...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22861 Personally I am against accessing the main args in such a way; it looks a bit ugly. But if we have to move everything to `BenchmarkBase`, then maybe this is the way.
[GitHub] spark pull request #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22865
[GitHub] spark issue #22530: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand's input...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22530 Merged build finished. Test PASSed.
[GitHub] spark issue #22530: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand's input...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22530 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4580/
[GitHub] spark issue #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFilter.en...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22865 thanks, merging to master/2.4/2.3!
[GitHub] spark issue #22530: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand's input...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22530 retest this please
[GitHub] spark issue #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFilter.en...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22865 Merged build finished. Test PASSed.
[GitHub] spark issue #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFilter.en...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22865 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98176/
[GitHub] spark issue #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFilter.en...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22865 **[Test build #98176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98176/testReport)** for PR 22865 at commit [`de22015`](https://github.com/apache/spark/commit/de22015a9b610011ff5616d398b999d39d21eeba).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22326 late LGTM
[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22847 **[Test build #98182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98182/testReport)** for PR 22847 at commit [`b0ce2ca`](https://github.com/apache/spark/commit/b0ce2cae731620a0ce7417b2b46c24cffb023059).
[GitHub] spark issue #22721: [SPARK-25403][SQL] Refreshes the table after inserting t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22721 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4579/
[GitHub] spark issue #22721: [SPARK-25403][SQL] Refreshes the table after inserting t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22721 Merged build finished. Test PASSed.
[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22870 Merged build finished. Test PASSed.
[GitHub] spark issue #22721: [SPARK-25403][SQL] Refreshes the table after inserting t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22721 **[Test build #98181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98181/testReport)** for PR 22721 at commit [`6c8a73f`](https://github.com/apache/spark/commit/6c8a73f0fe74f618b429dee23869a00e706b125d).
[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22870 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4578/
[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22870 **[Test build #98180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98180/testReport)** for PR 22870 at commit [`23d31bb`](https://github.com/apache/spark/commit/23d31bb7c55b228d748c8911e2c730db9b59552e).
[GitHub] spark issue #22721: [SPARK-25403][SQL] Refreshes the table after inserting t...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22721 Retest this please.
[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22830 Who introduced this? We should ask the person who introduced it whether it can be removed.
[GitHub] spark issue #22813: [SPARK-25818][CORE] WorkDirCleanup should only remove th...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22813 @ouyangxiaochen. Sorry, but the use case sounds like a misconfiguration.
[GitHub] spark pull request #22830: [SPARK-25838][ML] Remove formatVersion from Savea...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22830#discussion_r228800501

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -163,8 +163,6 @@ class LogisticRegressionModel @Since("1.3.0") (
       numFeatures, numClasses, weights, intercept, threshold)
   }

-  override protected def formatVersion: String = "1.0"
-
--- End diff --

This seems visible to the users. Although this occurs in Spark 3.0.0, I guess we had better go through a deprecation process first for this kind of removal. cc @srowen, @mengxr, @rxin.
[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22830 **[Test build #98179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98179/testReport)** for PR 22830 at commit [`e9d9e0e`](https://github.com/apache/spark/commit/e9d9e0eb6a7caf0b3fda27bfb3d03fa419e4487e).
[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22830 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4577/
[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22830 Merged build finished. Test PASSed.
[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22830 Retest this please.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98178/
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98178 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98178/testReport)** for PR 22784 at commit [`0effc85`](https://github.com/apache/spark/commit/0effc85ccfc831bcc4c469b4a4c1d8db26fab72e).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22851: [SPARK-25797][SQL][DOCS][BACKPORT-2.3] Add migration doc...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22851 Thank you, @seancxmao and @felixcheung. @seancxmao, please close this PR since it's merged now.
[GitHub] spark issue #22851: [SPARK-25797][SQL][DOCS][BACKPORT-2.3] Add migration doc...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22851 Merged to `branch-2.3`.
[GitHub] spark issue #22861: [SPARK-25663][SPARK-25661][SQL][TEST] Refactor BuiltInDa...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22861 At least, the following is worth a separate PR, because it's orthogonal to `Refactor ... to use main method`. ``` 1. Make mainArgs correctly set in BenchmarkBase. ``` One PR had better have one theme; putting different themes into one PR together is not a good practice. Please start with the minimal one. If the committer asks for an example, then add it later. That's better.
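For reference, a rough sketch of the "benchmark with a main method" pattern this thread refers to. The `runBenchmarkSuite(mainArgs)` shape is the one under discussion in this PR, so treat it as illustrative rather than the final API:

```scala
import org.apache.spark.benchmark.BenchmarkBase

// Assumed shape: runBenchmarkSuite receives the main args once the
// "Make mainArgs correctly set in BenchmarkBase" change lands.
object MyBenchmark extends BenchmarkBase {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    runBenchmark("example workload") {
      // benchmark body would go here
    }
  }
}
```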
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22514 I see. Thank you for the confirmation, @gatorsmile and @cloud-fan.
[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r228789484

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
     .intConf
     .createWithDefault(65535)

+  val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold")
+    .internal()
+    .doc("The threshold of source code length without comment of a single Java function by " +
+      "codegen to be split. When the generated Java function source code exceeds this threshold" +
+      ", it will be split into multiple small functions. We can't know how many bytecode will " +
--- End diff --

Not a big deal but I would avoid abbreviation in documentation. `can't` -> `cannot`
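For context, a minimal sketch of how the config under review would be set. The key is taken from the diff above, not from a released Spark version, and the value is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: the key lands as shown in this PR's diff.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.codegen.methodSplitThreshold", "1024") // split sooner than the default
  .getOrCreate()
```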
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98178 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98178/testReport)** for PR 22784 at commit [`0effc85`](https://github.com/apache/spark/commit/0effc85ccfc831bcc4c469b4a4c1d8db26fab72e).
[GitHub] spark pull request #22288: [SPARK-22148][SPARK-15815][Scheduler] Acquire new...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/22288#discussion_r228788350

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -415,9 +420,54 @@ private[spark] class TaskSchedulerImpl(
           launchedAnyTask |= launchedTaskAtCurrentMaxLocality
         } while (launchedTaskAtCurrentMaxLocality)
       }
+
       if (!launchedAnyTask) {
-        taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
+        taskSet.getCompletelyBlacklistedTaskIfAny(hostToExecutors) match {
+          case Some(taskIndex) => // Returns the taskIndex which was unschedulable
+
+            // If the taskSet is unschedulable we try to find an existing idle blacklisted
+            // executor. If we cannot find one, we abort immediately. Else we kill the idle
+            // executor and kick off an abortTimer which if it doesn't schedule a task within the
+            // the timeout will abort the taskSet if we were unable to schedule any task from the
+            // taskSet.
+            // Note 1: We keep track of schedulability on a per taskSet basis rather than on a per
+            // task basis.
+            // Note 2: The taskSet can still be aborted when there are more than one idle
+            // blacklisted executors and dynamic allocation is on. This can happen when a killed
+            // idle executor isn't replaced in time by ExecutorAllocationManager as it relies on
+            // pending tasks and doesn't kill executors on idle timeouts, resulting in the abort
+            // timer to expire and abort the taskSet.
+            executorIdToRunningTaskIds.find(x => !isExecutorBusy(x._1)) match {
+              case Some ((executorId, _)) =>
+                if (!unschedulableTaskSetToExpiryTime.contains(taskSet)) {
+                  blacklistTrackerOpt.foreach(blt => blt.killBlacklistedIdleExecutor(executorId))
+
+                  val timeout = conf.get(config.UNSCHEDULABLE_TASKSET_TIMEOUT) * 1000
+                  unschedulableTaskSetToExpiryTime(taskSet) = clock.getTimeMillis() + timeout
+                  logInfo(s"Waiting for $timeout ms for completely " +
+                    s"blacklisted task to be schedulable again before aborting $taskSet.")
+                  abortTimer.schedule(
+                    createUnschedulableTaskSetAbortTimer(taskSet, taskIndex), timeout)
+                }
+              case _ => // Abort Immediately
+                logInfo("Cannot schedule any task because of complete blacklisting. No idle" +
+                  s" executors can be found to kill. Aborting $taskSet." )
+                taskSet.abortSinceCompletelyBlacklisted(taskIndex)
+            }
+          case _ => // Do nothing if no tasks completely blacklisted.
+        }
+      } else {
+        // We want to defer killing any taskSets as long as we have a non blacklisted executor
+        // which can be used to schedule a task from any active taskSets. This ensures that the
+        // job can make progress and if we encounter a flawed taskSet it will eventually either
+        // fail or abort due to being completely blacklisted.
--- End diff --

So it's a bit worse than regular starvation from having competing tasksets: in this case you might actually have resources available on your cluster, but you never ask for them, because the executor allocation manager thinks you have enough based on the number of pending tasks. In any case, I agree this is a stretch, and overall it's an improvement, so I'm OK with it.
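For context, a minimal sketch of tuning the abort timeout discussed above. The user-facing key is an assumption inferred from the `config.UNSCHEDULABLE_TASKSET_TIMEOUT` constant in the diff and may differ in the final patch:

```scala
import org.apache.spark.SparkConf

// Hypothetical key name, inferred from the diff; the value is illustrative.
val conf = new SparkConf()
  .set("spark.scheduler.blacklist.unschedulableTaskSetTimeout", "120s")
```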
[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...
Github user shahidki31 commented on a diff in the pull request: https://github.com/apache/spark/pull/22784#discussion_r228788331

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -49,7 +50,16 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
       "Try reducing the parameter k for PCA, or reduce the input feature " +
       "vector dimension to make this tractable.")

-    val mat = new RowMatrix(sources)
+    val mat = if (numFeatures > 65535) {
+      val meanVector = Statistics.colStats(sources).mean
--- End diff --

Thank you @srowen. I have updated.
[GitHub] spark issue #22624: [SPARK-23781][CORE] Merge token renewer functionality in...
Github user squito commented on the issue: https://github.com/apache/spark/pull/22624 One minor comment about the `start` API, but otherwise LGTM from the YARN side. I would need a bit more time to look at the other cluster managers if nobody else can vouch for those.
[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r228788123

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
     .intConf
     .createWithDefault(65535)

+  val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold")
+    .internal()
+    .doc("The threshold of source code length without comment of a single Java function by " +
+      "codegen to be split. When the generated Java function source code exceeds this threshold" +
+      ", it will be split into multiple small functions. We can't know how many bytecode will " +
+      "be generated, so use the code length as metric. A function's bytecode should not go " +
+      "beyond 8KB, otherwise it will not be JITted; it also should not be too small, otherwise " +
+      "there will be many function calls.")
+    .intConf
--- End diff --

To be more accurate, I think I should add `When running on HotSpot, a function's bytecode should not go beyond 8KB...`.
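Background for the 8KB figure: HotSpot skips JIT compilation of any method larger than `HugeMethodLimit` (8000 bytecodes by default), a behavior governed by the `-XX:-DontCompileHugeMethods` flag, which is why the splitter targets a conservative source-length threshold rather than counting bytecodes directly.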
[GitHub] spark pull request #22624: [SPARK-23781][CORE] Merge token renewer functiona...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/22624#discussion_r228788072

--- Diff: core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala ---
@@ -17,76 +17,175 @@ package org.apache.spark.deploy.security

+import java.io.File
+import java.security.PrivilegedExceptionAction
+import java.util.concurrent.{ScheduledExecutorService, TimeUnit}
+import java.util.concurrent.atomic.AtomicReference
+
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.FileSystem
-import org.apache.hadoop.security.Credentials
+import org.apache.hadoop.security.{Credentials, UserGroupInformation}

 import org.apache.spark.SparkConf
+import org.apache.spark.deploy.SparkHadoopUtil
 import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config._
+import org.apache.spark.rpc.RpcEndpointRef
+import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages.UpdateDelegationTokens
+import org.apache.spark.ui.UIUtils
+import org.apache.spark.util.ThreadUtils

 /**
- * Manages all the registered HadoopDelegationTokenProviders and offer APIs for other modules to
- * obtain delegation tokens and their renewal time. By default [[HadoopFSDelegationTokenProvider]],
- * [[HiveDelegationTokenProvider]] and [[HBaseDelegationTokenProvider]] will be loaded in if not
- * explicitly disabled.
+ * Manager for delegation tokens in a Spark application.
+ *
+ * This manager has two modes of operation:
+ *
+ * 1. When configured with a principal and a keytab, it will make sure long-running apps can run
+ * without interruption while accessing secured services. It periodically logs in to the KDC with
+ * user-provided credentials, and contacts all the configured secure services to obtain delegation
+ * tokens to be distributed to the rest of the application.
+ *
+ * Because the Hadoop UGI API does not expose the TTL of the TGT, a configuration controls how often
+ * to check that a relogin is necessary. This is done reasonably often since the check is a no-op
+ * when the relogin is not yet needed. The check period can be overridden in the configuration.
 *
- * Also, each HadoopDelegationTokenProvider is controlled by
- * spark.security.credentials.{service}.enabled, and will not be loaded if this config is set to
- * false. For example, Hive's delegation token provider [[HiveDelegationTokenProvider]] can be
- * enabled/disabled by the configuration spark.security.credentials.hive.enabled.
+ * New delegation tokens are created once 75% of the renewal interval of the original tokens has
+ * elapsed. The new tokens are sent to the Spark driver endpoint once it's registered with the AM.
+ * The driver is tasked with distributing the tokens to other processes that might need them.
 *
- * @param sparkConf Spark configuration
- * @param hadoopConf Hadoop configuration
- * @param fileSystems Delegation tokens will be fetched for these Hadoop filesystems.
+ * 2. When operating without an explicit principal and keytab, token renewal will not be available.
+ * Starting the manager will distribute an initial set of delegation tokens to the provided Spark
+ * driver, but the app will not get new tokens when those expire.
+ *
+ * It can also be used just to create delegation tokens, by calling the `obtainDelegationTokens`
+ * method. This option does not require calling the `start` method, but leaves it up to the
+ * caller to distribute the tokens that were generated.
 */
 private[spark] class HadoopDelegationTokenManager(
-    sparkConf: SparkConf,
-    hadoopConf: Configuration,
-    fileSystems: Configuration => Set[FileSystem])
-  extends Logging {
+    protected val sparkConf: SparkConf,
+    protected val hadoopConf: Configuration) extends Logging {

   private val deprecatedProviderEnabledConfigs = List(
     "spark.yarn.security.tokens.%s.enabled",
     "spark.yarn.security.credentials.%s.enabled")
   private val providerEnabledConfig = "spark.security.credentials.%s.enabled"

-  // Maintain all the registered delegation token providers
-  private val delegationTokenProviders = getDelegationTokenProviders
+  private val principal = sparkConf.get(PRINCIPAL).orNull
+  private val keytab = sparkConf.get(KEYTAB).orNull
+
+  require((principal == null) == (keytab == null),
+    "Both principal and keytab must be defined, or neither.")
+  require(keytab == null || new File(keytab).isFile(), s"Cannot find keytab at $keytab.")
+
+  private val delegationTokenProviders = loadProviders()
   logDebug("Using the following builtin delegation token providers: " +
[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22871 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98177/
[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22871 **[Test build #98177 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98177/testReport)** for PR 22871 at commit [`34109bc`](https://github.com/apache/spark/commit/34109bc63dd12f733cf2052b048093799f9b0102).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22871 Merged build finished. Test PASSed.
[GitHub] spark pull request #22666: [SPARK-25672][SQL] schema_of_csv() - schema infer...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22666#discussion_r228787126

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala ---
@@ -19,14 +19,39 @@ package org.apache.spark.sql.catalyst.expressions

 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.util.ArrayBasedMapData
-import org.apache.spark.sql.types.{MapType, StringType, StructType}
+import org.apache.spark.sql.types.{DataType, MapType, StringType, StructType}
+import org.apache.spark.unsafe.types.UTF8String

 object ExprUtils {

-  def evalSchemaExpr(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => StructType.fromDDL(s.toString)
+  def evalSchemaExpr(exp: Expression): StructType = {
+    // Use `DataType.fromDDL` since the type string can be struct<...>.
+    val dataType = exp match {
+      case Literal(s, StringType) =>
+        DataType.fromDDL(s.toString)
+      case e @ SchemaOfCsv(_: Literal, _) =>
+        val ddlSchema = e.eval(EmptyRow).asInstanceOf[UTF8String]
+        DataType.fromDDL(ddlSchema.toString)
+      case e => throw new AnalysisException(
+        "Schema should be specified in DDL format as a string literal or output of " +
+          s"the schema_of_csv function instead of ${e.sql}")
+    }
+
+    if (!dataType.isInstanceOf[StructType]) {
+      throw new AnalysisException(
+        s"Schema should be struct type but got ${dataType.sql}.")
+    }
+    dataType.asInstanceOf[StructType]
+  }
+
+  def evalTypeExpr(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) => DataType.fromDDL(s.toString)
--- End diff --

Yup, that's what I initially thought: we should allow constant-foldable expressions as well, but I just decided to follow the initial intent of literal-only support. I also wasn't sure when we would need constant folding to construct a JSON example, because I suspected that's usually copied and pasted from, for instance, a file.
[GitHub] spark pull request #22666: [SPARK-25672][SQL] schema_of_csv() - schema infer...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22666#discussion_r228787018

--- Diff: sql/core/src/test/resources/sql-tests/inputs/csv-functions.sql ---
@@ -7,3 +7,11 @@ select from_csv('1', 'a InvalidType');
 select from_csv('1', 'a INT', named_struct('mode', 'PERMISSIVE'));
 select from_csv('1', 'a INT', map('mode', 1));
 select from_csv();
+-- infer schema of json literal
+select from_csv('1,abc', schema_of_csv('1,abc'));
+select schema_of_csv('1|abc', map('delimiter', '|'));
+select schema_of_csv(null);
+CREATE TEMPORARY VIEW csvTable(csvField, a) AS SELECT * FROM VALUES ('1,abc', 'a');
+SELECT schema_of_csv(csvField) FROM csvTable;
+-- Clean up
+DROP VIEW IF EXISTS csvTable;
--- End diff --

Actually we don't need to clean up temp views. The golden file test is run with a fresh session.
[GitHub] spark pull request #22666: [SPARK-25672][SQL] schema_of_csv() - schema infer...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22666#discussion_r228786427

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala ---
@@ -19,14 +19,39 @@ package org.apache.spark.sql.catalyst.expressions

 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.util.ArrayBasedMapData
-import org.apache.spark.sql.types.{MapType, StringType, StructType}
+import org.apache.spark.sql.types.{DataType, MapType, StringType, StructType}
+import org.apache.spark.unsafe.types.UTF8String

 object ExprUtils {

-  def evalSchemaExpr(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => StructType.fromDDL(s.toString)
+  def evalSchemaExpr(exp: Expression): StructType = {
+    // Use `DataType.fromDDL` since the type string can be struct<...>.
+    val dataType = exp match {
+      case Literal(s, StringType) =>
+        DataType.fromDDL(s.toString)
+      case e @ SchemaOfCsv(_: Literal, _) =>
+        val ddlSchema = e.eval(EmptyRow).asInstanceOf[UTF8String]
+        DataType.fromDDL(ddlSchema.toString)
+      case e => throw new AnalysisException(
+        "Schema should be specified in DDL format as a string literal or output of " +
+          s"the schema_of_csv function instead of ${e.sql}")
+    }
+
+    if (!dataType.isInstanceOf[StructType]) {
+      throw new AnalysisException(
+        s"Schema should be struct type but got ${dataType.sql}.")
+    }
+    dataType.asInstanceOf[StructType]
+  }
+
+  def evalTypeExpr(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) => DataType.fromDDL(s.toString)
--- End diff --

We also need to update https://github.com/apache/spark/pull/22666/files#diff-5321c01e95bffc4413c5f3457696213eR157 in case the constant folding rule is disabled.
[GitHub] spark pull request #22666: [SPARK-25672][SQL] schema_of_csv() - schema infer...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22666#discussion_r228785835

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala ---
@@ -19,14 +19,39 @@ package org.apache.spark.sql.catalyst.expressions

 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.util.ArrayBasedMapData
-import org.apache.spark.sql.types.{MapType, StringType, StructType}
+import org.apache.spark.sql.types.{DataType, MapType, StringType, StructType}
+import org.apache.spark.unsafe.types.UTF8String

 object ExprUtils {

-  def evalSchemaExpr(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => StructType.fromDDL(s.toString)
+  def evalSchemaExpr(exp: Expression): StructType = {
+    // Use `DataType.fromDDL` since the type string can be struct<...>.
+    val dataType = exp match {
+      case Literal(s, StringType) =>
+        DataType.fromDDL(s.toString)
+      case e @ SchemaOfCsv(_: Literal, _) =>
+        val ddlSchema = e.eval(EmptyRow).asInstanceOf[UTF8String]
+        DataType.fromDDL(ddlSchema.toString)
+      case e => throw new AnalysisException(
+        "Schema should be specified in DDL format as a string literal or output of " +
+          s"the schema_of_csv function instead of ${e.sql}")
+    }
+
+    if (!dataType.isInstanceOf[StructType]) {
+      throw new AnalysisException(
+        s"Schema should be struct type but got ${dataType.sql}.")
+    }
+    dataType.asInstanceOf[StructType]
+  }
+
+  def evalTypeExpr(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) => DataType.fromDDL(s.toString)
--- End diff --

How about
```
if (expr.isFoldable && expr.dataType == StringType) {
  DataType.fromDDL(expr.eval().asInstanceOf[UTF8String].toString)
}
```
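A minimal usage sketch of the function under review, mirroring the golden-file test quoted earlier in this thread. It assumes `schema_of_csv` lands as shown in this PR, so the output shape is illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Infer a DDL schema string from a sample row, then reuse it in from_csv.
spark.sql("SELECT schema_of_csv('1,abc')").show(truncate = false)
spark.sql("SELECT from_csv('1,abc', schema_of_csv('1,abc'))").show()
```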
[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22871 **[Test build #98177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98177/testReport)** for PR 22871 at commit [`34109bc`](https://github.com/apache/spark/commit/34109bc63dd12f733cf2052b048093799f9b0102).
[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22871 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4576/
[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22871 Merged build finished. Test PASSed.
[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21732#discussion_r228785348

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -17,6 +17,8 @@ displayTitle: Spark SQL Upgrading Guide

   - The `ADD JAR` command previously returned a result set with the single value 0. It now returns an empty result set.

+  - In Spark version 2.4 and earlier, `Dataset` doesn't support to encode `Option[Product]` at top-level row, because in Spark SQL entire top-level row can't be null. Since Spark 3.0, `Option[Product]` at top-level is encoded as a row with single struct column. Then with this support, `Aggregator` can also use use `Option[Product]` as buffer and output column types.
--- End diff --

Ok. Let me revert this.
[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21732#discussion_r228785236

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala ---
@@ -76,7 +76,7 @@ object TypedAggregateExpression {
       None,
       bufferSerializer,
       bufferEncoder.resolveAndBind().deserializer,
-      outputEncoder.serializer,
+      outputEncoder.objSerializer,
--- End diff --

This is required. Without it, the output schema of an aggregator that uses `Option[Product]` as its output encoder is not correct.
[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22871 cc @BryanCutler and @gatorsmile.
[GitHub] spark pull request #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType s...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22871

[SPARK-25179][PYTHON][DOCS] Document BinaryType support in Arrow conversion

## What changes were proposed in this pull request?

This PR targets to document binary type in "Apache Arrow in Spark".

## How was this patch tested?

Manually built the documentation and checked.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-25179

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22871.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22871

commit 34109bc63dd12f733cf2052b048093799f9b0102
Author: hyukjinkwon
Date: 2018-10-29T02:53:18Z

    Document BinaryType support in Arrow conversion
[GitHub] spark pull request #22849: [SPARK-25852][Core] we should filter the workOffe...
Github user zuotingbing commented on a diff in the pull request: https://github.com/apache/spark/pull/22849#discussion_r228785046

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
@@ -240,7 +240,7 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
     val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
       // Filter out executors under killing
       val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
-      val workOffers = activeExecutors.map {
+      val workOffers = activeExecutors.filter(_._2.freeCores > 0).map {
--- End diff --

On our cluster there are many executors and tasksets/tasks. As we know, tasks are offered to each node in a round-robin manner on every revive cycle (`spark.scheduler.revive.interval`, default `1s`). It seems to make no sense to offer tasks to executors that have no free cores.
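A plain-Scala sketch of the one-line change in the diff above: work offers are built only for executors that still have free cores. `ExecutorData` here is a stand-in for Spark's internal class of the same name:

```scala
// Stand-in for the internal ExecutorData; only the field we need here.
case class ExecutorData(freeCores: Int)

val executorDataMap = Map("exec-1" -> ExecutorData(0), "exec-2" -> ExecutorData(3))
val workOffers = executorDataMap.filter { case (_, data) => data.freeCores > 0 }
// workOffers == Map("exec-2" -> ExecutorData(3))
```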
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228784274

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala ---
@@ -373,6 +383,32 @@ object ScalaReflection extends ScalaReflection {
         dataType = ObjectType(udt.getClass))
       Invoke(obj, "deserialize", ObjectType(udt.userClass), path :: Nil)

+    case t if isValueClass(t) =>
+      val (_, underlyingType) = getUnderlyingParameterOf(t)
+      val underlyingClsName = getClassNameFromType(underlyingType)
+      val clsName = getUnerasedClassNameFromType(t)
+      val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+        walkedTypePath
+
+      // Nested value class is treated as its underlying type
+      // because the compiler will convert value class in the schema to
+      // its underlying type.
+      // However, for value class that is top-level or array element,
+      // if it is used as another type (e.g. as its parent trait or generic),
+      // the compiler keeps the class so we must provide an instance of the
+      // class too. In other cases, the compiler will handle wrapping/unwrapping
+      // for us automatically.
+      val arg = deserializerFor(underlyingType, path, newTypePath, Some(t))
+      val isCollectionElement = lastType.exists { lt =>
+        lt <:< localTypeOf[Array[_]] || lt <:< localTypeOf[Seq[_]]
+      }
+      if (lastType.isEmpty || isCollectionElement) {
--- End diff --

It looks to me that we don't need `lastType`, but just a boolean parameter "needInstantiateValueClass".
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228783542

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala ---
@@ -373,6 +383,32 @@ object ScalaReflection extends ScalaReflection {
         dataType = ObjectType(udt.getClass))
       Invoke(obj, "deserialize", ObjectType(udt.userClass), path :: Nil)

+    case t if isValueClass(t) =>
+      val (_, underlyingType) = getUnderlyingParameterOf(t)
+      val underlyingClsName = getClassNameFromType(underlyingType)
+      val clsName = getUnerasedClassNameFromType(t)
+      val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+        walkedTypePath
+
+      // Nested value class is treated as its underlying type
+      // because the compiler will convert value class in the schema to
+      // its underlying type.
+      // However, for value class that is top-level or array element,
+      // if it is used as another type (e.g. as its parent trait or generic),
+      // the compiler keeps the class so we must provide an instance of the
+      // class too. In other cases, the compiler will handle wrapping/unwrapping
+      // for us automatically.
+      val arg = deserializerFor(underlyingType, path, newTypePath, Some(t))
+      val isCollectionElement = lastType.exists { lt =>
+        lt <:< localTypeOf[Array[_]] || lt <:< localTypeOf[Seq[_]]
--- End diff --

how about map?
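A minimal example of the cases this thread debates; whether each shape round-trips through a Dataset depends on this PR, so treat it as illustrative only:

```scala
case class Id(value: Int) extends AnyVal    // a Scala value class

case class User(id: Id, name: String)       // nested: the compiler erases Id to Int
val topLevel: Id = Id(1)                    // top-level: handled specially per the diff
val elements: Seq[Id] = Seq(Id(1), Id(2))   // collection element: boxed, so an instance is needed
```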
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588 @dongjoon-hyun and @wangyum, please fix my comment if I am wrong at any point - I believe you guys took a look at this part more than I did.
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228783130

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala ---
@@ -184,7 +193,8 @@ object ScalaReflection extends ScalaReflection {
   private def deserializerFor(
       tpe: `Type`,
       path: Expression,
-      walkedTypePath: Seq[String]): Expression = cleanUpReflectionObjects {
+      walkedTypePath: Seq[String],
+      lastType: Option[Type]): Expression = cleanUpReflectionObjects {
--- End diff --

can we add parameter doc for it?
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588

> Does this upgrade Hive for execution or also for metastore? Spark supports virtually all Hive metastore versions out there, and a lot of deployments do run different versions of Spark against the same old Hive metastore, and it'd be bad to break connectivity to old Hive metastores.

> The execution part is a different story and we can upgrade them easily.

The upgrade basically targets Hive for execution (let me know if I am mistaken). For metastore compatibility, I believe we are able to provide metastore JARs and support other Hive versions by explicitly configuring the JARs via an isolated classloader. I believe we have basic tests for different Hive versions. I would cautiously like to raise an option: drop the built-in metastore support by default in 3.0 if the upgrade makes keeping the built-in metastore support hard enough.
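For reference, a sketch of the isolated-classloader option mentioned above. The config keys are the standard Spark settings, while the version and path values are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "1.2.1")             // placeholder version
  .config("spark.sql.hive.metastore.jars", "/path/to/hive/lib/*")  // placeholder path
  .enableHiveSupport()
  .getOrCreate()
```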
[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21732#discussion_r228782980

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala ---
@@ -76,7 +76,7 @@ object TypedAggregateExpression {
       None,
       bufferSerializer,
       bufferEncoder.resolveAndBind().deserializer,
-      outputEncoder.serializer,
+      outputEncoder.objSerializer,
--- End diff --

To confirm: is this an unrelated change that just cleans up the code?
[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21732#discussion_r228782790

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala ---
@@ -362,4 +362,38 @@ class ScalaReflectionSuite extends SparkFunSuite {
     assert(numberOfCheckedArguments(deserializerFor[(java.lang.Double, Int)]) == 1)
     assert(numberOfCheckedArguments(deserializerFor[(java.lang.Integer, java.lang.Integer)]) == 0)
   }
+
+  test("SPARK-24762: serializer for Option of Product") {
--- End diff --

do we need to add tests here? `ScalaReflection` is not updated in this PR.
[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21732#discussion_r228782670

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala ---
@@ -207,7 +198,7 @@ case class ExpressionEncoder[T](
   val serializer: Seq[NamedExpression] = {
     val clsName = Utils.getSimpleName(clsTag.runtimeClass)

-    if (isSerializedAsStruct) {
+    if (isSerializedAsStruct && !classOf[Option[_]].isAssignableFrom(clsTag.runtimeClass)) {
--- End diff --

Can we make sure that other places calling `isSerializedAsStruct` don't need to check the Option type?
[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21732#discussion_r228782536

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -17,6 +17,8 @@ displayTitle: Spark SQL Upgrading Guide

   - The `ADD JAR` command previously returned a result set with the single value 0. It now returns an empty result set.

+  - In Spark version 2.4 and earlier, `Dataset` doesn't support to encode `Option[Product]` at top-level row, because in Spark SQL entire top-level row can't be null. Since Spark 3.0, `Option[Product]` at top-level is encoded as a row with single struct column. Then with this support, `Aggregator` can also use use `Option[Product]` as buffer and output column types.
--- End diff --

Usually we only add a migration guide entry if something is broken and users must be aware of it when upgrading. I don't think that's the case here; it's just a new feature.
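A sketch of the behavior the migration-guide entry describes, assuming the PR's semantics for top-level `Option[Product]`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Unsupported in Spark 2.4 and earlier; with this change it becomes a
// Dataset with a single nullable struct column.
val ds = Seq(Some((1, "a")), None).toDS()
ds.printSchema()
```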
[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18339 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98175/
[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18339 **[Test build #98175 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98175/testReport)** for PR 18339 at commit [`ea267c6`](https://github.com/apache/spark/commit/ea267c68c805951c5ee2fb4fccd9f8fb4a288297). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18339 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18339 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98174/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18339 **[Test build #98174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98174/testReport)** for PR 18339 at commit [`fa63ba7`](https://github.com/apache/spark/commit/fa63ba7197bc5b7844716cf2efbe62a3f652a7e7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18339 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r228781145 --- Diff: sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out --- @@ -161,7 +161,7 @@ SELECT db1.t1.i1 FROM t1, mydb2.t1 struct<> -- !query 18 output org.apache.spark.sql.AnalysisException -cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7 +cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7 --- End diff -- Do you think we should just make `sql` the same as `name`? It looks to me that `'db1.t1.i1'` is better than `` '`db1`.`t1`.`i1`' ``, as it's more compact and not ambiguous (see the toy sketch below). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
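A toy contrast of the two renderings under discussion; both helpers are hypothetical, not Spark's actual `sql`/`name` implementations:

```scala
// Hypothetical helpers contrasting the two ways of printing a nested
// column reference in an error message.
def backticked(parts: Seq[String]): String = parts.map(p => s"`$p`").mkString(".")
def compact(parts: Seq[String]): String = parts.mkString(".")

println(backticked(Seq("db1", "t1", "i1"))) // `db1`.`t1`.`i1`
println(compact(Seq("db1", "t1", "i1")))    // db1.t1.i1
```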
[GitHub] spark pull request #22755: [SPARK-25755][SQL][Test] Supplementation of non-C...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22755#discussion_r228780582 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala --- @@ -166,6 +167,17 @@ private[sql] trait SQLTestUtilsBase super.withSQLConf(pairs: _*)(f) } + /** + * A helper function for turning off/on codegen. + */ + protected def withCodegenTurnOffAndOn(f: String => Unit): Unit = { --- End diff -- nit: `withWholeStageCodegenOnAndOff` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
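Under the rename suggested above, the helper might look like the following sketch; it assumes the enclosing trait already provides `withSQLConf` (as `SQLTestUtilsBase` does) and uses the real `SQLConf.WHOLESTAGE_CODEGEN_ENABLED` key:

```scala
import org.apache.spark.sql.internal.SQLConf

// Sketch only: run the test body once with whole-stage codegen off and
// once with it on, passing the current setting to the body.
protected def withWholeStageCodegenOnAndOff(f: String => Unit): Unit = {
  Seq("false", "true").foreach { enabled =>
    withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> enabled) {
      f(enabled)
    }
  }
}
```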
[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22514#discussion_r228780430 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala --- @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand( override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = { --- End diff -- Some more thoughts: `CreateHiveTableAsSelectCommand` just runs another command, so we will not get any metrics for this plan node. It's OK if we use the Hive writer, as we indeed can't get any metrics there (the writing is done by Hive). However, if we can convert the plan and use Spark's native writer, we do have metrics. I think a better fix is to replace Hive CTAS with data source CTAS during optimization. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22755: [SPARK-25755][SQL][Test] Supplementation of non-CodeGen ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22755 Would it be better to apply this util method to the other suites as well (e.g. `DataFrameRangeSuite.scala` and `DataFrameAggregateSuite.scala`)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFilter.en...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22865 **[Test build #98176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98176/testReport)** for PR 22865 at commit [`de22015`](https://github.com/apache/spark/commit/de22015a9b610011ff5616d398b999d39d21eeba). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22514 It's definitely not a blocker, and we don't need to hold RC5 because of it. I think it needs a little more review, and I'm going to cut RC5 today (2.4.0 has already been delayed for quite a while), so it's very likely we can't get it into 2.4.0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22861: [SPARK-25663][SPARK-25661][SQL][TEST] Refactor BuiltInDa...
Github user yucai commented on the issue: https://github.com/apache/spark/pull/22861 @dongjoon-hyun Originally, I wanted to do two things in this PR. 1. Make sure `mainArgs` is correctly set in `BenchmarkBase`. 2. Include an example of using `mainArgs`: refactor `DataSourceWriteBenchmark` and `BuiltInDataSourceWriteBenchmark` to use a main method. But refactoring `DataSourceWriteBenchmark` will lead to refactoring `AvroWriteBenchmark` as well. Any suggestions on how to split the PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22857#discussion_r228779505 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -2578,4 +2578,45 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { Row ("abc", 1)) } } + + test("SPARK-25860: Replace Literal(null, _) with FalseLiteral whenever possible") { + +def checkPlanIsEmptyLocalScan(df: DataFrame): Unit = df.queryExecution.executedPlan match { --- End diff -- This assumes we run `ConvertToLocalRelation`; let's use `withSQLConf` to make sure this rule is on (see the sketch below). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
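One way to pin the rule on, sketched under the assumption that the test mixes in `SQLTestUtils`: clear `spark.sql.optimizer.excludedRules` for the scope of the assertion, since `ConvertToLocalRelation` is on by default but may be excluded by the test environment:

```scala
import org.apache.spark.sql.internal.SQLConf

// Sketch: with no rules excluded, ConvertToLocalRelation is guaranteed
// to run, so the executedPlan check cannot be broken by test config.
withSQLConf(SQLConf.OPTIMIZER_EXCLUDED_RULES.key -> "") {
  // ... build df and call checkPlanIsEmptyLocalScan(df) here ...
}
```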
[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22857#discussion_r228779276 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] { flattenConcats(concat) } } + +/** + * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further optimizations. + * + * For example, `Filter(Literal(null, _))` is equal to `Filter(FalseLiteral)`. + * + * Another example containing branches is `Filter(If(cond, FalseLiteral, Literal(null, _)))`; + * this can be optimized to `Filter(If(cond, FalseLiteral, FalseLiteral))`, and eventually + * `Filter(FalseLiteral)`. + * + * As a result, many unnecessary computations can be removed in the query optimization phase. + * + * Similarly, the same logic can be applied to conditions in [[Join]], predicates in [[If]], + * conditions in [[CaseWhen]]. + */ +object ReplaceNullWithFalse extends Rule[LogicalPlan] { + + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case f @ Filter(cond, _) => f.copy(condition = replaceNullWithFalse(cond)) +case j @ Join(_, _, _, Some(cond)) => j.copy(condition = Some(replaceNullWithFalse(cond))) +case p: LogicalPlan => p transformExpressions { + case i @ If(pred, _, _) => i.copy(predicate = replaceNullWithFalse(pred)) + case CaseWhen(branches, elseValue) => +val newBranches = branches.map { case (cond, value) => + replaceNullWithFalse(cond) -> value +} +CaseWhen(newBranches, elseValue) +} + } + + /** + * Recursively replaces `Literal(null, _)` with `FalseLiteral`. + * + * Note that `transformExpressionsDown` can not be used here as we must stop as soon as we hit + * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or `Literal(null, _)`. + */ + private def replaceNullWithFalse(e: Expression): Expression = e match { +case cw: CaseWhen if getValues(cw).forall(isNullOrBoolean) => + val newBranches = cw.branches.map { case (cond, value) => +replaceNullWithFalse(cond) -> replaceNullWithFalse(value) + } + val newElseValue = cw.elseValue.map(replaceNullWithFalse) + CaseWhen(newBranches, newElseValue) +case If(pred, trueVal, falseVal) if Seq(trueVal, falseVal).forall(isNullOrBoolean) => + If(replaceNullWithFalse(pred), replaceNullWithFalse(trueVal), replaceNullWithFalse(falseVal)) +case And(left, right) => --- End diff -- We need to be careful here: null && false is false, and null || true is true (see the sketch below). Please take a look at https://github.com/apache/spark/pull/22702 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
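To make the caution concrete, a quick spark-shell sketch of the three-valued semantics involved (results noted in comments); it assumes the `spark` session that spark-shell provides:

```scala
// null behaves like false under AND/OR in a filter context, but not
// under every operator: NOT(null) is null, while NOT(false) is true.
spark.sql("SELECT cast(null AS boolean) AND false").show() // false
spark.sql("SELECT cast(null AS boolean) OR true").show()   // true
spark.sql("SELECT NOT cast(null AS boolean)").show()       // NULL
```

This is why the recursion must stop at expressions other than `CaseWhen`, `If`, `And` and `Or`: rewriting the null inside an arbitrary parent such as `Not` would flip the result from "row filtered out" to "row kept".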
[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22857#discussion_r228779125 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] { flattenConcats(concat) } } + +/** + * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further optimizations. + * + * For example, `Filter(Literal(null, _))` is equal to `Filter(FalseLiteral)`. + * + * Another example containing branches is `Filter(If(cond, FalseLiteral, Literal(null, _)))`; + * this can be optimized to `Filter(If(cond, FalseLiteral, FalseLiteral))`, and eventually + * `Filter(FalseLiteral)`. + * + * As a result, many unnecessary computations can be removed in the query optimization phase. + * + * Similarly, the same logic can be applied to conditions in [[Join]], predicates in [[If]], + * conditions in [[CaseWhen]]. + */ +object ReplaceNullWithFalse extends Rule[LogicalPlan] { + + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case f @ Filter(cond, _) => f.copy(condition = replaceNullWithFalse(cond)) +case j @ Join(_, _, _, Some(cond)) => j.copy(condition = Some(replaceNullWithFalse(cond))) +case p: LogicalPlan => p transformExpressions { + case i @ If(pred, _, _) => i.copy(predicate = replaceNullWithFalse(pred)) + case CaseWhen(branches, elseValue) => +val newBranches = branches.map { case (cond, value) => + replaceNullWithFalse(cond) -> value +} +CaseWhen(newBranches, elseValue) +} + } + + /** + * Recursively replaces `Literal(null, _)` with `FalseLiteral`. + * + * Note that `transformExpressionsDown` can not be used here as we must stop as soon as we hit + * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or `Literal(null, _)`. + */ + private def replaceNullWithFalse(e: Expression): Expression = e match { +case cw: CaseWhen if getValues(cw).forall(isNullOrBoolean) => --- End diff -- this applies to `If` as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22857#discussion_r228779125 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] { flattenConcats(concat) } } + +/** + * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further optimizations. + * + * For example, `Filter(Literal(null, _))` is equal to `Filter(FalseLiteral)`. + * + * Another example containing branches is `Filter(If(cond, FalseLiteral, Literal(null, _)))`; + * this can be optimized to `Filter(If(cond, FalseLiteral, FalseLiteral))`, and eventually + * `Filter(FalseLiteral)`. + * + * As a result, many unnecessary computations can be removed in the query optimization phase. + * + * Similarly, the same logic can be applied to conditions in [[Join]], predicates in [[If]], + * conditions in [[CaseWhen]]. + */ +object ReplaceNullWithFalse extends Rule[LogicalPlan] { + + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case f @ Filter(cond, _) => f.copy(condition = replaceNullWithFalse(cond)) +case j @ Join(_, _, _, Some(cond)) => j.copy(condition = Some(replaceNullWithFalse(cond))) +case p: LogicalPlan => p transformExpressions { + case i @ If(pred, _, _) => i.copy(predicate = replaceNullWithFalse(pred)) + case CaseWhen(branches, elseValue) => +val newBranches = branches.map { case (cond, value) => + replaceNullWithFalse(cond) -> value +} +CaseWhen(newBranches, elseValue) +} + } + + /** + * Recursively replaces `Literal(null, _)` with `FalseLiteral`. + * + * Note that `transformExpressionsDown` can not be used here as we must stop as soon as we hit + * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or `Literal(null, _)`. + */ + private def replaceNullWithFalse(e: Expression): Expression = e match { +case cw: CaseWhen if getValues(cw).forall(isNullOrBoolean) => --- End diff -- Actually, just `cw.dataType == BooleanType` should do. If an expression is `NullType`, it should have been replaced by a null literal already. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18339 **[Test build #98175 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98175/testReport)** for PR 18339 at commit [`ea267c6`](https://github.com/apache/spark/commit/ea267c68c805951c5ee2fb4fccd9f8fb4a288297). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22857#discussion_r228779010 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] { flattenConcats(concat) } } + +/** + * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further optimizations. + * + * For example, `Filter(Literal(null, _))` is equal to `Filter(FalseLiteral)`. + * + * Another example containing branches is `Filter(If(cond, FalseLiteral, Literal(null, _)))`; + * this can be optimized to `Filter(If(cond, FalseLiteral, FalseLiteral))`, and eventually + * `Filter(FalseLiteral)`. + * + * As a result, many unnecessary computations can be removed in the query optimization phase. + * + * Similarly, the same logic can be applied to conditions in [[Join]], predicates in [[If]], + * conditions in [[CaseWhen]]. + */ +object ReplaceNullWithFalse extends Rule[LogicalPlan] { + + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case f @ Filter(cond, _) => f.copy(condition = replaceNullWithFalse(cond)) +case j @ Join(_, _, _, Some(cond)) => j.copy(condition = Some(replaceNullWithFalse(cond))) +case p: LogicalPlan => p transformExpressions { + case i @ If(pred, _, _) => i.copy(predicate = replaceNullWithFalse(pred)) + case CaseWhen(branches, elseValue) => +val newBranches = branches.map { case (cond, value) => + replaceNullWithFalse(cond) -> value +} +CaseWhen(newBranches, elseValue) +} + } + + /** + * Recursively replaces `Literal(null, _)` with `FalseLiteral`. + * + * Note that `transformExpressionsDown` can not be used here as we must stop as soon as we hit + * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or `Literal(null, _)`. + */ + private def replaceNullWithFalse(e: Expression): Expression = e match { +case cw: CaseWhen if getValues(cw).forall(isNullOrBoolean) => --- End diff -- how about `cw.dataType == BooleanType || cw.dataType == NullType`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18339 **[Test build #98174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98174/testReport)** for PR 18339 at commit [`fa63ba7`](https://github.com/apache/spark/commit/fa63ba7197bc5b7844716cf2efbe62a3f652a7e7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22865#discussion_r228776973 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -462,7 +462,7 @@ object SQLConf { val PARQUET_RECORD_FILTER_ENABLED = buildConf("spark.sql.parquet.recordLevelFilter.enabled") .doc("If true, enables Parquet's native record-level filtering using the pushed down " + "filters. This configuration only has an effect when 'spark.sql.parquet.filterPushdown' " + - "is enabled.") + "is enabled and spark.sql.parquet.enableVectorizedReader is disabled.") --- End diff -- SGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
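For reference, the combination the revised doc text describes, as a sketch; the three config keys are real, but the data path is hypothetical:

```scala
// Record-level filtering only takes effect on the non-vectorized
// Parquet read path, so all three settings matter together.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.conf.set("spark.sql.parquet.recordLevelFilter.enabled", "true")

// Hypothetical data path, for illustration only.
val df = spark.read.parquet("/tmp/example.parquet").filter("id > 10")
```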
[GitHub] spark issue #22817: [SPARK-25816][SQL] Fix attribute resolution in nested ex...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22817 RC5 will have this fix --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22870: [SPARK-25862][SQL] Remove rangeBetween APIs intro...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22870#discussion_r228776577 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala --- @@ -206,7 +206,7 @@ case class SpecifiedWindowFrame( // Check combination (of expressions). (lower, upper) match { case (l: Expression, u: Expression) if !isValidFrameBoundary(l, u) => -TypeCheckFailure(s"Window frame upper bound '$upper' does not followes the lower bound " + +TypeCheckFailure(s"Window frame upper bound '$upper' does not follow the lower bound " + --- End diff -- We need to rerun `org.apache.spark.sql.SQLQueryTestSuite`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22868#discussion_r228776349 --- Diff: docs/sql-migration-guide-hive-compatibility.md --- @@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, such as: * Explain * Partitioned tables including dynamic partition insertion * View + * If column aliases are not specified in view definition queries, both Spark and Hive will +generate alias names, but in different ways. In order for Spark to be able to read views created +by Hive, users should explicitly specify column aliases in view definition queries. --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org