[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840

Tested with spark-2.2.0-bin-hadoop2.7 and numpy, running examples/src/main/python/mllib/correlations_example.py.

### case 1

|key|value|
|---|---|
|**PYSPARK_DRIVER_PYTHON**|~/anaconda3/envs/py3/bin/python|
|deploy-mode|**client**|
|--archives|~/anaconda3/envs/py3.zip|
|spark.yarn.appMasterEnv.**PYSPARK_PYTHON**|py3.zip/py3/bin/python|
|spark.executorEnv.PYSPARK_PYTHON|py3.zip/py3/bin/python|
|failure|Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.|

### case 2

|key|value|
|---|---|
|**PYSPARK_DRIVER_PYTHON**|~/anaconda3/envs/py3/bin/python|
|deploy-mode|**cluster**|
|--archives|~/anaconda3/envs/py3.zip|
|spark.yarn.appMasterEnv.**PYSPARK_PYTHON**|py3.zip/py3/bin/python|
|spark.executorEnv.PYSPARK_PYTHON|py3.zip/py3/bin/python|
|failure|java.io.IOException: Cannot run program "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or directory at org.apache.spark.**deploy.PythonRunner**$.main(PythonRunner.scala:91)|

### case 3 & 4

|key|value|
|---|---|
|**PYSPARK_DRIVER_PYTHON**|~/anaconda3/envs/py3/bin/python|
|deploy-mode|**cluster (3), client (4)**|
|--archives|~/anaconda3/envs/py3.zip|
|spark.yarn.appMasterEnv.**PYSPARK_DRIVER_PYTHON**|py3.zip/py3/bin/python|
|spark.executorEnv.PYSPARK_DRIVER_PYTHON|py3.zip/py3/bin/python|
|failure|Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.|

### case 5 & 6

|key|value|
|---|---|
|**PYSPARK_PYTHON**|~/anaconda3/envs/py3/bin/python|
|deploy-mode|**cluster (6)**|
|--archives|~/anaconda3/envs/py3.zip|
|spark.yarn.appMasterEnv.**PYSPARK_DRIVER_PYTHON**|py3.zip/py3/bin/python|
|spark.executorEnv.PYSPARK_DRIVER_PYTHON|py3.zip/py3/bin/python|
|failure|java.io.IOException: Cannot run program "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or directory [**executor side PythonRunner**]|

### case 7

|key|value|
|---|---|
|**PYSPARK_PYTHON**|~/anaconda3/envs/py3/bin/python|
|deploy-mode|**cluster**|
|--archives|~/anaconda3/envs/py3.zip|
|spark.yarn.appMasterEnv.**PYSPARK_PYTHON**|py3.zip/py3/bin/python|
|spark.executorEnv.PYSPARK_DRIVER_PYTHON|py3.zip/py3/bin/python|
|**success**|--|

### case 8

|key|value|
|---|---|
|**PYSPARK_PYTHON**|~/anaconda3/envs/py3/bin/python|
|deploy-mode|**cluster**|
|--archives|~/anaconda3/envs/py3.zip|
|spark.yarn.appMasterEnv.**PYSPARK_PYTHON**|py3.zip/py3/bin/python|
|spark.executorEnv.PYSPARK_DRIVER_PYTHON|py3.zip/py3/bin/python|
|failure|java.io.IOException: Cannot run program "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or directory [**executor side PythonRunner**]|

### case 9

|key|value|
|---|---|
|not setting ~~PYSPARK_[DRIVER]_PYTHON~~||
|deploy-mode|**client**|
|--archives|~/anaconda3/envs/py3.zip|
|spark.yarn.appMasterEnv.**PYSPARK_PYTHON**|py3.zip/py3/bin/python|
|spark.executorEnv.PYSPARK_DRIVER_PYTHON|py3.zip/py3/bin/python|
|failure|ImportError: No module named numpy|

### case 10

|key|value|
|---|---|
|not setting ~~PYSPARK_[DRIVER]_PYTHON~~||
|deploy-mode|**cluster**|
|--archives|~/anaconda3/envs/py3.zip|
|spark.yarn.appMasterEnv.**PYSPARK_PYTHON**|py3.zip/py3/bin/python|
|spark.executorEnv.PYSPARK_DRIVER_PYTHON|py3.zip/py3/bin/python|
|**success**|--|

### my humble opinions

1. spark.executorEnv.PYSPARK_* has no effect on the executor-side pythonExec, which is determined by the driver.
2. If PYSPARK_PYTHON is specified, then **spark.yarn.appMasterEnv.** should be suffixed with **PYSPARK_PYTHON**, not ~~PYSPARK_DRIVER_PYTHON~~.
3. Specifying PYSPARK_DRIVER_PYTHON fails in all the cases; this may be because [context.py#L191](https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/context.py#L191) only deals with PYSPARK_PYTHON.

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
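[For reference, a hedged sketch of scripting the one combination above that succeeded (case 7), expressed through the `SparkLauncher` API for illustration; the original report used plain spark-submit, and the paths are the reporter's, kept only as placeholders.]

```scala
import scala.collection.JavaConverters._
import org.apache.spark.launcher.SparkLauncher

// Case 7: PYSPARK_PYTHON in the submitter's environment, cluster mode,
// --archives with the packed conda env, and spark.yarn.appMasterEnv.PYSPARK_PYTHON
// pointing inside the unpacked archive on the cluster.
val env = Map("PYSPARK_PYTHON" -> "/home/hadoop/anaconda3/envs/py3/bin/python").asJava
new SparkLauncher(env)
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setAppResource("examples/src/main/python/mllib/correlations_example.py")
  .addSparkArg("--archives", "/home/hadoop/anaconda3/envs/py3.zip")
  .setConf("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "py3.zip/py3/bin/python")
  .setConf("spark.executorEnv.PYSPARK_DRIVER_PYTHON", "py3.zip/py3/bin/python")
  .launch()
```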
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19868 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84408/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19868 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19868 **[Test build #84408 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84408/testReport)** for PR 19868 at commit [`5767660`](https://github.com/apache/spark/commit/57676609faed4512291979a8d639e3be1ec80578). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17770 @cloud-fan Sure. Seems there is no option to reopen it as it was merged before. Should I create another PR for it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17770 Hi @viirya , since it's close to Spark 2.3, would you like to reopen this PR? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154558750 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -568,8 +570,13 @@ object DataSource extends Logging { "org.apache.spark.Logging") /** Given a provider name, look up the data source class definition. */ - def lookupDataSource(provider: String): Class[_] = { -val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) + def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = { +var provider1 = backwardCompatibilityMap.getOrElse(provider, provider) +if (Seq("orc", "org.apache.spark.sql.hive.orc.OrcFileFormat").contains(provider1.toLowerCase) && +sparkSession.conf.get(SQLConf.ORC_ENABLED)) { --- End diff -- Shouldn't we get the conf from `sessionState`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154558153 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -363,6 +363,11 @@ object SQLConf { .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo")) .createWithDefault("snappy") + val ORC_ENABLED = buildConf("spark.sql.orc.enabled") +.doc("When true, use OrcFileFormat in sql/core module instead of the one in sql/hive module.") --- End diff -- The description should include the major difference of these two orc versions. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19869 **[Test build #84410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84410/testReport)** for PR 19869 at commit [`9b8ae3d`](https://github.com/apache/spark/commit/9b8ae3d6635c5ed0323bf088e20d0de55dd1c098). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19869#discussion_r154556638 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -672,48 +668,56 @@ case class HashAggregateExec( def outputFromRowBasedMap: String = { s""" - while ($iterTermForFastHashMap.next()) { - $numOutput.add(1); - UnsafeRow $keyTerm = (UnsafeRow) $iterTermForFastHashMap.getKey(); - UnsafeRow $bufferTerm = (UnsafeRow) $iterTermForFastHashMap.getValue(); - $outputFunc($keyTerm, $bufferTerm); - - if (shouldStop()) return; - } - $fastHashMapTerm.close(); - """ + |while ($iterTermForFastHashMap.next()) { + | UnsafeRow $keyTerm = (UnsafeRow) $iterTermForFastHashMap.getKey(); + | UnsafeRow $bufferTerm = (UnsafeRow) $iterTermForFastHashMap.getValue(); + | $outputFunc($keyTerm, $bufferTerm); + | + | if (shouldStop()) return; + |} + |$fastHashMapTerm.close(); + """.stripMargin } // Iterate over the aggregate rows and convert them from ColumnarRow to UnsafeRow def outputFromVectorizedMap: String = { val row = ctx.freshName("fastHashMapRow") --- End diff -- ah good catch! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18995: [SPARK-21787][SPARK-22672][SQL] Support for pushing down...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18995 maybe have a PR to move the tests first? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154556299 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/DDLSourceLoadSuite.scala --- @@ -54,11 +55,17 @@ class DDLSourceLoadSuite extends DataSourceTest with SharedSQLContext { .load().schema == StructType(Seq(StructField("stringType", StringType, nullable = false } - test("should fail to load ORC without Hive Support") { -val e = intercept[AnalysisException] { - spark.read.format("orc").load() + test("should fail to load ORC only if spark.sql.orc.enabled=false and without Hive Support") { --- End diff -- I think this test is replaced by https://github.com/apache/spark/pull/19871/files#diff-5a2e7f03d14856c8769fd3ddea8742bdR2788 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154556243 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -568,8 +570,13 @@ object DataSource extends Logging { "org.apache.spark.Logging") /** Given a provider name, look up the data source class definition. */ - def lookupDataSource(provider: String): Class[_] = { -val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) + def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = { +var provider1 = backwardCompatibilityMap.getOrElse(provider, provider) +if (Seq("orc", "org.apache.spark.sql.hive.orc.OrcFileFormat").contains(provider1.toLowerCase) && --- End diff -- "org.apache.spark.sql.hive.orc.OrcFileFormat" should still point to the old implementation --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154556215 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -568,8 +570,13 @@ object DataSource extends Logging { "org.apache.spark.Logging") /** Given a provider name, look up the data source class definition. */ - def lookupDataSource(provider: String): Class[_] = { -val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) + def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = { --- End diff -- instead of passing the `SparkSession`, I think we only need `SQLConf` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
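[To make the two review comments above concrete, a sketch of the narrower signature being suggested, as a fragment of `object DataSource`; the short "orc" alias is redirected while the full Hive class name keeps pointing at the old implementation. This only illustrates the feedback, not the code that was eventually merged.]

```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.util.Utils

/** Given a provider name, look up the data source class definition (sketch). */
def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
  val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
  // Redirect only the short "orc" alias; "org.apache.spark.sql.hive.orc.OrcFileFormat"
  // still resolves to the old sql/hive implementation.
  val resolved = if (provider1.equalsIgnoreCase("orc") && conf.getConf(SQLConf.ORC_ENABLED)) {
    "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat"
  } else {
    provider1
  }
  // ... the existing class-loading logic then runs on `resolved` ...
  Utils.classForName(resolved)
}
```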
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154556132 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -363,6 +363,11 @@ object SQLConf { .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo")) .createWithDefault("snappy") + val ORC_ENABLED = buildConf("spark.sql.orc.enabled") --- End diff -- how about `spark.sql.orc.useNewVersion`? Also let's make it an internal config and enable it by default. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
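[A sketch of the shape being proposed here; the name `spark.sql.orc.useNewVersion` is only a suggestion in this thread, and the conf that finally shipped may differ.]

```scala
// Inside object SQLConf: an internal config, with the new implementation on by default.
val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
  .internal()
  .doc("When true, use the new ORC data source in sql/core (based on Apache ORC); " +
    "when false, fall back to the Hive-based ORC data source in sql/hive.")
  .booleanConf
  .createWithDefault(true)
```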
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 @vanzin PYSPARK_DRIVER_PYTHON won't work because [context.py#L191](https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/context.py#L191) doesn't deal with it --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154554648 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/DriverServiceBootstrapStep.scala --- @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.submit.steps + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model.ServiceBuilder + +import org.apache.spark.SparkConf +import org.apache.spark.deploy.k8s.Config._ +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.submit.KubernetesDriverSpec +import org.apache.spark.internal.Logging +import org.apache.spark.util.Clock + +/** + * Allows the driver to be reachable by executor pods through a headless service. The service's + * ports should correspond to the ports that the executor will reach the pod at for RPC. + */ +private[spark] class DriverServiceBootstrapStep( +kubernetesResourceNamePrefix: String, +driverLabels: Map[String, String], +submissionSparkConf: SparkConf, +clock: Clock) extends DriverConfigurationStep with Logging { + import DriverServiceBootstrapStep._ + + override def configureDriver(driverSpec: KubernetesDriverSpec): KubernetesDriverSpec = { +require(submissionSparkConf.getOption(DRIVER_BIND_ADDRESS_KEY).isEmpty, + s"$DRIVER_BIND_ADDRESS_KEY is not supported in Kubernetes mode, as the driver's bind " + + "address is managed and set to the driver pod's IP address.") +require(submissionSparkConf.getOption(DRIVER_HOST_KEY).isEmpty, + s"$DRIVER_HOST_KEY is not supported in Kubernetes mode, as the driver's hostname will be " + + "managed via a Kubernetes service.") + +val preferredServiceName = s"$kubernetesResourceNamePrefix$DRIVER_SVC_POSTFIX" +val resolvedServiceName = if (preferredServiceName.length <= MAX_SERVICE_NAME_LENGTH) { + preferredServiceName +} else { + val randomServiceId = clock.getTimeMillis() + val shorterServiceName = s"spark-$randomServiceId$DRIVER_SVC_POSTFIX" + logWarning(s"Driver's hostname would preferably be $preferredServiceName, but this is " + +s"too long (must be <= $MAX_SERVICE_NAME_LENGTH characters). 
Falling back to use " + +s"$shorterServiceName as the driver service's name.") + shorterServiceName +} + +val driverPort = submissionSparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT) +val driverBlockManagerPort = submissionSparkConf.getInt( +org.apache.spark.internal.config.DRIVER_BLOCK_MANAGER_PORT.key, DEFAULT_BLOCKMANAGER_PORT) +val driverService = new ServiceBuilder() + .withNewMetadata() +.withName(resolvedServiceName) +.endMetadata() + .withNewSpec() +.withClusterIP("None") +.withSelector(driverLabels.asJava) +.addNewPort() + .withName(DRIVER_PORT_NAME) + .withPort(driverPort) + .withNewTargetPort(driverPort) + .endPort() +.addNewPort() + .withName(BLOCK_MANAGER_PORT_NAME) + .withPort(driverBlockManagerPort) + .withNewTargetPort(driverBlockManagerPort) + .endPort() +.endSpec() + .build() + +val namespace = submissionSparkConf.get(KUBERNETES_NAMESPACE) +val driverHostname = s"${driverService.getMetadata.getName}.$namespace.svc.cluster.local" +val resolvedSparkConf = driverSpec.driverSparkConf.clone() + .set(org.apache.spark.internal.config.DRIVER_HOST_ADDRESS, driverHostname) --- End diff -- `DRIVER_HOST_KEY`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19871 This is a second PR after #19651. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154549074 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -251,6 +252,7 @@ object SparkSubmit extends CommandLineUtils with Logging { YARN case m if m.startsWith("spark") => STANDALONE case m if m.startsWith("mesos") => MESOS + case m if m.startsWith("k8s") => KUBERNETES case m if m.startsWith("local") => LOCAL case _ => printErrorAndExit("Master must either be yarn or start with spark, mesos, local") --- End diff -- `Master must either be yarn or start with spark, mesos, k8s, local` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154549172 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -296,6 +298,12 @@ object SparkSubmit extends CommandLineUtils with Logging { case (STANDALONE, CLUSTER) if args.isR => printErrorAndExit("Cluster deploy mode is currently not supported for R " + "applications on standalone clusters.") + case (KUBERNETES, CLIENT) => +printErrorAndExit("Client mode is currently not supported for Kubernetes.") + case (KUBERNETES, _) if args.isPython => +printErrorAndExit("Python applications are currently not supported for Kubernetes.") + case (KUBERNETES, _) if args.isR => +printErrorAndExit("R applications are currently not supported for Kubernetes.") --- End diff -- nit: It does not affect the result, but logically I think this ordering is better:

```scala
case (KUBERNETES, _) if args.isPython =>
  printErrorAndExit("Python applications are currently not supported for Kubernetes.")
case (KUBERNETES, _) if args.isR =>
  printErrorAndExit("R applications are currently not supported for Kubernetes.")
case (KUBERNETES, CLIENT) =>
  printErrorAndExit("Client mode is currently not supported for Kubernetes.")
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154549733 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -590,6 +604,11 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S | the node running the Application Master via the Secure | Distributed Cache, for renewing the login tickets and the | delegation tokens periodically. +| +| Kubernetes only: +| --kubernetes-namespace NS The namespace in the Kubernetes cluster within which the +| application must be launched. The namespace must already +| exist in the cluster. (Default: default). --- End diff -- There are some messages that need to be updated too, e.g.:

    | Spark standalone or Mesos with cluster deploy mode only:
    |  --supervise                 If given, restarts the driver on failure.
    |  --kill SUBMISSION_ID        If given, kills the driver specified.
    |  --status SUBMISSION_ID      If given, requests the status of the driver specified.

Given the changes above, k8s also supports killing submissions and requesting submission statuses.

    | Spark standalone and Mesos only:
    |  --total-executor-cores NUM  Total cores for all executors.

k8s also supports the `totalExecutorCores` option.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154553669 --- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/executor/Dockerfile --- @@ -0,0 +1,31 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +FROM spark-base + +# If this docker file is being used in the context of building your images from a Spark distribution, the docker build +# command should be invoked from the top level directory of the Spark distribution. E.g.: +# docker build -t spark-executor:latest -f dockerfiles/executor/Dockerfile . + +COPY examples /opt/spark/examples + +CMD SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && \ +env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt && \ +readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt && \ +if ! [ -z ${SPARK_MOUNTED_CLASSPATH}+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && \ +if ! [ -z ${SPARK_EXECUTOR_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXECUTOR_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && \ +${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Dspark.executor.port=$SPARK_EXECUTOR_PORT -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP --- End diff -- For `SPARK_EXECUTOR_PORT`, I don't see the corresponding `ENV_EXECUTOR_PORT` has been set in `createExecutorPod`. Is it missing? ditto for`SPARK_MOUNTED_CLASSPATH`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19871 **[Test build #84409 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84409/testReport)** for PR 19871 at commit [`37e240c`](https://github.com/apache/spark/commit/37e240cf12ea7463c2e0ea56501b812e12745869). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/19871

[SPARK-20728][SQL] Make OrcFileFormat configurable between sql/hive and sql/core

## What changes were proposed in this pull request?

This PR aims to provide a configuration to choose the default `OrcFileFormat` from the legacy `sql/hive` module or the new `sql/core` module. For example, this configuration will affect the following operations.

```scala
spark.read.orc(...)
```

```sql
CREATE TABLE t USING ORC ...
```

## How was this patch tested?

Pass Jenkins with the new test suites.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark spark-sql-orc-enabled

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19871.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19871

commit 37e240cf12ea7463c2e0ea56501b812e12745869
Author: Dongjoon Hyun
Date: 2017-12-03T23:33:55Z

    [SPARK-20728][SQL] Make ORCFileFormat configurable between sql/hive and sql/core

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
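[As a hedged usage sketch, the proposed conf (name taken from this PR's diff; it was still under review at this point) would be flipped like this to choose which `OrcFileFormat` backs the `orc` source.]

```scala
// Use the new sql/core implementation for both the DataFrame reader and DDL.
spark.conf.set("spark.sql.orc.enabled", "true")
spark.read.orc("/tmp/orc_data").show()

// Fall back to the legacy sql/hive implementation.
spark.conf.set("spark.sql.orc.enabled", "false")
spark.sql("CREATE TABLE t USING ORC AS SELECT 1 AS a")
```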
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19868 **[Test build #84408 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84408/testReport)** for PR 19868 at commit [`5767660`](https://github.com/apache/spark/commit/57676609faed4512291979a8d639e3be1ec80578). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19855: [SPARK-22662] [SQL] Failed to prune columns after rewrit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19855 **[Test build #84407 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84407/testReport)** for PR 19855 at commit [`f9e9d10`](https://github.com/apache/spark/commit/f9e9d10d4b581379c2af966a35362b77f35a7b6f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19868 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19870 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84406/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19870 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19870 **[Test build #84406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84406/testReport)** for PR 19870 at commit [`fdf10b3`](https://github.com/apache/spark/commit/fdf10b36cbc746e57cb0b8c2f79646db9dd06964). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18995: [SPARK-21787][SPARK-22672][SQL] Support for pushing down...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18995 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84405/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18995: [SPARK-21787][SPARK-22672][SQL] Support for pushing down...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18995 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18995: [SPARK-21787][SPARK-22672][SQL] Support for pushing down...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18995 **[Test build #84405 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84405/testReport)** for PR 18995 at commit [`9f6d75b`](https://github.com/apache/spark/commit/9f6d75bece86233baeb277c95b1974ac207c0d32). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19756: [SPARK-22527][SQL] Reuse coordinated exchanges if possib...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19756 ping @cloud-fan @hvanhovell --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 ping @cloud-fan @kiszk Do you have any more review comments on this? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19870 **[Test build #84406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84406/testReport)** for PR 19870 at commit [`fdf10b3`](https://github.com/apache/spark/commit/fdf10b36cbc746e57cb0b8c2f79646db9dd06964). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19870: [SPARK-22665][SQL] Avoid repartitioning with empt...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/19870

[SPARK-22665][SQL] Avoid repartitioning with empty list of expressions

## What changes were proposed in this pull request?

Repartitioning by an empty set of expressions is currently possible, even though it is a case which is not handled properly. Indeed, in `HashExpression` there is a check to avoid running it on an empty set, but this check is not performed while repartitioning. Thus, the PR adds a check to avoid this wrong situation.

## How was this patch tested?

added UT

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22665

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19870.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19870

commit fdf10b36cbc746e57cb0b8c2f79646db9dd06964
Author: Marco Gaido
Date: 2017-12-03T19:40:01Z

    [SPARK-22665][SQL] Avoid repartitioning with empty list of expressions

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
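[A minimal sketch of the call shape this PR guards against, reconstructed from the description above; the pre-fix failure mode is not shown in the message, so this is illustrative only.]

```scala
import org.apache.spark.sql.Column

val df = spark.range(10).toDF("id")
// An empty list of partitioning expressions used to slip past the HashExpression
// empty-input check; the PR adds validation for exactly this kind of call.
df.repartition(5, Seq.empty[Column]: _*)
```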
[GitHub] spark issue #18995: [SPARK-21787][SQL] Support for pushing down filters for ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18995 Hi, @cloud-fan. This is the first followup after #19651. Could you review this PR, too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18995: [SPARK-21787][SQL] Support for pushing down filte...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/18995#discussion_r154537895 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcTest.scala --- @@ -15,20 +15,21 @@ * limitations under the License. */ -package org.apache.spark.sql.hive.orc +package org.apache.spark.sql.execution.datasources.orc import java.io.File import scala.reflect.ClassTag import scala.reflect.runtime.universe.TypeTag import org.apache.spark.sql._ -import org.apache.spark.sql.hive.test.TestHiveSingleton import org.apache.spark.sql.test.SQLTestUtils -private[sql] trait OrcTest extends SQLTestUtils with TestHiveSingleton { +abstract class OrcTest extends QueryTest with SQLTestUtils { import testImplicits._ + protected def format: String --- End diff -- By using full canonical class names, we will explicitly test the new OrcFileFormat in `sql/core` and the old OrcFileFormat in `sql/hive`, regardless of the SQL ORC switching conf. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
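[A hypothetical illustration of the pattern described above; the suite names here are invented. Each concrete suite pins one implementation through its full class name, so the tests do not depend on the switching conf.]

```scala
import org.apache.spark.sql.hive.test.TestHiveSingleton
import org.apache.spark.sql.test.SharedSQLContext

// sql/core side: always exercises the new implementation.
class CoreOrcQuerySuite extends OrcTest with SharedSQLContext {
  override protected def format: String =
    "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat"
}

// sql/hive side: always exercises the old implementation.
class HiveOrcQuerySuite extends OrcTest with TestHiveSingleton {
  override protected def format: String =
    "org.apache.spark.sql.hive.orc.OrcFileFormat"
}
```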
[GitHub] spark pull request #18995: [SPARK-21787][SQL] Support for pushing down filte...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/18995#discussion_r154537824 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala --- @@ -0,0 +1,362 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import java.nio.charset.StandardCharsets +import java.sql.{Date, Timestamp} + +import scala.collection.JavaConverters._ + +import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument} --- End diff -- Please note that this suite uses the Apache ORC `SearchArgument` and `PredicateLeaf` classes. We cannot share the test code for this suite. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
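[The namespace split behind this comment, shown side by side for illustration: only the import lines differ, but assertions on `SearchArgument` compile against different classes and so cannot be shared between the two suites.]

```scala
// New sql/core suite: Apache ORC's bundled storage-api classes.
import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument}

// Old sql/hive suite: Hive's classes of the same names.
// import org.apache.hadoop.hive.ql.io.sarg.{PredicateLeaf, SearchArgument}
```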
[GitHub] spark pull request #18995: [SPARK-21787][SQL] Support for pushing down filte...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/18995#discussion_r154537798 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala --- @@ -82,8 +82,7 @@ private[orc] object OrcFilters { * Both CharType and VarcharType are cleaned at AstBuilder. */ private def isSearchableType(dataType: DataType) = dataType match { -// TODO: SPARK-21787 Support for pushing down filters for DateType in ORC -case BinaryType | DateType => false --- End diff -- This PR supports DateType PPD (predicate push-down) on the new OrcFileFormat. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
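[A hedged illustration of what removing `DateType` from the non-searchable list enables; the path and column name are assumptions for the example.]

```scala
import java.sql.Date

val df = spark.read.format("orc").load("/tmp/dates.orc")
// With DateType searchable, this comparison can be converted into an ORC
// SearchArgument and pushed to the reader instead of being evaluated only in Spark.
df.filter(df("d") === Date.valueOf("2017-12-03")).explain()
```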
[GitHub] spark issue #18995: [SPARK-21787][SQL] Support for pushing down filters for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18995 **[Test build #84405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84405/testReport)** for PR 18995 at commit [`9f6d75b`](https://github.com/apache/spark/commit/9f6d75bece86233baeb277c95b1974ac207c0d32). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154537698 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -304,7 +313,9 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S } private def validateStatusRequestArguments(): Unit = { -if (!master.startsWith("spark://") && !master.startsWith("mesos://")) { +if (!master.startsWith("spark://") + && !master.startsWith("mesos://") + && !master.startsWith("k8s://")) { SparkSubmit.printErrorAndExit( "Requesting submission statuses is only supported in standalone or Mesos mode!") --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154537691 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -294,7 +301,9 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S } private def validateKillArguments(): Unit = { -if (!master.startsWith("spark://") && !master.startsWith("mesos://")) { +if (!master.startsWith("spark://") + && !master.startsWith("mesos://") + && !master.startsWith("k8s://")) { SparkSubmit.printErrorAndExit( "Killing submissions is only supported in standalone or Mesos mode!") --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154537632 --- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/submit/steps/DependencyResolutionStepSuite.scala --- @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.submit.steps + +import java.io.File + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model.{ContainerBuilder, HasMetadata, PodBuilder} + +import org.apache.spark.{SparkConf, SparkFunSuite} +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.submit.KubernetesDriverSpec + +private[spark] class DependencyResolutionStepSuite extends SparkFunSuite { --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154537561 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala --- @@ -119,5 +139,60 @@ private[spark] object Config extends Logging { "must be a positive integer") .createWithDefault(10) + + val WAIT_FOR_APP_COMPLETION = +ConfigBuilder("spark.kubernetes.submission.waitAppCompletion") + .doc("In cluster mode, whether to wait for the application to finish before exiting the " + +"launcher process.") + .booleanConf + .createWithDefault(true) + + val REPORT_INTERVAL = +ConfigBuilder("spark.kubernetes.report.interval") + .doc("Interval between reports of the current app status in cluster mode.") + .timeConf(TimeUnit.MILLISECONDS) + .createWithDefaultString("1s") + + private[spark] val JARS_DOWNLOAD_LOCATION = +ConfigBuilder("spark.kubernetes.mountDependencies.jarsDownloadDir") + .doc("Location to download jars to in the driver and executors. When using" + +" spark-submit, this directory must be empty and will be mounted as an empty directory" + +" volume on the driver and executor pod.") + .stringConf + .createWithDefault("/var/spark-data/spark-jars") --- End diff -- This is somewhat an implementation detail, and we expect that users normally wouldn't need to set it or even know about it, so having a default makes sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84404/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #84404 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84404/testReport)** for PR 16578 at commit [`981e53b`](https://github.com/apache/spark/commit/981e53bf6b4b23f790ff0bbd457f54f308441076). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest ` * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19867: [SPARK-22675] [SQL] Deduplicate PropagateTypes in...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/19867 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19867: [SPARK-22675] [SQL] Deduplicate PropagateTypes in TypeCo...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19867 After rethinking it, this is not the right fix. I will close it first. There are multiple issues in the existing `PropagateTypes`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19869#discussion_r154534333 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -672,48 +668,56 @@ case class HashAggregateExec( def outputFromRowBasedMap: String = { s""" - while ($iterTermForFastHashMap.next()) { - $numOutput.add(1); - UnsafeRow $keyTerm = (UnsafeRow) $iterTermForFastHashMap.getKey(); - UnsafeRow $bufferTerm = (UnsafeRow) $iterTermForFastHashMap.getValue(); - $outputFunc($keyTerm, $bufferTerm); - - if (shouldStop()) return; - } - $fastHashMapTerm.close(); - """ + |while ($iterTermForFastHashMap.next()) { + | UnsafeRow $keyTerm = (UnsafeRow) $iterTermForFastHashMap.getKey(); + | UnsafeRow $bufferTerm = (UnsafeRow) $iterTermForFastHashMap.getValue(); + | $outputFunc($keyTerm, $bufferTerm); + | + | if (shouldStop()) return; + |} + |$fastHashMapTerm.close(); + """.stripMargin } // Iterate over the aggregate rows and convert them from ColumnarRow to UnsafeRow def outputFromVectorizedMap: String = { val row = ctx.freshName("fastHashMapRow") --- End diff -- nit: Is it better to fix indentations for these three lines, too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
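[For readers following this thread, the formatting idiom under discussion is runnable on its own: `|` marks the left margin, so the emitted Java keeps consistent indentation no matter how deeply the Scala interpolator itself is nested.]

```scala
val counter = "agg_count"
val code =
  s"""
     |while (iter.next()) {
     |  $counter += 1;
     |  if (shouldStop()) return;
     |}
   """.stripMargin
println(code)  // prints the body flush against the '|' margin
```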
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19869 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84403/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19869 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19869 **[Test build #84403 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84403/testReport)** for PR 19869 at commit [`174f4ec`](https://github.com/apache/spark/commit/174f4ec2ea000de01da6f494db366dc3ff58ccc4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #84404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84404/testReport)** for PR 16578 at commit [`981e53b`](https://github.com/apache/spark/commit/981e53bf6b4b23f790ff0bbd457f54f308441076). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/core`
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19863 @gatorsmile. Since the main PR is merged, I'll include this in the others. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/cor...
Github user dongjoon-hyun closed the pull request at: https://github.com/apache/spark/pull/19863 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16735: [SPARK-19228][SQL] Introduce tryParseDate method to proc...
Github user sergey-rubtsov commented on the issue: https://github.com/apache/spark/pull/16735 I will try to complete it this month --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18953: [SPARK-20682][SQL] Update ORC data source based on Apach...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18953 This is resolved via https://github.com/apache/spark/pull/19651 . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18953: [SPARK-20682][SQL] Update ORC data source based o...
Github user dongjoon-hyun closed the pull request at: https://github.com/apache/spark/pull/18953 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-emtpy ...
Github user dongjoon-hyun closed the pull request at: https://github.com/apache/spark/pull/19571 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19571 This is resolved in https://github.com/apache/spark/pull/19651 . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19651 Thank you so much for making ORC move forward, @cloud-fan! Also, thank you, @HyukjinKwon, @gatorsmile, @viirya, @kiszk. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19869 cc @juliuszsompolski @kiszk @viirya @maropu --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19869#discussion_r154528578 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -882,45 +851,65 @@ case class HashAggregateExec( |${evaluateVariables(unsafeRowBufferEvals)} |// update unsafe row buffer |${updateUnsafeRowBuffer.mkString("\n").trim} - """.stripMargin + """.stripMargin } +val updateRowInHashMap: String = { + if (isFastHashMapEnabled) { +ctx.INPUT_ROW = fastRowBuffer +val boundUpdateExpr = updateExpr.map(BindReferences.bindReference(_, inputAttr)) +val subExprs = ctx.subexpressionEliminationForWholeStageCodegen(boundUpdateExpr) +val effectiveCodes = subExprs.codes.mkString("\n") +val fastRowEvals = ctx.withSubExprEliminationExprs(subExprs.states) { + boundUpdateExpr.map(_.genCode(ctx)) +} +val updateFastRow = fastRowEvals.zipWithIndex.map { case (ev, i) => + val dt = updateExpr(i).dataType + ctx.updateColumn( +fastRowBuffer, dt, i, ev, updateExpr(i).nullable, isVectorizedHashMapEnabled) +} + +// If fast hash map is on, we first generate code to update row in fast hash map, if the +// previous loop up hit fast hash map. Otherwise, update row in regular hash map. +s""" + |if ($fastRowBuffer != null) { + | // common sub-expressions + | $effectiveCodes + | // evaluate aggregate function + | ${evaluateVariables(fastRowEvals)} + | // update fast row + | ${updateFastRow.mkString("\n").trim} + |} else { + | $updateRowInRegularHashMap + |} + """.stripMargin + } else { +updateRowInRegularHashMap --- End diff -- Previously we always declared the `fastRowBuffer` and had the `if (fastRowBuffer != null)` check. Now we don't generate them if the fast hash map is not enabled. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
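[A distilled, self-contained sketch of the change described in this comment; simplified, not the actual `HashAggregateExec` source. The fast-path branch and its null check are emitted only when the fast hash map is enabled.]

```scala
def updateRowCode(isFastHashMapEnabled: Boolean, fastPath: String, regularPath: String): String = {
  if (isFastHashMapEnabled) {
    // Fast map on: try the fast row buffer first, fall back to the regular map.
    s"""
       |if (fastRowBuffer != null) {
       |  $fastPath
       |} else {
       |  $regularPath
       |}
     """.stripMargin
  } else {
    // Fast map off: no fastRowBuffer declaration and no null check in the generated code.
    regularPath
  }
}
```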
[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19869#discussion_r154528498 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -784,86 +774,65 @@ case class HashAggregateExec( ("true", "true", "", "") } -// We first generate code to probe and update the fast hash map. If the probe is -// successful the corresponding fast row buffer will hold the mutable row -val findOrInsertFastHashMap: Option[String] = { +val findOrInsertRegularHashMap: String = + s""" + |// generate grouping key + |${unsafeRowKeyCode.code.trim} + |${hashEval.code.trim} + |if ($checkFallbackForBytesToBytesMap) { + | // try to get the buffer from hash map + | $unsafeRowBuffer = + | $hashMapTerm.getAggregationBufferFromUnsafeRow($unsafeRowKeys, ${hashEval.value}); + |} + |// Can't allocate buffer from the hash map. Spill the map and fallback to sort-based + |// aggregation after processing all input rows. + |if ($unsafeRowBuffer == null) { + | if ($sorterTerm == null) { + |$sorterTerm = $hashMapTerm.destructAndCreateExternalSorter(); + | } else { + | $sorterTerm.merge($hashMapTerm.destructAndCreateExternalSorter()); + | } + | $resetCounter + | // the hash map had be spilled, it should have enough memory now, + | // try to allocate buffer again. + | $unsafeRowBuffer = $hashMapTerm.getAggregationBufferFromUnsafeRow( + |$unsafeRowKeys, ${hashEval.value}); + | if ($unsafeRowBuffer == null) { + |// failed to allocate the first page + |throw new OutOfMemoryError("No enough memory for aggregation"); + | } + |} + """.stripMargin + +val findOrInsertHashMap: String = { if (isFastHashMapEnabled) { -Option( - s""" - | - |if ($checkFallbackForGeneratedHashMap) { - | ${fastRowKeys.map(_.code).mkString("\n")} - | if (${fastRowKeys.map("!" + _.isNull).mkString(" && ")}) { - |$fastRowBuffer = $fastHashMapTerm.findOrInsert( - |${fastRowKeys.map(_.value).mkString(", ")}); - | } - |} - """.stripMargin) +// If fast hash map is on, we first generate code to probe and update the fast hash map. +// If the probe is successful the corresponding fast row buffer will hold the mutable row. +s""" + |if ($checkFallbackForGeneratedHashMap) { --- End diff -- moved from https://github.com/apache/spark/pull/19869/files#diff-2eb948516b5beaeb746aadac27fbd5b4L794 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19869 **[Test build #84403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84403/testReport)** for PR 19869 at commit [`174f4ec`](https://github.com/apache/spark/commit/174f4ec2ea000de01da6f494db366dc3ff58ccc4).
[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19869#discussion_r154528455

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
```diff
@@ -784,86 +774,65 @@ case class HashAggregateExec(
       ("true", "true", "", "")
     }

-    // We first generate code to probe and update the fast hash map. If the probe is
-    // successful the corresponding fast row buffer will hold the mutable row
-    val findOrInsertFastHashMap: Option[String] = {
+    val findOrInsertRegularHashMap: String =
```
--- End diff --

moved from https://github.com/apache/spark/pull/19869/files#diff-2eb948516b5beaeb746aadac27fbd5b4L833
[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19869#discussion_r154528431

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
```diff
@@ -672,48 +668,56 @@ case class HashAggregateExec(

     def outputFromRowBasedMap: String = {
       s"""
-        while ($iterTermForFastHashMap.next()) {
-          $numOutput.add(1);
-          UnsafeRow $keyTerm = (UnsafeRow) $iterTermForFastHashMap.getKey();
-          UnsafeRow $bufferTerm = (UnsafeRow) $iterTermForFastHashMap.getValue();
-          $outputFunc($keyTerm, $bufferTerm);
-
-          if (shouldStop()) return;
-        }
-        $fastHashMapTerm.close();
-      """
+         |while ($iterTermForFastHashMap.next()) {
+         |  UnsafeRow $keyTerm = (UnsafeRow) $iterTermForFastHashMap.getKey();
+         |  UnsafeRow $bufferTerm = (UnsafeRow) $iterTermForFastHashMap.getValue();
+         |  $outputFunc($keyTerm, $bufferTerm);
+         |
+         |  if (shouldStop()) return;
+         |}
+         |$fastHashMapTerm.close();
+       """.stripMargin
     }

     // Iterate over the aggregate rows and convert them from ColumnarRow to UnsafeRow
     def outputFromVectorizedMap: String = {
       val row = ctx.freshName("fastHashMapRow")
       ctx.currentVars = null
       ctx.INPUT_ROW = row
-      val generateKeyRow = GenerateUnsafeProjection.createCode(ctx,
-        groupingKeySchema.toAttributes.zipWithIndex
+      val generateKeyRow = GenerateUnsafeProjection.createCode(ctx,
+        groupingKeySchema.toAttributes.zipWithIndex
           .map { case (attr, i) => BoundReference(i, attr.dataType, attr.nullable) }
-      )
-      val generateBufferRow = GenerateUnsafeProjection.createCode(ctx,
-        bufferSchema.toAttributes.zipWithIndex
-          .map { case (attr, i) =>
-            BoundReference(groupingKeySchema.length + i, attr.dataType, attr.nullable) })
-      s"""
-        | while ($iterTermForFastHashMap.hasNext()) {
-        |   $numOutput.add(1);
-        |   org.apache.spark.sql.execution.vectorized.ColumnarRow $row =
-        |     (org.apache.spark.sql.execution.vectorized.ColumnarRow)
-        |       $iterTermForFastHashMap.next();
-        |   ${generateKeyRow.code}
-        |   ${generateBufferRow.code}
-        |   $outputFunc(${generateKeyRow.value}, ${generateBufferRow.value});
-        |
-        |   if (shouldStop()) return;
-        | }
-        |
-        | $fastHashMapTerm.close();
-      """.stripMargin
+      )
+      val generateBufferRow = GenerateUnsafeProjection.createCode(ctx,
+        bufferSchema.toAttributes.zipWithIndex.map { case (attr, i) =>
+          BoundReference(groupingKeySchema.length + i, attr.dataType, attr.nullable)
+        })
+      val columnarRowCls = classOf[ColumnarRow].getName
+      s"""
+         |while ($iterTermForFastHashMap.hasNext()) {
+         |  $columnarRowCls $row = ($columnarRowCls) $iterTermForFastHashMap.next();
+         |  ${generateKeyRow.code}
+         |  ${generateBufferRow.code}
+         |  $outputFunc(${generateKeyRow.value}, ${generateBufferRow.value});
+         |
+         |  if (shouldStop()) return;
+         |}
+         |
+         |$fastHashMapTerm.close();
+       """.stripMargin
     }

+    def outputFromRegularHashMap: String = {
+      s"""
+         |while ($iterTerm.next()) {
```
--- End diff --

moved from https://github.com/apache/spark/pull/19869/files#diff-2eb948516b5beaeb746aadac27fbd5b4L731
[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19869#discussion_r154528409

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
```diff
@@ -672,48 +668,56 @@ case class HashAggregateExec(

     def outputFromRowBasedMap: String = {
       s"""
-        while ($iterTermForFastHashMap.next()) {
-          $numOutput.add(1);
-          UnsafeRow $keyTerm = (UnsafeRow) $iterTermForFastHashMap.getKey();
-          UnsafeRow $bufferTerm = (UnsafeRow) $iterTermForFastHashMap.getValue();
-          $outputFunc($keyTerm, $bufferTerm);
-
-          if (shouldStop()) return;
-        }
-        $fastHashMapTerm.close();
-      """
+         |while ($iterTermForFastHashMap.next()) {
+         |  UnsafeRow $keyTerm = (UnsafeRow) $iterTermForFastHashMap.getKey();
+         |  UnsafeRow $bufferTerm = (UnsafeRow) $iterTermForFastHashMap.getValue();
+         |  $outputFunc($keyTerm, $bufferTerm);
+         |
+         |  if (shouldStop()) return;
+         |}
+         |$fastHashMapTerm.close();
+       """.stripMargin
     }

     // Iterate over the aggregate rows and convert them from ColumnarRow to UnsafeRow
     def outputFromVectorizedMap: String = {
       val row = ctx.freshName("fastHashMapRow")
       ctx.currentVars = null
       ctx.INPUT_ROW = row
-      val generateKeyRow = GenerateUnsafeProjection.createCode(ctx,
-        groupingKeySchema.toAttributes.zipWithIndex
```
--- End diff --

The indentation was wrong previously.
[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19869#discussion_r154528386

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
```diff
@@ -444,6 +444,7 @@ case class HashAggregateExec(
     val funcName = ctx.freshName("doAggregateWithKeysOutput")
     val keyTerm = ctx.freshName("keyTerm")
     val bufferTerm = ctx.freshName("bufferTerm")
+    val numOutput = metricTerm(ctx, "numOutputRows")
```
--- End diff --

Update the `numOutputRows` metric in the shared result function instead of doing it separately for both the fast hash map and the regular hash map.
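A minimal sketch (hypothetical signature and names, not the real `HashAggregateExec` code) of what this buys: the metric bump lives once in the generated output function, so every map variant that calls it is counted automatically.

```scala
// Sketch: count output rows once, in the shared result function, instead of in
// each map's output loop. `numOutput` and `consume` stand in for the real terms.
def genOutputFunc(funcName: String, numOutput: String, consume: String): String = {
  s"""
     |private void $funcName(UnsafeRow keyTerm, UnsafeRow bufferTerm) {
     |  $numOutput.add(1);
     |  $consume
     |}
   """.stripMargin
}
```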
[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/19869 [SPARK-22677][SQL] cleanup whole stage codegen for hash aggregate

## What changes were proposed in this pull request?

The `HashAggregateExec` whole stage codegen path is a little messy and hard to understand; this change cleans it up a bit, especially the fast hash map part.

## How was this patch tested?

existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark hash-agg

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19869.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19869

commit 174f4ec2ea000de01da6f494db366dc3ff58ccc4
Author: Wenchen Fan
Date: 2017-11-24T12:43:37Z

    cleanup whole stage codegen for hash aggregate
[GitHub] spark pull request #19831: [SPARK-22626][SQL] It deals with wrong Hive's sta...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19831
[GitHub] spark issue #19831: [SPARK-22626][SQL] It deals with wrong Hive's statistics...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19831 thanks, merging to master!
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19865
I am neutral on how to fix this problem in the current master. What I have been saying from the beginning is that this problem exists not only in #19811 but also in the current master. I am happy to agree that we fix the invalid cases in the current master.
[GitHub] spark pull request #19860: [SPARK-22669][SQL] Avoid unnecessary function cal...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19860
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19865
@kiszk I think that in the cases where you hit this, it may well have been done deliberately, relying on the way Java behaves, i.e. the local variable is used and the global one is left untouched there. It may simply have been designed that way, and forcing a different behavior would not be the expected one. I totally agree with @cloud-fan that we should fix the problem where the variables are created, if there is any problem. We could also decide that such a situation should be avoided for clarity, and therefore change the places where this pattern shows up; I am neutral on that. But I disagree with creating a situation which is counterintuitive.
[GitHub] spark issue #19860: [SPARK-22669][SQL] Avoid unnecessary function calls in c...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19860 LGTM, merging to master!
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19865
@cloud-fan I see. As I pointed out, there are several places that set a global variable as `ExprCode.value`, which is then passed to successor operations. Should we make the lifetime of such globals local to an operation?
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19865
I think for this case, shouldn't we fix it and not pass a global variable into `splitExpressions`?
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19865
@mgaido91 @viirya As you can see, we hit an assertion failure. Here is evidence that we do pass a global variable as an argument of a split function; in practice, we never guaranteed that we do not pass one. A [value](github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala) was declared as a global variable. Then it was passed as `ExprCode.value`, and finally `value` was passed as an argument in `CodeGenContext.splitFunction`. Fortunately, these `expressions` did not update the global variable, so the result was functionally correct. In general, it is hard to ensure there is no update in the `expressions`, and of course we do not want to use regular expressions to detect it. As you said, how do we ensure that we do not pass a global variable?
```
** File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/feature.py", line 1205, in __main__.MinHashLSH
Failed example: ...
Caused by: java.lang.AssertionError: assertion failed: smj_value16 in arguments should not be declared as a global variable
  at scala.Predef$.assert(Predef.scala:170)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.org$apache$spark$sql$catalyst$expressions$codegen$CodegenContext$$isDeclaredMutableState(CodeGenerator.scala:226)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext$$anonfun$9.apply(CodeGenerator.scala:854)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext$$anonfun$9.apply(CodeGenerator.scala:854)
  at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
  at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.splitExpressions(CodeGenerator.scala:853)
  at org.apache.spark.sql.catalyst.expressions.HashExpression.genHashForStruct(hash.scala:395)
  at org.apache.spark.sql.catalyst.expressions.HashExpression.computeHashWithTailRec(hash.scala:421)
  at org.apache.spark.sql.catalyst.expressions.HashExpression.computeHash(hash.scala:429)
  at org.apache.spark.sql.catalyst.expressions.HashExpression$$anonfun$1.apply(hash.scala:276)
  at org.apache.spark.sql.catalyst.expressions.HashExpression$$anonfun$1.apply(hash.scala:273)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.catalyst.expressions.HashExpression.doGenCode(hash.scala:273)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:107)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:104)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:772)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsume(HashAggregateExec.scala:173)
  at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:162)
  at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:35)
  at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:65)
  at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:162)
  at org.apache.spark.sql.execution.joins.SortMergeJoinExec.consume(SortMergeJoinExec.scala:36)
  at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doProduce(SortMergeJoinExec.scala:626)
  at
```
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19865
Making a variable global needs to be done manually (by calling `ctx.mutableState`), and splitting the code into methods also needs to be done manually (by calling `ctx.splitExpressions`). If we hit a problem here, it's probably due to misuse. Can we just check the codebase, find these invalid cases, and fix them? We should probably also document on `ctx.splitExpressions` that global variables must not appear in the parameter list.
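For readers following along: the assertion guards against a shadowing hazard. Below is a minimal Scala sketch (illustrative only, not Spark's generated Java) of why passing a "global" into a split-out method is dangerous: the parameter hides the field, so updates made inside the method are lost to the caller.

```scala
// Sketch of the shadowing hazard behind the assertion. In generated Java, a
// mutable state is a field of the generated class; if a split function takes
// it as a parameter, assignments inside the method touch only the local copy.
object ShadowingSketch {
  private var state: Int = 0 // plays the role of a codegen mutable state

  // The "split" function receives the global's current value as an argument.
  private def splitFunc(stateArg: Int): Unit = {
    var local = stateArg
    local += 1 // this update is invisible outside this method
  }

  def main(args: Array[String]): Unit = {
    splitFunc(state)
    println(state) // prints 0: the update inside splitFunc was lost
  }
}
```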
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19651
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19651
thanks, merging to master!

Followups:
1. add a config to use the new ORC by default
2. move the ORC tests to sql/core
3. columnar ORC reader
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19865 Merged build finished. Test FAILed.
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19865 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84402/ Test FAILed.
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19865 **[Test build #84402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84402/testReport)** for PR 19865 at commit [`f63cd3a`](https://github.com/apache/spark/commit/f63cd3a37f828f86aa2a991bd611813a155645d5).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19824: [SPARK][STREAMING] Invoke onBatchCompletion() onl...
Github user victor-wong closed the pull request at: https://github.com/apache/spark/pull/19824
[GitHub] spark issue #19824: [SPARK][STREAMING] Invoke onBatchCompletion() only when ...
Github user victor-wong commented on the issue: https://github.com/apache/spark/pull/19824 @CodingCat Thank you:)
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19865 **[Test build #84402 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84402/testReport)** for PR 19865 at commit [`f63cd3a`](https://github.com/apache/spark/commit/f63cd3a37f828f86aa2a991bd611813a155645d5).
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19865 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84401/ Test FAILed.
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19865 Merged build finished. Test FAILed.
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19865 **[Test build #84401 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84401/testReport)** for PR 19865 at commit [`ad86986`](https://github.com/apache/spark/commit/ad869866f4aa28eceaf30a1fa09f3d5930a726ad).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19867: [SPARK-22675] [SQL] Deduplicate PropagateTypes in TypeCo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19867 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84400/ Test PASSed.
[GitHub] spark issue #19867: [SPARK-22675] [SQL] Deduplicate PropagateTypes in TypeCo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19867 Merged build finished. Test PASSed.
[GitHub] spark issue #19867: [SPARK-22675] [SQL] Deduplicate PropagateTypes in TypeCo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19867 **[Test build #84400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84400/testReport)** for PR 19867 at commit [`fc3a24e`](https://github.com/apache/spark/commit/fc3a24e039c905fdd263d2ece2088ce152887fc7).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19865 **[Test build #84401 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84401/testReport)** for PR 19865 at commit [`ad86986`](https://github.com/apache/spark/commit/ad869866f4aa28eceaf30a1fa09f3d5930a726ad).
[GitHub] spark issue #19860: [SPARK-22669][SQL] Avoid unnecessary function calls in c...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19860
@kiszk @viirya I ran the following performance test:
```
val a = (1 to 10).map(x => 1).toDS
val filtered = a.where($"value".isin((1 to 10): _*))
(1 to 20).map(x => time(filtered.count)).sum / 20
// where time is a small helper that measures the execution time of its argument
```
Before the PR the average execution time over the 20 trials is 3.428 s, while after the PR it is 3.121 s (on OSX, 2.8 GHz Intel Core i7). This means about a 10% improvement of the overall performance in this case.
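The `time` helper is not shown in the comment; a minimal sketch of such a function (an assumption on my part, not the author's actual code) could look like this:

```scala
// Measures the wall-clock time, in seconds, taken to evaluate `body`.
def time[T](body: => T): Double = {
  val start = System.nanoTime()
  body
  (System.nanoTime() - start) / 1e9
}
```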
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19868 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84399/ Test FAILed.