[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19884
**[Test build #85160 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85160/testReport)** for PR 19884 at commit [`0047f7a`](https://github.com/apache/spark/commit/0047f7a6560bfbb46d7ee28df0c2781f7538b907).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20031: [SPARK-22844][R] Adds date_trunc in R API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20031 Merged build finished. Test PASSed.
[GitHub] spark issue #20031: [SPARK-22844][R] Adds date_trunc in R API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20031
**[Test build #85177 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85177/testReport)** for PR 20031 at commit [`1c3e956`](https://github.com/apache/spark/commit/1c3e956313b78da492f917c003c38e981cce7877).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20031: [SPARK-22844][R] Adds date_trunc in R API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20031 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85177/
[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20020 Merged build finished. Test PASSed.
[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20020
**[Test build #85161 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85161/testReport)** for PR 20020 at commit [`e25a9eb`](https://github.com/apache/spark/commit/e25a9eb285d56a771a56b77534413be59b9f111b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `trait DataWritingCommand extends Command `
   * `case class DataWritingCommandExec(cmd: DataWritingCommand, children: Seq[SparkPlan])`
[GitHub] spark issue #20021: [SPARK-22668][SQL] Ensure no global variables in argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20021 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85164/
[GitHub] spark issue #20021: [SPARK-22668][SQL] Ensure no global variables in argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20021 Merged build finished. Test PASSed.
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user foxish commented on the issue: https://github.com/apache/spark/pull/19954
> I don't think they are independent as architecturally they make sense together and represent a single concern: enabling use of remote dependencies through init-containers. Missing any one of the three makes the feature unusable. I would also argue that it won't necessarily make review easier as reviewers need to mentally connect them together to make sense of each change set.

I agree with this. This is pretty much one cohesive unit, and splitting it up will probably make it harder to understand. From your comments @vanzin, it seems we definitely do need a good refactor here, and the community can undertake that in Q1 2018. This approach and code have been functionally tested over the last 3 releases of our fork, and I'd be fairly confident about its efficacy - broad changes at this point seem riskier to me from a 2.3 release perspective, given that we're still in the process of improving spark-k8s integration testing coverage against apache/spark. cc/ @mccheah
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19884 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85165/
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19884 Merged build finished. Test FAILed.
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19884
**[Test build #85165 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85165/testReport)** for PR 19884 at commit [`d92ae90`](https://github.com/apache/spark/commit/d92ae90e05f55955eaad8e7f55e6324bf333a6bc).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18029 **[Test build #85181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85181/testReport)** for PR 18029 at commit [`3c16c47`](https://github.com/apache/spark/commit/3c16c478257c8aed61b1cef4d75360b8bb8b166d).
[GitHub] spark issue #19991: [SPARK-22801][ML][PYSPARK] Allow FeatureHasher to treat ...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/19991 @holdenk @sethah any other comments?
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19977 retest this please
[GitHub] spark issue #20021: [SPARK-22668][SQL] Ensure no global variables in argumen...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20021
> I checked some call sites. Here is one example that `extraArguments` has `ev.value` instead of local variable.

Hey, `ev.value` is not from children; it's the output of the current expression, which we can make sure is a local variable, e.g. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L296
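To make the distinction concrete, here is a self-contained toy sketch, not Catalyst code; `evValue` and `helper` are invented names standing in for `ev.value` and a split-out generated method:

```scala
// Toy sketch: the "current expression" owns its output variable, so when
// generated code is split into helper methods, the output can be passed
// explicitly as an argument, unlike variables owned by a child's code block.
object SplitSketch {
  def currentExpression(): Int = {
    var evValue: Int = 0        // analogous to ev.value: a local of this method
    evValue = helper(evValue)   // safe: handed to the split-out method as a parameter
    evValue
  }

  private def helper(value: Int): Int = value + 1 // stands in for a split-out method

  def main(args: Array[String]): Unit = println(currentExpression())
}
```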
[GitHub] spark pull request #19977: [SPARK-22771][SQL] Concatenate binary inputs into...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19977#discussion_r158004864
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -566,6 +568,21 @@ object TypeCoercion { } } + /** + * When all inputs in [[Concat]] are binary, coerces an output type to binary + */ + case class ConcatCoercion(conf: SQLConf) extends TypeCoercionRule {
--- End diff --
I think we should do it in this PR, because this is a new requirement for the new behavior introduced in this PR.
[GitHub] spark pull request #19977: [SPARK-22771][SQL] Concatenate binary inputs into...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19977#discussion_r158005532
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -566,6 +568,21 @@ object TypeCoercion { } } + /** + * When all inputs in [[Concat]] are binary, coerces an output type to binary + */ + case class ConcatCoercion(conf: SQLConf) extends TypeCoercionRule {
--- End diff --
ok
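For readers following along, a self-contained toy sketch of the coercion behavior under discussion; the `DType`/`Expr` types are invented for illustration, this is not the actual Catalyst rule:

```scala
object ConcatCoercionSketch {
  sealed trait DType
  case object BinaryT extends DType
  case object StringT extends DType

  final case class Expr(dataType: DType)

  // Mirrors the rule's intent: all-binary inputs keep a binary result type;
  // any non-binary input falls back to string concatenation.
  def coerceConcat(children: Seq[Expr]): (Seq[Expr], DType) =
    if (children.nonEmpty && children.forall(_.dataType == BinaryT)) {
      (children, BinaryT)
    } else {
      (children.map(_ => Expr(StringT)), StringT)
    }

  def main(args: Array[String]): Unit = {
    println(coerceConcat(Seq(Expr(BinaryT), Expr(BinaryT)))._2) // BinaryT
    println(coerceConcat(Seq(Expr(BinaryT), Expr(StringT)))._2) // StringT
  }
}
```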
[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19498 Hi @brkyvz, could you take a look please?
[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19498 retest this please
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20023 Ideally we should avoid changing behaviors as much as we can, but since this behavior is from Hive and Hive has also changed it, it might be OK to follow Hive and change it too? cc @hvanhovell too
[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19498 **[Test build #85184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85184/testReport)** for PR 19498 at commit [`174ec21`](https://github.com/apache/spark/commit/174ec2139a7e0af049e2954494525fd3fff145e2).
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @cloud-fan yes, Hive changed it and, more importantly, at the moment we are not compliant with the SQL standard. So currently Spark is returning results which differ from Hive's and are not compliant with the SQL standard. This is why I proposed this change.
[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...
Github user sujithjay commented on the issue: https://github.com/apache/spark/pull/20002 @tgravescs, could you please take a look when you have some time?
[GitHub] spark pull request #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes...
GitHub user foxish opened a pull request: https://github.com/apache/spark/pull/20032 [SPARK-22845] [Scheduler] Modify spark.kubernetes.allocation.batch.delay to take time instead of int

## What changes were proposed in this pull request?
Fixes a configuration that took an int but should take a time value. Discussion in https://github.com/apache/spark/pull/19946#discussion_r156682354 Made the granularity milliseconds as opposed to seconds, since there's a use-case for sub-second reactions to scale up rapidly, especially with dynamic allocation.

## How was this patch tested?
TODO: manual run of integration tests against this PR. PTAL cc/ @mccheah @liyinan926 @kimoonkim @vanzin @mridulm @jiangxb1987 @ueshin

You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/apache-spark-on-k8s/spark fix-time-conf
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20032.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20032

commit 48a3326faaea69bf74d97d028bffdd0552777ffe
Author: foxish
Date: 2017-12-20T12:03:07Z
Change config to support millisecond based timeconf
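A change like this presumably switches the config definition to Spark's internal `ConfigBuilder.timeConf`; a minimal sketch of what that looks like follows. The package, `.doc` text, and `1s` default here are assumptions for illustration, not taken from the PR, and `ConfigBuilder` is `private[spark]`, so this compiles only inside Spark's own source tree:

```scala
package org.apache.spark.deploy.k8s // assumed package, for illustration

import java.util.concurrent.TimeUnit

import org.apache.spark.internal.config.ConfigBuilder

private[spark] object KubernetesConfigSketch {
  // A time-based conf accepts values with units ("500ms", "1s", ...) and is
  // read back in milliseconds, instead of a bare int interpreted as seconds.
  val KUBERNETES_ALLOCATION_BATCH_DELAY =
    ConfigBuilder("spark.kubernetes.allocation.batch.delay")
      .doc("Time to wait between each round of executor allocation.")
      .timeConf(TimeUnit.MILLISECONDS)
      .createWithDefaultString("1s")
}
```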
[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20032 **[Test build #85185 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85185/testReport)** for PR 20032 at commit [`48a3326`](https://github.com/apache/spark/commit/48a3326faaea69bf74d97d028bffdd0552777ffe).
[GitHub] spark pull request #19946: [SPARK-22648] [Scheduler] Spark on Kubernetes - D...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/19946#discussion_r158008588
--- Diff: docs/running-on-kubernetes.md ---
@@ -0,0 +1,498 @@
+---
+layout: global
+title: Running Spark on Kubernetes
+---
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Spark can run on clusters managed by [Kubernetes](https://kubernetes.io). This feature makes use of the new experimental native
+Kubernetes scheduler that has been added to Spark.
+
+# Prerequisites
+
+* A runnable distribution of Spark 2.3 or above.
+* A running Kubernetes cluster at version >= 1.6 with access configured to it using
+[kubectl](https://kubernetes.io/docs/user-guide/prereqs/). If you do not already have a working Kubernetes cluster,
+you may setup a test cluster on your local machine using
+[minikube](https://kubernetes.io/docs/getting-started-guides/minikube/).
+  * We recommend using the latest release of minikube with the DNS addon enabled.
+* You must have appropriate permissions to list, create, edit and delete
+[pods](https://kubernetes.io/docs/user-guide/pods/) in your cluster. You can verify that you can list these resources
+by running `kubectl auth can-i <list|create|edit|delete> pods`.
+  * The service account credentials used by the driver pods must be allowed to create pods, services and configmaps.
+* You must have [Kubernetes DNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/) configured in your cluster.
+
+# How it works
+
+spark-submit can be directly used to submit a Spark application to a Kubernetes cluster. The submission mechanism works as follows:
+
+* Spark creates a Spark driver running within a [Kubernetes pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/).
+* The driver creates executors which are also running within Kubernetes pods and connects to them, and executes application code.
+* When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists
+logs and remains in "completed" state in the Kubernetes API till it's eventually garbage collected or manually cleaned up.
+
+Note that in the completed state, the driver pod does *not* use any computational or memory resources.
+
+The driver and executor pod scheduling is handled by Kubernetes. It will be possible to affect Kubernetes scheduling
+decisions for driver and executor pods using advanced primitives like
+[node selectors](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector)
+and [node/pod affinities](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity)
+in a future release.
+
+# Submitting Applications to Kubernetes
+
+## Docker Images
+
+Kubernetes requires users to supply images that can be deployed into containers within pods. The images are built to
+be run in a container runtime environment that Kubernetes supports. Docker is a container runtime environment that is
+frequently used with Kubernetes. With Spark 2.3, there are Dockerfiles provided in the runnable distribution that can be customized
+and built for your usage.
+
+You may build these docker images from sources.
+There is a script, `sbin/build-push-docker-images.sh` that you can use to build and push
+customized spark distribution images consisting of all the above components.
+
+Example usage is:
+
+    ./sbin/build-push-docker-images.sh -r <repo> -t my-tag build
+    ./sbin/build-push-docker-images.sh -r <repo> -t my-tag push
+
+Docker files are under the `dockerfiles/` directory and can be customized further before
+building using the supplied script, or manually.
+
+## Cluster Mode
+
+To launch Spark Pi in cluster mode,
+
+{% highlight bash %}
+$ bin/spark-submit \
+  --deploy-mode cluster \
+  --class org.apache.spark.examples.SparkPi \
+  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
+  --conf spark.kubernetes.namespace=default \
+  --conf spark.executor.instances=5 \
+  --conf spark.app.name=spark-pi \
+  --conf spark.kubernetes.driver.docker.image=<driver-image> \
+  --conf spark.kubernetes.executor.docker.image=<executor-image> \
+  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
+{% endhighlight %}
+
+The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
+`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url>`. Prefixing the
+master string with `k8s://` will cause the Spark application to launch on the Kubernetes cluster, with the API server
+being
[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18029
**[Test build #85181 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85181/testReport)** for PR 18029 at commit [`3c16c47`](https://github.com/apache/spark/commit/3c16c478257c8aed61b1cef4d75360b8bb8b166d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `public class KinesisInitialPositions `
   * `public static class Latest implements KinesisInitialPosition, Serializable `
   * `public static class TrimHorizon implements KinesisInitialPosition, Serializable `
   * `public static class AtTimestamp implements KinesisInitialPosition, Serializable `
[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18029 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85181/
[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18029 Merged build finished. Test PASSed.
[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...
Github user yashs360 commented on the issue: https://github.com/apache/spark/pull/18029 Hi @brkyvz, I've added the new changes with the Java classes. I had to make the classes serializable so they can be passed to the KinesisReceiver. Please have a look when you get time. Thanks.
[GitHub] spark issue #19946: [SPARK-22648] [Scheduler] Spark on Kubernetes - Document...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19946
**[Test build #85167 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85167/testReport)** for PR 19946 at commit [`74ac5c9`](https://github.com/apache/spark/commit/74ac5c9e5b495d0133e8e1378867a43f2bc1ff4a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19946: [SPARK-22648] [Scheduler] Spark on Kubernetes - Document...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19946 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85167/
[GitHub] spark issue #19946: [SPARK-22648] [Scheduler] Spark on Kubernetes - Document...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19946 Merged build finished. Test PASSed.
[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20032
**[Test build #85185 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85185/testReport)** for PR 20032 at commit [`48a3326`](https://github.com/apache/spark/commit/48a3326faaea69bf74d97d028bffdd0552777ffe).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20032 Merged build finished. Test PASSed.
[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20032 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85185/
[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19498 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85184/
[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19498
**[Test build #85184 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85184/testReport)** for PR 19498 at commit [`174ec21`](https://github.com/apache/spark/commit/174ec2139a7e0af049e2954494525fd3fff145e2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19498 Merged build finished. Test PASSed.
[GitHub] spark issue #20030: [SPARK-10496][CORE] Efficient RDD cumulative sum
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20030
**[Test build #85172 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85172/testReport)** for PR 20030 at commit [`4f1d5e2`](https://github.com/apache/spark/commit/4f1d5e269c5f84f6126fea97c201b6cd6fef461f).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20030: [SPARK-10496][CORE] Efficient RDD cumulative sum
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20030 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85172/
[GitHub] spark issue #20030: [SPARK-10496][CORE] Efficient RDD cumulative sum
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20030 Merged build finished. Test FAILed.
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904
**[Test build #85183 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85183/testReport)** for PR 19904 at commit [`cad2104`](https://github.com/apache/spark/commit/cad210439b7a0bc3eb870f1d68fd96fbd0763aa8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85183/
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Merged build finished. Test PASSed.
[GitHub] spark issue #20008: [SPARK-22822][TEST] Basic tests for WindowFrameCoercion ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20008 **[Test build #85186 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85186/testReport)** for PR 20008 at commit [`19bcca1`](https://github.com/apache/spark/commit/19bcca13ab03c9a5cb5399476e1afac26a30ec49).
[GitHub] spark issue #20021: [SPARK-22668][SQL] Ensure no global variables in argumen...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20021 Oh, you are right. I misunderstood. After our optimizations, the output is also a part of `arguments`. Let me check the others again.
[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19884#discussion_r158206309
--- Diff: python/pyspark/sql/utils.py ---
@@ -110,3 +110,12 @@ def toJArray(gateway, jtype, arr): for i in range(0, len(arr)): jarr[i] = arr[i] return jarr + + +def _require_minimum_pyarrow_version():
--- End diff --
@ueshin did we do the same thing for pandas?
[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20043 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85232/
[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20043 Merged build finished. Test FAILed.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158210425
--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
@@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* + * Save DataFrame to Hive Managed table as Parquet format + * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default + * warehouse location will be used to store Hive table Data. + * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path; + * You don't have to explicitly give location for each table, every tables under specified schema will be located at + * location given while creating schema. + * 2. Create Hive Managed table with storage format as 'Parquet' + * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET; + */ +val hiveTableDF = sql("SELECT * FROM records").toDF()
--- End diff --
actually, I think `spark.table("records")` is a better example.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158210374
--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
@@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* + * Save DataFrame to Hive Managed table as Parquet format + * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default + * warehouse location will be used to store Hive table Data. + * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path; + * You don't have to explicitly give location for each table, every tables under specified schema will be located at + * location given while creating schema. + * 2. Create Hive Managed table with storage format as 'Parquet' + * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET; + */ +val hiveTableDF = sql("SELECT * FROM records").toDF()
--- End diff --
`.toDF` is not needed
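Taken together, the two review suggestions above would turn the example into something like this sketch, assuming the surrounding example's `spark` session:

```scala
import org.apache.spark.sql.SaveMode

// spark.table("records") replaces sql("SELECT * FROM records"), and the
// redundant .toDF() is dropped: spark.table already returns a DataFrame.
val hiveTableDF = spark.table("records")
hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
```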
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19954 I'll finish reading this by Friday, thanks!
[GitHub] spark issue #20035: [SPARK-22848][SQL] Eliminate mutable state from Stack
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20035 I think the test failure is not related to this change, but to the ongoing work to upgrade pyarrow.
[GitHub] spark issue #20041: [SPARK-22042] [FOLLOW-UP] [SQL] ReorderJoinPredicates ca...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/20041 I checked the test case failure but I don't think it's related to this PR.
```
org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.(It is not a test it is a sbt.testing.SuiteSelector)
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 651 times over 10.008601144 seconds. Last failure message: There are 1 possibly leaked file streams..
```
[GitHub] spark issue #20041: [SPARK-22042] [FOLLOW-UP] [SQL] ReorderJoinPredicates ca...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/20041 Jenkins retest this please
[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19884#discussion_r158206546
--- Diff: python/pyspark/sql/utils.py ---
@@ -110,3 +110,12 @@ def toJArray(gateway, jtype, arr): for i in range(0, len(arr)): jarr[i] = arr[i] return jarr + + +def _require_minimum_pyarrow_version():
--- End diff --
No. I just checked if `ImportError` occurred or not. We should do the same thing for pandas later.
[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20043 cc @kiszk @cloud-fan
[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19884#discussion_r158208592
--- Diff: python/pyspark/sql/functions.py ---
@@ -2141,22 +2141,22 @@ def pandas_udf(f=None, returnType=None, functionType=None): >>> from pyspark.sql.functions import pandas_udf, PandasUDFType >>> from pyspark.sql.types import IntegerType, StringType - >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) - >>> @pandas_udf(StringType()) + >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) # doctest: +SKIP + >>> @pandas_udf(StringType()) # doctest: +SKIP ... def to_upper(s): ... return s.str.upper() ... - >>> @pandas_udf("integer", PandasUDFType.SCALAR) + >>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP ... def add_one(x): ... return x + 1 ... - >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) + >>> df = spark.createDataFrame([(1, "John", 21)], ("id", "name", "age")) # doctest: +SKIP
--- End diff --
The name change shouldn't have been committed; I'll change it back. I don't think we can make the doctests conditional on whether pandas/pyarrow is installed, so unless we make these required dependencies and have them installed on all the workers, we need to skip them.
[GitHub] spark pull request #20043: [SPARK-22856][SQL] Add wrappers for codegen outpu...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20043#discussion_r158209659
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -56,7 +56,36 @@ import org.apache.spark.util.{ParentClassLoader, Utils} * @param value A term for a (possibly primitive) value of the result of the evaluation. Not * valid if `isNull` is set to `true`. */ -case class ExprCode(var code: String, var isNull: String, var value: String) +case class ExprCode(var code: String, var isNull: ExprValue, var value: ExprValue) + + +// An abstraction that represents the evaluation result of [[ExprCode]]. +abstract class ExprValue + +object ExprValue { + implicit def exprValueToString(exprValue: ExprValue): String = exprValue.toString +} + +// A literal evaluation of [[ExprCode]]. +case class LiteralValue(val value: String) extends ExprValue { + override def toString: String = value +} + +// A variable evaluation of [[ExprCode]]. +case class VariableValue(val variableName: String) extends ExprValue { + override def toString: String = variableName +} + +// A statement evaluation of [[ExprCode]]. +case class StatementValue(val statement: String) extends ExprValue { + override def toString: String = statement +} + +// A global variable evaluation of [[ExprCode]]. +case class GlobalValue(val value: String) extends ExprValue {
--- End diff --
for compacted global variables, we may get something like `arr[1]` while `arr` is a global variable. Is `arr[1]` a statement or global variable?
[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20043 **[Test build #85241 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85241/testReport)** for PR 20043 at commit [`d120750`](https://github.com/apache/spark/commit/d120750ff61bb066e7ceb628f3356fa37af462f5).
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19884 **[Test build #85242 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85242/testReport)** for PR 19884 at commit [`ae84c84`](https://github.com/apache/spark/commit/ae84c8454875906e488b895e18ad78ddf6e9fbc9).
[GitHub] spark pull request #20043: [SPARK-22856][SQL] Add wrappers for codegen outpu...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20043#discussion_r158210849
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -56,7 +56,36 @@ import org.apache.spark.util.{ParentClassLoader, Utils} * @param value A term for a (possibly primitive) value of the result of the evaluation. Not * valid if `isNull` is set to `true`. */ -case class ExprCode(var code: String, var isNull: String, var value: String) +case class ExprCode(var code: String, var isNull: ExprValue, var value: ExprValue) + + +// An abstraction that represents the evaluation result of [[ExprCode]]. +abstract class ExprValue + +object ExprValue { + implicit def exprValueToString(exprValue: ExprValue): String = exprValue.toString +} + +// A literal evaluation of [[ExprCode]]. +case class LiteralValue(val value: String) extends ExprValue { + override def toString: String = value +} + +// A variable evaluation of [[ExprCode]]. +case class VariableValue(val variableName: String) extends ExprValue { + override def toString: String = variableName +} + +// A statement evaluation of [[ExprCode]]. +case class StatementValue(val statement: String) extends ExprValue { + override def toString: String = statement +} + +// A global variable evaluation of [[ExprCode]]. +case class GlobalValue(val value: String) extends ExprValue {
--- End diff --
It is considered a global variable now, as it can be accessed globally and isn't (and can't be) parameterized.
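In terms of the classes quoted in the diff, the case under discussion would be wrapped like this; a standalone sketch reusing the class shapes from the diff (with the `toString` overrides omitted), not the actual Catalyst file:

```scala
// Class shapes copied from the diff above:
abstract class ExprValue
case class VariableValue(variableName: String) extends ExprValue
case class StatementValue(statement: String) extends ExprValue
case class GlobalValue(value: String) extends ExprValue

object ExprValueSketch {
  // Per the reply: an element of a compacted global array is globally
  // accessible and can't be passed as a method parameter, so it is wrapped
  // as a GlobalValue rather than a StatementValue or VariableValue.
  val compactedElement: ExprValue = GlobalValue("arr[1]")
}
```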
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158210754
--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
@@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* + * Save DataFrame to Hive Managed table as Parquet format + * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default + * warehouse location will be used to store Hive table Data. + * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path; + * You don't have to explicitly give location for each table, every tables under specified schema will be located at + * location given while creating schema. + * 2. Create Hive Managed table with storage format as 'Parquet' + * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET; + */ +val hiveTableDF = sql("SELECT * FROM records").toDF() + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") + +/* + * Save DataFrame to Hive External table as compatible parquet format. + * 1. Create Hive External table with storage format as parquet. + * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET; + * Since we are not explicitly providing hive database location, it automatically takes default warehouse location + * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport(). + * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories + * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'. + */ + +// to make Hive parquet format compatible with spark parquet format +spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") +// Multiple parquet files could be created accordingly to volume of data under directory given. +val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records" + hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation) + +// turn on flag for Dynamic Partitioning +spark.sqlContext.setConf("hive.exec.dynamic.partition", "true") +spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict") +// You can create partitions in Hive table, so downstream queries run much faster. +hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key") + .parquet(hiveExternalTableLocation) +/* +If Data volume is very huge, then every partitions would have many small-small files which may harm +downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O. +To improve performance you can create single parquet file under each partition directory using 'repartition' +on partitioned key for Hive table. When you add partition to table, there will be change in table DDL. +Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET; + */ +hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite) + .partitionBy("key").parquet(hiveExternalTableLocation) + +/* + You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal + data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without + full data shuffle. + */ +// coalesce of 10 could create 10 parquet files under each partitions, +// if data is huge and make sense to do partitioning. +hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite)
--- End diff --
ditto
[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server
Github user zuotingbing commented on the issue: https://github.com/apache/spark/pull/20029 It seems that each time we connect to the thrift server through beeline, `SessionState.start(state)` is called twice: once in `HiveSessionImpl:open`, and once in `HiveClientImpl.newSession()` for `sql("use default")`. When the beeline connection is closed, only the HiveSession is closed via `HiveSessionImpl.close()`, but the object from `HiveClientImpl.newSession()` is left over.
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19977 retest this please
[GitHub] spark issue #20035: [SPARK-22848][SQL] Eliminate mutable state from Stack
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20035 Jenkins, retest this please
[GitHub] spark pull request #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documen...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19946#discussion_r158205893
--- Diff: docs/building-spark.md ---
@@ -49,7 +49,7 @@ To create a Spark distribution like those distributed by the to be runnable, use `./dev/make-distribution.sh` in the project root directory. It can be configured with Maven profile settings and so on like the direct Maven build. Example: -./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn +./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
--- End diff --
Yea I don't think you need to block this pr with this.
[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19884#discussion_r158206051
--- Diff: python/pyspark/sql/functions.py ---
@@ -2141,22 +2141,22 @@ def pandas_udf(f=None, returnType=None, functionType=None): >>> from pyspark.sql.functions import pandas_udf, PandasUDFType >>> from pyspark.sql.types import IntegerType, StringType - >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) - >>> @pandas_udf(StringType()) + >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) # doctest: +SKIP + >>> @pandas_udf(StringType()) # doctest: +SKIP ... def to_upper(s): ... return s.str.upper() ... - >>> @pandas_udf("integer", PandasUDFType.SCALAR) + >>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP ... def add_one(x): ... return x + 1 ... - >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) + >>> df = spark.createDataFrame([(1, "John", 21)], ("id", "name", "age")) # doctest: +SKIP
--- End diff --
why change `John Doe` to `John`? And are we going to re-enable these doctests later?
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158205387
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala ---
@@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v))
--- End diff --
Can't we just use `ShortDecimal`, `IntDecimal`...?
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158205620
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala ---
@@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match {
--- End diff --
Is this different from `forType` if applied on `Literal.dataType`?
[GitHub] spark pull request #20035: [SPARK-22848][SQL] Eliminate mutable state from S...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20035
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158206388
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala ---
@@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + // scalastyle:off line.size.limit + /** + * Decimal implementation is based on Hive's one, which is itself inspired to SQLServer's one. + * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the + * corresponding scale is reduced to prevent the integral part of a result from being truncated. + * + * For further reference, please see + * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/. + * + * @param precision + * @param scale + * @return + */ + // scalastyle:on line.size.limit + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { +// Assumptions: +// precision >= scale +// scale >= 0 +if (precision <= MAX_PRECISION) { + // Adjustment only needed when we exceed max precision + DecimalType(precision, scale) +} else { + // Precision/scale exceed maximum precision. Result must be adjusted to MAX_PRECISION. + val intDigits = precision - scale + // If original scale less than MINIMUM_ADJUSTED_SCALE, use original scale value; otherwise + // preserve at least MINIMUM_ADJUSTED_SCALE fractional digits + val minScaleValue = Math.min(scale, MINIMUM_ADJUSTED_SCALE)
--- End diff --
Sounds like `MAXIMUM_ADJUSTED_SCALE` instead of `MINIMUM_ADJUSTED_SCALE`.
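A self-contained sketch of the adjustment being reviewed, with a worked example. `MAX_PRECISION = 38` and `MINIMUM_ADJUSTED_SCALE = 6` follow the Hive/SQLServer rules the code comment cites, and the final `max` step is cut off in the diff above, so its reconstruction here is an assumption:

```scala
object DecimalAdjustSketch {
  val MAX_PRECISION = 38
  val MINIMUM_ADJUSTED_SCALE = 6

  def adjustPrecisionScale(precision: Int, scale: Int): (Int, Int) =
    if (precision <= MAX_PRECISION) {
      (precision, scale)
    } else {
      // Prefer truncating fractional digits over integral digits: give the
      // scale whatever room remains after the integer digits, but never cut
      // it below min(scale, MINIMUM_ADJUSTED_SCALE).
      val intDigits = precision - scale
      val minScaleValue = math.min(scale, MINIMUM_ADJUSTED_SCALE)
      val adjustedScale = math.max(MAX_PRECISION - intDigits, minScaleValue)
      (MAX_PRECISION, adjustedScale)
    }

  def main(args: Array[String]): Unit = {
    // decimal(38,10) * decimal(38,10) naively needs decimal(77,20) by the
    // SQL rules (p1 + p2 + 1, s1 + s2); the adjustment caps precision at 38
    // and reduces the scale to the 6-digit floor instead of overflowing:
    println(adjustPrecisionScale(77, 20)) // prints (38,6)
  }
}
```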
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/19884 I used a workaround for timestamp casts that allows the tests to pass for me locally, and left a note to look into the root cause later. Hopefully this should pass now and we will be good to merge.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158210132
--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
@@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/*
--- End diff --
+1
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158210714
--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
@@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* + * Save DataFrame to Hive Managed table as Parquet format + * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default + * warehouse location will be used to store Hive table Data. + * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path; + * You don't have to explicitly give location for each table, every tables under specified schema will be located at + * location given while creating schema. + * 2. Create Hive Managed table with storage format as 'Parquet' + * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET; + */ +val hiveTableDF = sql("SELECT * FROM records").toDF() + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") + +/* + * Save DataFrame to Hive External table as compatible parquet format. + * 1. Create Hive External table with storage format as parquet. + * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET; + * Since we are not explicitly providing hive database location, it automatically takes default warehouse location + * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport(). + * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories + * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'. + */ + +// to make Hive parquet format compatible with spark parquet format +spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") +// Multiple parquet files could be created accordingly to volume of data under directory given. +val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records" + hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation) + +// turn on flag for Dynamic Partitioning +spark.sqlContext.setConf("hive.exec.dynamic.partition", "true") +spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict") +// You can create partitions in Hive table, so downstream queries run much faster. +hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key") + .parquet(hiveExternalTableLocation) +/* +If Data volume is very huge, then every partitions would have many small-small files which may harm +downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O. +To improve performance you can create single parquet file under each partition directory using 'repartition' +on partitioned key for Hive table. When you add partition to table, there will be change in table DDL. +Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET; + */ +hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
--- End diff --
This is not a standard usage, let's not put it in the example.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158210666 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* + * Save the DataFrame to a Hive managed table in Parquet format. + * 1. Create a Hive database/schema with an explicit HDFS location if you want one; otherwise the default + * warehouse location will be used to store the Hive table data. + * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path; + * You don't have to give an explicit location for each table; every table under the specified schema will be + * stored at the location given when the schema was created. + * 2. Create a Hive managed table stored as Parquet. + * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET; + */ +val hiveTableDF = sql("SELECT * FROM records").toDF() + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") + +/* + * Save the DataFrame to a Hive external table in a compatible Parquet format. + * 1. Create a Hive external table stored as Parquet. + * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET; --- End diff -- it's weird to create an external table without a location. Users may be confused about the difference between a managed table and an external table. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
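To illustrate the distinction raised above: an external table normally carries an explicit LOCATION, and dropping it removes only the metadata, while dropping a managed table deletes the data as well. A minimal sketch, assuming a Hive-enabled SparkSession named spark and a hypothetical path:

// Managed table: data lives under spark.sql.warehouse.dir; DROP TABLE deletes it.
spark.sql("CREATE TABLE IF NOT EXISTS records_managed (key INT, value STRING) STORED AS PARQUET")

// External table: data lives at the given path; DROP TABLE keeps the files.
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS records_external (key INT, value STRING)
    |STORED AS PARQUET
    |LOCATION '/data/records_external'""".stripMargin)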
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19884 **[Test build #85244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85244/testReport)** for PR 19884 at commit [`b0200ef`](https://github.com/apache/spark/commit/b0200efd30c6fe77ec6e57d65f3bc828be0e1802). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19884#discussion_r158212056 --- Diff: python/pyspark/sql/functions.py --- @@ -2141,22 +2141,23 @@ def pandas_udf(f=None, returnType=None, functionType=None): >>> from pyspark.sql.functions import pandas_udf, PandasUDFType >>> from pyspark.sql.types import IntegerType, StringType - >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) - >>> @pandas_udf(StringType()) + >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) # doctest: +SKIP + >>> @pandas_udf(StringType()) # doctest: +SKIP ... def to_upper(s): ... return s.str.upper() ... - >>> @pandas_udf("integer", PandasUDFType.SCALAR) + >>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP ... def add_one(x): ... return x + 1 ... - >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) + >>> df = spark.createDataFrame([(1, "John Doe", 21)], + ... ("id", "name", "age")) # doctest: +SKIP >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ ... .show() # doctest: +SKIP +----------+--------------+------------+ |slen(name)|to_upper(name)|add_one(age)| +----------+--------------+------------+ - |         8|      JOHN DOE|          22| + |         8|          JOHN|          22| --- End diff -- oops, done! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19977 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19977 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85235/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19977 **[Test build #85235 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85235/testReport)** for PR 19977 at commit [`fc14aeb`](https://github.com/apache/spark/commit/fc14aeb4e92e67aba1750fc1bc2b0fc9afaa5fac). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20035: [SPARK-22848][SQL] Eliminate mutable state from Stack
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20035 yea it's failing globally, I'm merging this PR, thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20043 **[Test build #85232 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85232/testReport)** for PR 20043 at commit [`d5c986a`](https://github.com/apache/spark/commit/d5c986a1cab410c4eb64a72119346875d7607be6). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class ExprCode(var code: String, var isNull: ExprValue, var value: ExprValue)` * `case class LiteralValue(val value: String) extends ExprValue ` * `case class VariableValue(val variableName: String) extends ExprValue ` * `case class StatementValue(val statement: String) extends ExprValue ` * `case class GlobalValue(val value: String) extends ExprValue ` * `case class SubExprEliminationState(isNull: ExprValue, value: ExprValue)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
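The class list in the test report above suggests that codegen outputs are being wrapped so the kind of Java reference (literal, local variable, inline statement, or global field) is carried in the type. A rough, simplified sketch of that idea, not the PR's actual code:

// Tags how a generated expression result is referenced in the emitted Java source.
trait ExprValue { def code: String }

case class LiteralValue(value: String) extends ExprValue { def code: String = value }
case class VariableValue(variableName: String) extends ExprValue { def code: String = variableName }
case class StatementValue(statement: String) extends ExprValue { def code: String = statement }
case class GlobalValue(value: String) extends ExprValue { def code: String = value }

// A codegen result: the emitted snippet plus typed references to its null flag and value.
case class ExprCode(var code: String, var isNull: ExprValue, var value: ExprValue)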
[GitHub] spark issue #20008: [SPARK-22822][TEST] Basic tests for WindowFrameCoercion ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20008 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20008: [SPARK-22822][TEST] Basic tests for WindowFrameCoercion ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20008 **[Test build #85233 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85233/testReport)** for PR 20008 at commit [`ec07bc2`](https://github.com/apache/spark/commit/ec07bc2a463b089dd5798ab9e6bf8aea1b8ccd28). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20008: [SPARK-22822][TEST] Basic tests for WindowFrameCoercion ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20008 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85233/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20043 **[Test build #85243 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85243/testReport)** for PR 20043 at commit [`81c9b6e`](https://github.com/apache/spark/commit/81c9b6e73ee64adcd8fc931d51f3faa98b892e0b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19884#discussion_r158211101 --- Diff: python/pyspark/sql/functions.py --- @@ -2141,22 +2141,23 @@ def pandas_udf(f=None, returnType=None, functionType=None): >>> from pyspark.sql.functions import pandas_udf, PandasUDFType >>> from pyspark.sql.types import IntegerType, StringType - >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) - >>> @pandas_udf(StringType()) + >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) # doctest: +SKIP + >>> @pandas_udf(StringType()) # doctest: +SKIP ... def to_upper(s): ... return s.str.upper() ... - >>> @pandas_udf("integer", PandasUDFType.SCALAR) + >>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP ... def add_one(x): ... return x + 1 ... - >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) + >>> df = spark.createDataFrame([(1, "John Doe", 21)], + ... ("id", "name", "age")) # doctest: +SKIP >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ ... .show() # doctest: +SKIP +----------+--------------+------------+ |slen(name)|to_upper(name)|add_one(age)| +----------+--------------+------------+ - |         8|      JOHN DOE|          22| + |         8|          JOHN|          22| --- End diff -- nit: we should revert this too --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20035: [SPARK-22848][SQL] Eliminate mutable state from Stack
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20035 **[Test build #85237 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85237/testReport)** for PR 20035 at commit [`f0163e7`](https://github.com/apache/spark/commit/f0163e7b68aa09fef5c1dc7f25e00170354a1ab2). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20036 **[Test build #85236 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85236/testReport)** for PR 20036 at commit [`53661eb`](https://github.com/apache/spark/commit/53661eb72bba55376bc6112b51c25489522d309c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20035: [SPARK-22848][SQL] Eliminate mutable state from Stack
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20035 **[Test build #85237 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85237/testReport)** for PR 20035 at commit [`f0163e7`](https://github.com/apache/spark/commit/f0163e7b68aa09fef5c1dc7f25e00170354a1ab2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20036 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19981: [SPARK-22786][SQL] only use AppStatusPlugin in history s...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19981 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158207539 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + // scalastyle:off line.size.limit + /** + * The decimal implementation is based on Hive's, which is itself inspired by SQL Server's. + * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the + * corresponding scale is reduced to prevent the integral part of a result from being truncated. + * + * For further reference, please see + * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/. --- End diff -- Not sure if this blog link will stay available for a long time. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158205829 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + // scalastyle:off line.size.limit + /** + * The decimal implementation is based on Hive's, which is itself inspired by SQL Server's. + * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the + * corresponding scale is reduced to prevent the integral part of a result from being truncated. + * + * For further reference, please see + * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/. + * + * @param precision + * @param scale + * @return + */ + // scalastyle:on line.size.limit + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { +// Assumptions: +// precision >= scale +// scale >= 0 +if (precision <= MAX_PRECISION) { + // Adjustment only needed when we exceed max precision + DecimalType(precision, scale) --- End diff -- Shouldn't we also prevent `scale` > `MAX_SCALE`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158205151 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala --- @@ -243,17 +248,43 @@ object DecimalPrecision extends TypeCoercionRule { // Promote integers inside a binary expression with fixed-precision decimals to decimals, // and fixed-precision decimals in an expression with floats / doubles to doubles case b @ BinaryOperator(left, right) if left.dataType != right.dataType => - (left.dataType, right.dataType) match { -case (t: IntegralType, DecimalType.Fixed(p, s)) => - b.makeCopy(Array(Cast(left, DecimalType.forType(t)), right)) -case (DecimalType.Fixed(p, s), t: IntegralType) => - b.makeCopy(Array(left, Cast(right, DecimalType.forType(t -case (t, DecimalType.Fixed(p, s)) if isFloat(t) => - b.makeCopy(Array(left, Cast(right, DoubleType))) -case (DecimalType.Fixed(p, s), t) if isFloat(t) => - b.makeCopy(Array(Cast(left, DoubleType), right)) -case _ => - b - } + nondecimalLiteralAndDecimal(b).lift((left, right)).getOrElse( +nondecimalNonliteralAndDecimal(b).applyOrElse((left.dataType, right.dataType), + (_: (DataType, DataType)) => b)) } + + /** + * Type coercion for BinaryOperator in which one side is a non-decimal literal numeric, and the + * other side is a decimal. + */ + private def nondecimalLiteralAndDecimal( --- End diff -- Is this rule newly introduced? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
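The forLiteral path in the quoted diff gives an integral literal the tightest decimal type that holds its value, rather than the wide per-type default (DecimalType(20, 0) for any Long, for instance). A standalone Scala sketch of the arithmetic, mirroring the quoted fromBigDecimal:

// Tightest (precision, scale) for a literal, following the quoted diff.
def fromBigDecimal(d: BigDecimal): (Int, Int) =
  (math.max(d.precision, d.scale), d.scale)

fromBigDecimal(BigDecimal(123L))   // (3, 0): the literal 123 fits in DecimalType(3, 0)
fromBigDecimal(BigDecimal(-1000L)) // (4, 0), versus DecimalType(20, 0) for a generic Long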
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158206693 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + // scalastyle:off line.size.limit + /** + * The decimal implementation is based on Hive's, which is itself inspired by SQL Server's. + * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the + * corresponding scale is reduced to prevent the integral part of a result from being truncated. + * + * For further reference, please see + * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/. + * + * @param precision + * @param scale + * @return + */ + // scalastyle:on line.size.limit + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { +// Assumptions: +// precision >= scale +// scale >= 0 +if (precision <= MAX_PRECISION) { + // Adjustment only needed when we exceed max precision + DecimalType(precision, scale) +} else { + // Precision/scale exceed maximum precision. Result must be adjusted to MAX_PRECISION. + val intDigits = precision - scale + // If the original scale is less than MINIMUM_ADJUSTED_SCALE, use the original scale; otherwise + // preserve at least MINIMUM_ADJUSTED_SCALE fractional digits + val minScaleValue = Math.min(scale, MINIMUM_ADJUSTED_SCALE) + val adjustedScale = Math.max(MAX_PRECISION - intDigits, minScaleValue) --- End diff -- Sounds like `Math.min`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
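On the Math.min question above: re-deriving the quoted logic standalone (assuming MAX_PRECISION = 38 and MINIMUM_ADJUSTED_SCALE = 6 from the diff) suggests that max is intentional. When the integer digits leave room beyond the minimum scale, all of that room is kept as scale; min would always clamp the scale down to the minimum.

val MAX_PRECISION = 38
val MINIMUM_ADJUSTED_SCALE = 6

// Standalone version of the quoted adjustment; returns (precision, scale).
def adjustPrecisionScale(precision: Int, scale: Int): (Int, Int) =
  if (precision <= MAX_PRECISION) {
    (precision, scale)
  } else {
    val intDigits = precision - scale
    val minScaleValue = math.min(scale, MINIMUM_ADJUSTED_SCALE)
    // max, not min: keep all the leftover room as scale, never dropping below minScaleValue
    val adjustedScale = math.max(MAX_PRECISION - intDigits, minScaleValue)
    (MAX_PRECISION, adjustedScale)
  }

adjustPrecisionScale(77, 20) // (38, 6): 38 - 57 = -19, so fall back to min(20, 6) = 6
adjustPrecisionScale(40, 30) // (38, 28): 38 - 10 = 28 is kept; Math.min would shrink it to 6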
[GitHub] spark issue #20035: [SPARK-22848][SQL] Eliminate mutable state from Stack
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20035 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85237/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org