[GitHub] spark issue #20403: [SPARK-23238][SQL] Externalize SQLConf spark.sql.executi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86742/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][SQL] Externalize SQLConf spark.sql.executi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Merged build finished. Test PASSed.
[GitHub] spark issue #20403: [SPARK-23238][SQL] Externalize SQLConf spark.sql.executi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20403 **[Test build #86742 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86742/testReport)** for PR 20403 at commit [`0c05526`](https://github.com/apache/spark/commit/0c0552625eecd984d268c8bed2903c87b5adce58). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20408: [SPARK-23189][Core][Web UI] Reflect stage level b...
Github user attilapiros commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20408#discussion_r164292079

    --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
    @@ -594,12 +606,24 @@ private[spark] class AppStatusListener(
           stage.executorSummaries.values.foreach(update(_, now))
           update(stage, now, last = true)
    +
    +      val executorIdsForStage = stage.executorSummaries.keySet
    +      executorIdsForStage.foreach { executorId =>
    +        liveExecutors.get(executorId).foreach { exec =>
    +          removeBlackListedStageFrom(exec, event.stageInfo.stageId, now)
    --- End diff --

    I guess the GitHub diff collapse tricked us here. This change belongs to the method onStageCompleted (and definitely not to onExecutorUnblacklisted). This is where I remove completed stages from the live executors.
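To make the intent of the quoted hunk easier to follow, here is a minimal sketch of the same cleanup step, written in Python rather than the PR's Scala. The `LiveExecutor` class, its `blacklisted_in_stages` field, and the `on_stage_completed` function are hypothetical stand-ins invented for illustration; they are not the actual AppStatusListener code.

```python
from dataclasses import dataclass, field


@dataclass
class LiveExecutor:
    """Hypothetical stand-in for a live executor tracked by the listener."""
    executor_id: str
    blacklisted_in_stages: set = field(default_factory=set)


def on_stage_completed(live_executors, stage_executor_ids, stage_id):
    """Drop a completed stage from every live executor that took part in it."""
    for executor_id in stage_executor_ids:
        executor = live_executors.get(executor_id)
        if executor is not None:  # the executor may no longer be live
            executor.blacklisted_in_stages.discard(stage_id)


live = {
    "1": LiveExecutor("1", {3, 7}),
    "2": LiveExecutor("2", {7}),
}
# Stage 7 finished; executor "gone" ran tasks for it but is no longer tracked.
on_stage_completed(live, ["1", "2", "gone"], stage_id=7)
```

After the call, stage 7 no longer appears on either executor, while the unrelated stage 3 on executor "1" is left alone.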
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Merged build finished. Test PASSed.
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86740/ Test PASSed.
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20146 **[Test build #86740 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86740/testReport)** for PR 20146 at commit [`b884fb5`](https://github.com/apache/spark/commit/b884fb5c0ce1e627390d08d8425721ea8e4d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20402 Merged build finished. Test FAILed.
[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20402 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86741/ Test FAILed.
[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20402 **[Test build #86741 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86741/testReport)** for PR 20402 at commit [`efe9eaf`](https://github.com/apache/spark/commit/efe9eaf775e325909cbb9639f64c9099b90b2f99). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20409 Thank you @gatorsmile and @viirya.
[GitHub] spark issue #20403: [SPARK-23238][SQL] Externalize SQLConf spark.sql.executi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20403 **[Test build #86742 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86742/testReport)** for PR 20403 at commit [`0c05526`](https://github.com/apache/spark/commit/0c0552625eecd984d268c8bed2903c87b5adce58).
[GitHub] spark issue #20403: [SPARK-23238][SQL] Externalize SQLConf spark.sql.executi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/313/ Test PASSed.
[GitHub] spark issue #20403: [SPARK-23238][SQL] Externalize SQLConf spark.sql.executi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Merged build finished. Test PASSed.
[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20068 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86739/ Test FAILed.
[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20068 Merged build finished. Test FAILed.
[GitHub] spark pull request #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.s...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20403#discussion_r164288478

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -1043,11 +1043,11 @@ object SQLConf {

       val ARROW_EXECUTION_ENABLE =
         buildConf("spark.sql.execution.arrow.enabled")
    -      .internal()
    -      .doc("Make use of Apache Arrow for columnar data transfers. Currently available " +
    -        "for use with pyspark.sql.DataFrame.toPandas with the following data types: " +
    -        "StringType, BinaryType, BooleanType, DoubleType, FloatType, ByteType, IntegerType, " +
    -        "LongType, ShortType")
    +      .doc("When true, make use of Apache Arrow for columnar data transfers. Currently available " +
    +        "for use with pyspark.sql.DataFrame.toPandas, and " +
    +        "pyspark.sql.SparkSession.createDataFrame when its input is a Pandas DataFrame. " +
    +        "The following data types are unsupported: " +
    +        "MapType, ArrayType of TimestampType, and nested StructType.")
           .booleanConf
           .createWithDefault(false)
    --- End diff --

    Yup. Let me
[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20068 **[Test build #86739 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86739/testReport)** for PR 20068 at commit [`156d755`](https://github.com/apache/spark/commit/156d755d5a734a00c4c69dfc3565364f3843fca1). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20402 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/312/ Test PASSed.
[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20402 Merged build finished. Test PASSed.
[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20402 **[Test build #86741 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86741/testReport)** for PR 20402 at commit [`efe9eaf`](https://github.com/apache/spark/commit/efe9eaf775e325909cbb9639f64c9099b90b2f99).
[GitHub] spark pull request #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.s...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20403#discussion_r164287467

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -1043,11 +1043,11 @@ object SQLConf {

       val ARROW_EXECUTION_ENABLE =
         buildConf("spark.sql.execution.arrow.enabled")
    -      .internal()
    -      .doc("Make use of Apache Arrow for columnar data transfers. Currently available " +
    -        "for use with pyspark.sql.DataFrame.toPandas with the following data types: " +
    -        "StringType, BinaryType, BooleanType, DoubleType, FloatType, ByteType, IntegerType, " +
    -        "LongType, ShortType")
    +      .doc("When true, make use of Apache Arrow for columnar data transfers. Currently available " +
    +        "for use with pyspark.sql.DataFrame.toPandas, and " +
    +        "pyspark.sql.SparkSession.createDataFrame when its input is a Pandas DataFrame. " +
    +        "The following data types are unsupported: " +
    +        "MapType, ArrayType of TimestampType, and nested StructType.")
           .booleanConf
           .createWithDefault(false)
    --- End diff --

    `spark.sql.execution.arrow.maxRecordsPerBatch` is also mentioned in the doc change at #19575. Shall we also externalize it?
[GitHub] spark issue #20383: [SPARK-23200] Reset Kubernetes-specific config on Checkp...
Github user ssaavedra commented on the issue: https://github.com/apache/spark/pull/20383 I can probably take a look at testing this over 2.3.0-rc2 on Monday. I did not test this on a clean 2.3.0-ish branch.
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20146 **[Test build #86740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86740/testReport)** for PR 20146 at commit [`b884fb5`](https://github.com/apache/spark/commit/b884fb5c0ce1e627390d08d8425721ea8e4d).
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/311/ Test PASSed.
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Merged build finished. Test PASSed.
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20146 retest this please.
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Merged build finished. Test PASSed.
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86737/ Test PASSed.
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Merged build finished. Test FAILed.
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86738/ Test FAILed.
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20403 **[Test build #86737 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86737/testReport)** for PR 20403 at commit [`1f4d288`](https://github.com/apache/spark/commit/1f4d2884ba5b56e06427ce3d91cb6ac5f8f2b7b6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20146 **[Test build #86738 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86738/testReport)** for PR 20146 at commit [`b884fb5`](https://github.com/apache/spark/commit/b884fb5c0ce1e627390d08d8425721ea8e4d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20417: [SPARK-23250][DOCS] Typo in JavaDoc/ScalaDoc for DataFra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20417 **[Test build #4081 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4081/testReport)** for PR 20417 at commit [`9ef6939`](https://github.com/apache/spark/commit/9ef6939a35981f70253501d19599d93207042370). * This patch **fails Scala style tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark issue #20417: [SPARK-23250][DOCS] Typo in JavaDoc/ScalaDoc for DataFra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20417 **[Test build #4081 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4081/testReport)** for PR 20417 at commit [`9ef6939`](https://github.com/apache/spark/commit/9ef6939a35981f70253501d19599d93207042370).
[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20068 **[Test build #86739 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86739/testReport)** for PR 20068 at commit [`156d755`](https://github.com/apache/spark/commit/156d755d5a734a00c4c69dfc3565364f3843fca1).
[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20068 ping @aa8y
[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20068 ok to test
[GitHub] spark issue #20415: [SPARK-23247][SQL]combines Unsafe operations and statist...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/20415 @heary-cao have you benchmarked this? I am asking because Spark SQL chains iterators; these are pipelined and only materialized when needed. Your PR effectively removes two virtual calls (hasNext/next) per tuple, so I don't see much benefit here.
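The point about chained iterators can be illustrated with a small sketch. It uses Python generators as a stand-in for Spark SQL's iterator chaining; the operator names (`add_one`, `double`, `add_one_then_double`) and the `trace` list are hypothetical and only show that a chained pipeline is lazy and processes one row at a time, so fusing two steps saves a per-row iterator hop rather than an intermediate materialization.

```python
trace = []  # records the order in which the operators touch rows


def add_one(rows):
    """First operator in the chain: pulls a row only when asked for one."""
    for row in rows:
        trace.append(f"add_one({row})")
        yield row + 1


def double(rows):
    """Second operator: consumes the first operator's output row by row."""
    for row in rows:
        trace.append(f"double({row})")
        yield row * 2


def add_one_then_double(rows):
    """Fused variant: the same work with one iterator layer fewer."""
    for row in rows:
        yield (row + 1) * 2


chained = double(add_one(iter([1, 2, 3])))
trace_before_pull = list(trace)   # still empty: nothing has run yet
first = next(chained)             # pulls exactly one row through both operators
trace_after_first = list(trace)

chained_result = [first] + list(chained)
fused_result = list(add_one_then_double([1, 2, 3]))
```

Both variants produce the same rows and neither builds an intermediate collection; the only difference is the extra iterator layer in the chained version, which is why a benchmark is needed to tell whether removing it matters.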
[GitHub] spark issue #20068: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20068 Can one of the admins verify this patch?
[GitHub] spark issue #20372: [SPARK-23249] Improved block merging logic for partition...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20372 @cloud-fan @gatorsmile
[GitHub] spark pull request #20416: [SPARK-23248][PYTHON][EXAMPLES] Relocate module d...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20416
[GitHub] spark issue #20416: [SPARK-23248][PYTHON][EXAMPLES] Relocate module docstrin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20416 Merged to master and branch-2.3. Thank you @srowen, @viirya and @felixcheung.
[GitHub] spark issue #20417: [SPARK-23250][DOCS] Typo in JavaDoc/ScalaDoc for DataFra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20417 Can one of the admins verify this patch?
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19575#discussion_r164286075

    --- Diff: docs/sql-programming-guide.md ---
    @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a

     You may run `./bin/spark-sql --help` for a complete list of all available options.

    +# PySpark Usage Guide for Pandas with Apache Arrow
    +
    +## Apache Arrow in Spark
    +
    +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
    +data between JVM and Python processes. This currently is most beneficial to Python users that
    +work with Pandas/NumPy data. Its usage is not automatic and might require some minor
    +changes to configuration or code to take full advantage and ensure compatibility. This guide will
    +give a high-level description of how to use Arrow in Spark and highlight any differences when
    +working with Arrow-enabled data.
    +
    +### Ensure PyArrow Installed
    +
    +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
    +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
    +is installed and available on all cluster nodes. The current supported version is 0.8.0.
    +You can install using pip or conda from the conda-forge channel. See PyArrow
    +[installation](https://arrow.apache.org/docs/python/install.html) for details.
    +
    +## Enabling for Conversion to/from Pandas
    +
    +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
    +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
    +To use Arrow when executing these calls, users need to first set the Spark configuration
    +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.
    +
    +
    +
    +{% include_example dataframe_with_arrow python/sql/arrow.py %}
    +
    +
    +
    +Using the above optimizations with Arrow will produce the same results as when Arrow is not
    +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the
    +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark
    +data types are currently supported and an error can be raised if a column has an unsupported type,
    +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`,
    +Spark will fall back to create the DataFrame without Arrow.
    +
    +## Pandas UDFs (a.k.a. Vectorized UDFs)
    +
    +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and
    +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator
    +or to wrap the function, no additional configuration is required. Currently, there are two types of
    +Pandas UDF: Scalar and Group Map.
    +
    +### Scalar
    --- End diff --

    `Scalar Vectorized UDFs`?
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19575#discussion_r164286074

    --- Diff: docs/sql-programming-guide.md ---
    @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a

     You may run `./bin/spark-sql --help` for a complete list of all available options.

    +# PySpark Usage Guide for Pandas with Apache Arrow
    +
    +## Apache Arrow in Spark
    +
    +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
    +data between JVM and Python processes. This currently is most beneficial to Python users that
    +work with Pandas/NumPy data. Its usage is not automatic and might require some minor
    +changes to configuration or code to take full advantage and ensure compatibility. This guide will
    +give a high-level description of how to use Arrow in Spark and highlight any differences when
    +working with Arrow-enabled data.
    +
    +### Ensure PyArrow Installed
    +
    +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
    +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
    +is installed and available on all cluster nodes. The current supported version is 0.8.0.
    +You can install using pip or conda from the conda-forge channel. See PyArrow
    +[installation](https://arrow.apache.org/docs/python/install.html) for details.
    +
    +## Enabling for Conversion to/from Pandas
    +
    +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
    +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
    +To use Arrow when executing these calls, users need to first set the Spark configuration
    +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.
    +
    +
    +
    +{% include_example dataframe_with_arrow python/sql/arrow.py %}
    +
    +
    +
    +Using the above optimizations with Arrow will produce the same results as when Arrow is not
    +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the
    +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark
    +data types are currently supported and an error can be raised if a column has an unsupported type,
    +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`,
    +Spark will fall back to create the DataFrame without Arrow.
    +
    +## Pandas UDFs (a.k.a. Vectorized UDFs)
    +
    +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and
    +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator
    +or to wrap the function, no additional configuration is required. Currently, there are two types of
    +Pandas UDF: Scalar and Group Map.
    +
    +### Scalar
    +
    +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such
    +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return
    +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting
    +columns into batches and calling the function for each batch as a subset of the data, then
    +concatenating the results together.
    +
    +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns.
    +
    +
    +
    +{% include_example scalar_pandas_udf python/sql/arrow.py %}
    +
    +
    +
    +### Group Map
    +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern.
    --- End diff --

    `Grouped Vectorized UDFs`?
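The batching behaviour described in the quoted Scalar section (split the columns into batches, call the function once per batch, concatenate the results) can be sketched without Spark. The following is a plain-Python simulation under stated assumptions: lists stand in for `pandas.Series`, and `apply_in_batches` and `multiply` are hypothetical helpers written for illustration, not part of the PySpark API or of the referenced `arrow.py` example.

```python
def multiply(a, b):
    """Element-wise product of two equally long batches (lists stand in for pandas.Series)."""
    return [x * y for x, y in zip(a, b)]


def apply_in_batches(func, columns, batch_size):
    """Split the columns into batches, call func on each batch, concatenate the results."""
    row_count = len(columns[0])
    output = []
    for start in range(0, row_count, batch_size):
        batch = [column[start:start + batch_size] for column in columns]
        result = func(*batch)
        # A scalar UDF must return exactly one value per input row.
        if len(result) != len(batch[0]):
            raise ValueError("function must return as many values as it received")
        output.extend(result)
    return output


xs = [1, 2, 3, 4, 5]
ys = [10, 20, 30, 40, 50]
# With batch_size=2 the function is called three times, on 2, 2 and 1 rows.
products = apply_in_batches(multiply, [xs, ys], batch_size=2)
```

Because the function only ever sees one batch at a time, it has to be a per-row (scalar) operation: its result for a row must not depend on rows that may land in a different batch.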
[GitHub] spark pull request #20417: [SPARK-23250][DOCS] Typo in JavaDoc/ScalaDoc for ...
GitHub user CCInCharge opened a pull request:

    https://github.com/apache/spark/pull/20417

    [SPARK-23250][DOCS] Typo in JavaDoc/ScalaDoc for DataFrameWriter

    ## What changes were proposed in this pull request?

    Fix typo in ScalaDoc for DataFrameWriter: it originally stated "This is applicable for all file-based data sources (e.g. Parquet, JSON) staring Spark 2.1.0"; it should be "starting with Spark 2.1.0".

    ## How was this patch tested?

    Check of correct spelling in ScalaDoc

    Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CCInCharge/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20417.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20417

commit 9ef6939a35981f70253501d19599d93207042370
Author: CCInCharge
Date:   2018-01-28T01:21:07Z

    Fix typo in ScalaDoc for DataFrameWriter
[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20414 Just for context, I'm seeing RDD.repartition being used *a lot*.
[GitHub] spark issue #20406: [SPARK-23230][SQL]Error by creating a data table when us...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20406 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86736/ Test PASSed.
[GitHub] spark issue #20406: [SPARK-23230][SQL]Error by creating a data table when us...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20406 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20406: [SPARK-23230][SQL]Error by creating a data table when us...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20406 **[Test build #86736 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86736/testReport)** for PR 20406 at commit [`f370dd6`](https://github.com/apache/spark/commit/f370dd6217cf8a590ef52ecc970e4dc33c235631). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20403 **[Test build #86737 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86737/testReport)** for PR 20403 at commit [`1f4d288`](https://github.com/apache/spark/commit/1f4d2884ba5b56e06427ce3d91cb6ac5f8f2b7b6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/309/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20146 **[Test build #86738 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86738/testReport)** for PR 20146 at commit [`b884fb5`](https://github.com/apache/spark/commit/b884fb5c0ce1e627390d08d8425721ea8e4d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/310/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20403 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20146 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20414 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86728/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20414 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20414 **[Test build #86728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86728/testReport)** for PR 20414 at commit [`6910ed6`](https://github.com/apache/spark/commit/6910ed62c272bedfa251cab589bb52bed36be3ed). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86734/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20146 **[Test build #86734 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86734/testReport)** for PR 20146 at commit [`b884fb5`](https://github.com/apache/spark/commit/b884fb5c0ce1e627390d08d8425721ea8e4d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20369 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20369 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86735/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20369 **[Test build #86735 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86735/testReport)** for PR 20369 at commit [`d311d56`](https://github.com/apache/spark/commit/d311d5639b3af9123e0c6dbe38468f0172e06712). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20369 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20369 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86733/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20369 **[Test build #86733 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86733/testReport)** for PR 20369 at commit [`d311d56`](https://github.com/apache/spark/commit/d311d5639b3af9123e0c6dbe38468f0172e06712). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86730/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20403 **[Test build #86730 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86730/testReport)** for PR 20403 at commit [`1f4d288`](https://github.com/apache/spark/commit/1f4d2884ba5b56e06427ce3d91cb6ac5f8f2b7b6). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20375: [SPARK-23199][SQL]improved Removes repetition from group...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20375 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86732/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20375: [SPARK-23199][SQL]improved Removes repetition from group...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20375 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20375: [SPARK-23199][SQL]improved Removes repetition from group...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20375 **[Test build #86732 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86732/testReport)** for PR 20375 at commit [`caf581f`](https://github.com/apache/spark/commit/caf581f7f171912af4cebbc3a96887c7bb4a87e5). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20385: [SPARK-21396][SQL] Fixes MatchError when UDTs are passed...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20385 @atallahhezbor Yeah! Please help us improve the test coverage. We do not have a clear way to test the functionality in `SparkExecuteStatementOperation`. Adding unit test cases for `HiveUtils.toHiveString` is enough if we move the code changes to `HiveUtils.toHiveString`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20345 Also cc @wzhfy. Do you have the bandwidth to review PRs? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20343: [SPARK-23167][SQL] Add TPCDS queries v2.7 in TPCD...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20343#discussion_r164279795 --- Diff: sql/core/src/test/resources/tpcds-v2.7.0/q11.sql --- @@ -0,0 +1,78 @@ +with year_total as ( + select c_customer_id customer_id + ,c_first_name customer_first_name + ,c_last_name customer_last_name + ,c_preferred_cust_flag customer_preferred_cust_flag + ,c_birth_country customer_birth_country + ,c_login customer_login + ,c_email_address customer_email_address + ,d_year dyear + ,sum(ss_ext_list_price-ss_ext_discount_amt) year_total + ,'s' sale_type + from customer + ,store_sales + ,date_dim + where c_customer_sk = ss_customer_sk + and ss_sold_date_sk = d_date_sk + group by c_customer_id + ,c_first_name + ,c_last_name + ,c_preferred_cust_flag + ,c_birth_country + ,c_login + ,c_email_address + ,d_year + union all + select c_customer_id customer_id + ,c_first_name customer_first_name + ,c_last_name customer_last_name + ,c_preferred_cust_flag customer_preferred_cust_flag + ,c_birth_country customer_birth_country + ,c_login customer_login + ,c_email_address customer_email_address + ,d_year dyear + ,sum(ws_ext_list_price-ws_ext_discount_amt) year_total + ,'w' sale_type + from customer + ,web_sales + ,date_dim + where c_customer_sk = ws_bill_customer_sk + and ws_sold_date_sk = d_date_sk + group by c_customer_id + ,c_first_name + ,c_last_name + ,c_preferred_cust_flag + ,c_birth_country + ,c_login + ,c_email_address + ,d_year + ) + select + t_s_secyear.customer_id + ,t_s_secyear.customer_first_name + ,t_s_secyear.customer_last_name + ,t_s_secyear.customer_email_address --- End diff -- Regarding the keyword capitalization rule: it is just for readability. We do not enforce it, but it is preferred. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20370: Changing JDBC relation to better process quotes
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20370 ping @conorbmurphy --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20375: [SPARK-23199][SQL]improved Removes repetition from group...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20375 LGTM pending Jenkins --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20406: [SPARK-23230][SQL]Error by creating a data table when us...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20406 **[Test build #86736 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86736/testReport)** for PR 20406 at commit [`f370dd6`](https://github.com/apache/spark/commit/f370dd6217cf8a590ef52ecc970e4dc33c235631). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20406: [SPARK-23230][SQL]Error by creating a data table when us...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20406 Also cc @dongjoon-hyun --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20406: [SPARK-23230][SQL]Error by creating a data table when us...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20406 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20396 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20396 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86731/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20396 **[Test build #86731 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86731/testReport)** for PR 20396 at commit [`8a68f75`](https://github.com/apache/spark/commit/8a68f758a7a41f6c2a9a58f54a982745665be6a6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20409: [SPARK-23233][PYTHON] Reset the cache in asNondet...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20409 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20409 Thanks! Merged to master/2.3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20409 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20409 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86729/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20409 **[Test build #86729 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86729/testReport)** for PR 20409 at commit [`b23ff02`](https://github.com/apache/spark/commit/b23ff02f543ecc92db574b808ea00f9ff7d236f8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20409 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20393 @sameeragarwal I am not sure we can make shuffle fetch deterministic without quite a lot of perf overhead. Do you have any thoughts on how to do this, in case I am missing something here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20414 In addition, any use of random numbers in Spark code will be affected by this unless the input is an idempotent source, even if the random generator is seeded predictably with the partition index (which we were doing here anyway). We might want to look at mllib and other places as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
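A minimal sketch of the point in the comment above, in plain Python rather than Spark code. It assumes a simplified round-robin repartition whose starting position comes from a generator seeded with the partition index; the function name `round_robin_assign` and the sample records are hypothetical and are not taken from the Spark source. The seed is predictable, yet the record-to-partition mapping still changes when a retried task sees its input in a different order.

```python
import random


def round_robin_assign(partition_index, records, num_out):
    """Map each record to an output partition, round-robin.

    The starting position comes from a PRNG seeded with the partition
    index, so the seed is the same on every attempt of the task.
    (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(partition_index)
    position = rng.randrange(num_out)
    assignment = {}
    for record in records:
        position += 1
        assignment[record] = position % num_out
    return assignment


# First attempt of the task for input partition 0.
first_attempt = round_robin_assign(0, ["a", "b", "c", "d"], 3)

# A retry that receives the same records in the same order.
same_order_retry = round_robin_assign(0, ["a", "b", "c", "d"], 3)

# A retry whose input arrives in a different order, as can happen when
# the upstream shuffle fetch is not deterministic.
reordered_retry = round_robin_assign(0, ["b", "a", "d", "c"], 3)

print(first_attempt == same_order_retry)  # same order: same mapping
print(first_attempt == reordered_retry)   # different order: mapping changes
```

With the same seed and the same input order the mapping is reproducible; only the reordered input changes it, which is why a predictable seed alone does not help when the source is not idempotent.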
[GitHub] spark issue #20383: [SPARK-23200] Reset Kubernetes-specific config on Checkp...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20383 So you have tested this on the latest Spark 2.3.0 bits? Tests aside, do people think it is useful to include this fix in the 2.3.0 release? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20416: [SPARK-23248][PYTHON][EXAMPLES] Relocate module docstrin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20416 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20416: [SPARK-23248][PYTHON][EXAMPLES] Relocate module docstrin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20416 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86727/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org