[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20372 please see https://spark.apache.org/contributing.html. Could you open a JIRA and update this PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20404#discussion_r164266681 --- Diff: python/pyspark/sql/session.py --- @@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None): if SparkSession._instantiatedSession is None \ or SparkSession._instantiatedSession._sc._jsc is None: SparkSession._instantiatedSession = self + self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession) --- End diff -- I think this is somewhat conflicting... If I call SparkSession.getOrCreate in Python and the JVM already has a defaultSession, what should happen? I suppose one approach would be to keep Python independent, but with this change it would overwrite a defaultSession that might still be valid? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
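For illustration, a minimal sketch of the kind of guard this comment is hinting at (purely hypothetical, not what the PR does; it assumes the JVM-side `getDefaultSession` method is reachable through py4j):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm_session_obj = spark._jvm.org.apache.spark.sql.SparkSession

# Hypothetical guard: only install the Python-created session as the JVM
# default when the JVM does not already have one, so an existing, still-valid
# default session is not silently overwritten.
if jvm_session_obj.getDefaultSession().isEmpty():
    jvm_session_obj.setDefaultSession(spark._jsparkSession)
```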
[GitHub] spark issue #20383: [SPARK-23200] Reset Kubernetes-specific config on Checkp...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20383 without this fix a streaming app can't properly recover from a checkpoint, correct? that seems fairly important to me. was apache-spark-on-k8s/spark-integration run before this PR was opened? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20403 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20409 About the test case, can't we just use the reproducer in the PR description to check whether we change the deterministic status of the UDF? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20409 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20413: [SPARK-23245][SS][TESTS] Don't access `lastExecut...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20413 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20413: [SPARK-23245][SS][TESTS] Don't access `lastExecution.exe...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20413 Thanks! Merging to master and 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20415: [SPARK-23247][SQL]combines Unsafe operations and ...
GitHub user heary-cao opened a pull request: https://github.com/apache/spark/pull/20415 [SPARK-23247][SQL] combines Unsafe operations and statistics operations in Scan Data Source

## What changes were proposed in this pull request?

Currently, in the execution plan of a data source scan, we first apply the unsafe projection to each row of data and then traverse the data again to count the rows. The second pass is unnecessary from a performance standpoint. This PR combines the two operations and updates the row-count statistics while performing the unsafe projection.

Before the change:

```
val unsafeRow = rdd.mapPartitionsWithIndexInternal { (index, iter) =>
  val proj = UnsafeProjection.create(schema)
  proj.initialize(index)
  iter.map(proj)
}
val numOutputRows = longMetric("numOutputRows")
unsafeRow.map { r =>
  numOutputRows += 1
  r
}
```

After the change:

```
val numOutputRows = longMetric("numOutputRows")
rdd.mapPartitionsWithIndexInternal { (index, iter) =>
  val proj = UnsafeProjection.create(schema)
  proj.initialize(index)
  iter.map( r => {
    numOutputRows += 1
    proj(r)
  })
}
```

## How was this patch tested?

The existing test cases.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/heary-cao/spark DataSourceScanExec Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20415.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20415

commit e3e09d98072bd39328a4e7d4de1ddd38594c6232 Author: caoxuewen Date: 2018-01-27T06:27:37Z combines Unsafe operations and statistics operations in Scan Data Source

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20413: [SPARK-23245][SS][TESTS] Don't access `lastExecution.exe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20413 **[Test build #4079 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4079/testReport)** for PR 20413 at commit [`3ce18c4`](https://github.com/apache/spark/commit/3ce18c47fb50f6c1f12db1fb34a6196e5040694d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19285 thanks all. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164261850 --- Diff: python/pyspark/sql/window.py --- @@ -124,16 +124,19 @@ def rangeBetween(start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). """ -if start <= Window._PRECEDING_THRESHOLD: -start = Window.unboundedPreceding -if end >= Window._FOLLOWING_THRESHOLD: -end = Window.unboundedFollowing +if isinstance(start, int) and isinstance(end, int): +if start <= Window._PRECEDING_THRESHOLD: +start = Window.unboundedPreceding --- End diff --

@jiangxb1987 Do you mean to change it to

```
if isinstance(start, int) and isinstance(end, int):
    if start == Window._PRECEDING_THRESHOLD:  # Window._PRECEDING_THRESHOLD == Long.MinValue
        start = Window.unboundedPreceding
    if end == Window._FOLLOWING_THRESHOLD:  # Window._FOLLOWING_THRESHOLD == Long.MaxValue
        end = Window.unboundedFollowing
```

I ran the Python tests, and tests.py failed at

```
with patch("sys.maxsize", 2 ** 127 - 1):
    importlib.reload(window)
    self.assertTrue(rows_frame_match())
    self.assertTrue(range_frame_match())
```

So I guess I will keep ```if start <= Window._PRECEDING_THRESHOLD``` and ```if end >= Window._FOLLOWING_THRESHOLD```.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20401: [MINOR][SS][DOC] Fix `Trigger` Scala/Java doc examples
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20401 Thank you for merging this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164261725 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + --- End diff -- We need to warn users that, the Group Map Pandas UDF requires to load all the data of a group into memory, which is not controlled by `spark.sql.execution.arrow.maxRecordsPerBatch`, and may OOM if the data is skewed and some partitions have a lot of records. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
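As a concrete reference for the group map type discussed above (the actual example lives in `python/sql/arrow.py` and is pulled in by the `include_example` tag, so it does not appear in this diff), a rough sketch of the subtract-the-mean group map Pandas UDF described in the doc is shown below; the schema string and data are illustrative, and released Spark versions name the function type `PandasUDFType.GROUPED_MAP`:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding ALL rows of one group in memory,
    # which is why a heavily skewed group can cause an OOM, as noted above.
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```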
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164261678 --- Diff: python/pyspark/sql/window.py --- @@ -124,16 +124,19 @@ def rangeBetween(start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). """ -if start <= Window._PRECEDING_THRESHOLD: -start = Window.unboundedPreceding -if end >= Window._FOLLOWING_THRESHOLD: -end = Window.unboundedFollowing +if isinstance(start, int) and isinstance(end, int): --- End diff --

Yup, but I think we can still get a `long` type value:

```
>>> long(1)
1L
>>> isinstance(long(1), int)
False
```

You can simply do something like `isinstance(long(1), (int, long))` with

```
if sys.version >= '3':
    long = int
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164261639 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group. + + + +{% include_example group_map_pandas_udf python/sql/arrow.py %} + + + +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply). + +## Usage Notes + +###
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164261413 --- Diff: python/pyspark/sql/window.py --- @@ -124,16 +124,19 @@ def rangeBetween(start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). """ -if start <= Window._PRECEDING_THRESHOLD: -start = Window.unboundedPreceding -if end >= Window._FOLLOWING_THRESHOLD: -end = Window.unboundedFollowing +if isinstance(start, int) and isinstance(end, int): --- End diff --

@HyukjinKwon @jiangxb1987 Thank you very much for your comments. It seems to me that int and long are "unified" in Python 2. I tried the following:

```
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> value = 9223372036854775807
>>> isinstance(value, int)
True
```

It seems to me that we don't have to handle long for Python 2. I guess I will keep ```if isinstance(start, int) and isinstance(end, int)```?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20409 test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20409 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20403 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164260764 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group. + + + +{% include_example group_map_pandas_udf python/sql/arrow.py %} + + + +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply). + +## Usage Notes + +###
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20369 LGTM, pending jenkins --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164260619 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. --- End diff -- Which name? If you mean "split-apply-combine", I think it's fine - https://pandas.pydata.org/pandas-docs/stable/groupby.html --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
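For reference, the split-apply-combine pattern described in the pandas documentation linked above can be sketched in plain pandas (no Spark involved; the tiny DataFrame and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 2.0, 3.0, 5.0]})

def subtract_mean(pdf):
    # "apply": operate on the pandas.DataFrame of one group at a time
    return pdf.assign(v=pdf.v - pdf.v.mean())

# "split" on id, "apply" the function per group, "combine" the results
result = df.groupby("id", group_keys=False).apply(subtract_mean)
print(result)
```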
[GitHub] spark pull request #20367: [SPARK-23166][ML] Add maxDF Parameter to CountVec...
Github user ymazari commented on a diff in the pull request: https://github.com/apache/spark/pull/20367#discussion_r164260027 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -160,6 +187,11 @@ class CountVectorizer @Since("1.5.0") (@Since("1.5.0") override val uid: String) } else { $(minDF) * input.cache().count() } +val maxDf = if ($(maxDF) >= 1.0) { + $(maxDF) +} else { + $(maxDF) * input.cache().count() +} --- End diff --

Good points.
- I added a check that maxDF >= minDF.
- Changed the code so that counting (and caching) is done only once.
- Refactored the code so that "filter()" is only invoked if minDF or maxDF is set.
- Added un-persisting of the input after the counting is done.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20369 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20407 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20407 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86722/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20407 **[Test build #86722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86722/testReport)** for PR 20407 at commit [`074c342`](https://github.com/apache/spark/commit/074c34245d300901390d2d5ed74bb69e32539b8a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86725/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #86725 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86725/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user vgankidi commented on the issue: https://github.com/apache/spark/pull/20372 I agree with @ash211. Applications shouldn't rely on the order of the files within a partition. This optimization looks good to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20413: [SPARK-23245][SS][TESTS] Don't access `lastExecution.exe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20413 **[Test build #4079 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4079/testReport)** for PR 20413 at commit [`3ce18c4`](https://github.com/apache/spark/commit/3ce18c47fb50f6c1f12db1fb34a6196e5040694d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20401: [MINOR][SS][DOC] Fix `Trigger` Scala/Java doc exa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20401 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20401: [MINOR][SS][DOC] Fix `Trigger` Scala/Java doc examples
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20401 Merged to master/2.3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20413: [SPARK-23245][SS][TESTS] Don't access `lastExecution.exe...
Github user jose-torres commented on the issue: https://github.com/apache/spark/pull/20413 +1. I originally wrote this line, and I'm reasonably confident that (as indicated by the comment) I didn't intend to check anything other than the nullity of lastExecution. I continue to be confused about why I wrote such a strange null check. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20394: [SPARK-23214][SQL] cached data should not carry e...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20394 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20413: [SPARK-23245][SS][TESTS] Don't access `lastExecution.exe...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20413 LGTM. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20394: [SPARK-23214][SQL] cached data should not carry extra hi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20394 LGTM Thanks! Merged to master/2.3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20394: [SPARK-23214][SQL] cached data should not carry e...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20394#discussion_r164254321 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala --- @@ -73,11 +73,16 @@ case class InMemoryRelation( @transient val partitionStatistics = new PartitionStatistics(output) override def computeStats(): Statistics = { -if (batchStats.value == 0L) { - // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache - statsOfPlanToCache +if (sizeInBytesStats.value == 0L) { + // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache. + // Note that we should drop the hint info here. We may cache a plan whose root node is a hint + // node. When we lookup the cache with a semantically same plan without hint info, the plan + // returned by cache lookup should not have hint info. If we lookup the cache with a + // semantically same plan with a different hint info, `CacheManager.useCachedData` will take + // care of it and retain the hint info in the lookup input plan. + statsOfPlanToCache.copy(hints = HintInfo()) --- End diff -- This is a new behavior we introduced in 2.3. I will first keep the behavior unchanged and merge it to 2.3. We can have more discussion in the next release. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD ...
GitHub user jiangxb1987 opened a pull request: https://github.com/apache/spark/pull/20414 [SPARK-23243][SQL] Shuffle+Repartition on an RDD could lead to incorrect answers

## What changes were proposed in this pull request?

RDD repartition also uses the round-robin way to distribute data, so it can produce incorrect answers on RDD workloads in a similar way to #20393. However, the approach that fixes DataFrame.repartition() does not apply to the RDD repartition issue, because the input data can be non-comparable, as discussed in https://github.com/apache/spark/pull/20393#issuecomment-360912451. Here, I propose a quick fix that distributes elements using their hashes. This will cause a perf regression if you have highly skewed input data, but it will ensure result correctness.

## How was this patch tested?

Added a test case in `RDDSuite` to ensure `RDD.repartition()` generates consistent answers.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/jiangxb1987/spark rdd-repartition Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20414.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20414

commit 6910ed62c272bedfa251cab589bb52bed36be3ed Author: Xingbo Jiang Date: 2018-01-27T00:34:24Z fix RDD.repartition()

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
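The actual patch is in Scala inside `RDD.repartition()`, but the hash-based distribution idea can be sketched roughly in PySpark as follows; `num_partitions` and the use of the built-in `hash` are illustrative assumptions, not the patch itself:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
num_partitions = 4
rdd = sc.parallelize(range(100))

# Key each element by its hash, partition by that key, then drop the key.
# Unlike round-robin assignment, the target partition of an element does not
# depend on iteration order, so a recomputed (and possibly reordered) input
# still sends every element to the same partition -- at the cost of possible
# skew when many elements share a hash value.
repartitioned = (rdd.map(lambda x: (hash(x), x))
                    .partitionBy(num_partitions)
                    .values())
print(repartitioned.glom().collect())
```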
[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20404#discussion_r164253931 --- Diff: python/pyspark/sql/session.py --- @@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None): if SparkSession._instantiatedSession is None \ or SparkSession._instantiatedSession._sc._jsc is None: SparkSession._instantiatedSession = self + self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession) --- End diff -- Yeah, I think we can clear it in `def stop`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/20393 I'm fine with merging this -- I just don't want this issue to be forgotten for RDDs, as I think it's a major correctness issue. @mridulm @sameeragarwal Let's continue the discussion on the new JIRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20413: [SC-9624][SS][TESTS] Don't access `lastExecution....
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/20413 [SC-9624][SS][TESTS] Don't access `lastExecution.executedPlan` in StreamTest

## What changes were proposed in this pull request?

`lastExecution.executedPlan` is a lazy val, so accessing it in StreamTest may need to acquire the lock of `lastExecution`. The test may wait forever when the streaming thread is holding that lock while running a continuous Spark job. This PR changes the check to test whether `s.lastExecution` is null, to avoid accessing `lastExecution.executedPlan`.

## How was this patch tested?

Jenkins

You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-23245 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20413.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20413

commit 3ce18c47fb50f6c1f12db1fb34a6196e5040694d Author: Jose Torres Date: 2018-01-26T22:57:18Z better check

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164252270 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group. + + + +{% include_example group_map_pandas_udf python/sql/arrow.py %} + + + +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply). + +## Usage Notes + +###
[GitHub] spark issue #20413: [SC-9624][SS][TESTS] Don't access `lastExecution.execute...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20413 cc @jose-torres @tdas --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164252424 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group. + + + +{% include_example group_map_pandas_udf python/sql/arrow.py %} + + + +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply). + +## Usage Notes + +###
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164251821 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group. + + + +{% include_example group_map_pandas_udf python/sql/arrow.py %} + + + +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply). + +## Usage Notes + +###
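The quoted guide refers to the `dataframe_with_arrow` example in `python/sql/arrow.py`. As a rough, self-contained sketch of that conversion path (assuming an active `SparkSession` bound to `spark`; column names are illustrative and not taken from the PR):

```python
import numpy as np
import pandas as pd

# Opt in to Arrow for toPandas() and createDataFrame(pandas_df); off by default.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Pandas -> Spark: createDataFrame(pandas_df) uses Arrow when enabled.
pdf = pd.DataFrame(np.random.rand(100, 3), columns=["a", "b", "c"])
df = spark.createDataFrame(pdf)

# Spark -> Pandas: toPandas() still collects everything to the driver,
# so only do this on a small subset of the data.
result_pdf = df.select("a", "b").toPandas()
```

If a column has a type Arrow does not support, `createDataFrame()` falls back to the non-Arrow path, as the guide notes.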
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164251759 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group. + + + +{% include_example group_map_pandas_udf python/sql/arrow.py %} + + + +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply). + +## Usage Notes + +###
[GitHub] spark pull request #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaS...
Github user zsxwing closed the pull request at: https://github.com/apache/spark/pull/20412 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20412 Thanks! Merging to master and 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20412 **[Test build #4078 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4078/testReport)** for PR 20412 at commit [`bd124b3`](https://github.com/apache/spark/commit/bd124b3c23a60200c83c25b6c786795a188bb9aa). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `abstract class KafkaSourceSuiteBase extends KafkaSourceTest ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20405: [SPARK-23229][SQL] Dataset.hint should use planWi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20405#discussion_r164250206 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -1216,7 +1216,7 @@ class Dataset[T] private[sql]( */ @scala.annotation.varargs def hint(name: String, parameters: Any*): Dataset[T] = withTypedPlan { -UnresolvedHint(name, parameters, logicalPlan) +UnresolvedHint(name, parameters, planWithBarrier) --- End diff -- I think @viirya has a valid concern. think about ``` val df1 = spark.table("t").select("id") df1.hint("broadcast", "id") ``` We should transform down the plan of `df1`, find the bottom table relation and apply the hint. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164250138 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. --- End diff -- @rxin WDYT about this name? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
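As a hedged, self-contained sketch of the two UDF flavours the quoted text describes (assuming an active `SparkSession` bound to `spark`; the column names are made up, and the `GROUP_MAP` enum value simply follows the "Group Map" wording quoted above, so the name in the released API may differ):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Scalar: pandas.Series in, pandas.Series of the same length out.
@pandas_udf("double", PandasUDFType.SCALAR)
def multiply(a, b):
    return a * b

df = spark.createDataFrame([(1.0, 4.0), (2.0, 5.0)], ("x", "y"))
df.select(multiply("x", "y")).show()

# Group map: each group arrives as a pandas.DataFrame, and a pandas.DataFrame
# matching the declared schema is returned.
@pandas_udf("id long, v double", PandasUDFType.GROUP_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df2 = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
df2.groupby("id").apply(subtract_mean).show()
```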
[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/20395#discussion_r164249345 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java --- @@ -450,13 +439,11 @@ final boolean isNullAt(int rowId) { } @Override -final int getArrayLength(int rowId) { - return accessor.getInnerValueCountAt(rowId); -} - -@Override -final int getArrayOffset(int rowId) { - return accessor.getOffsetBuffer().getInt(rowId * accessor.OFFSET_WIDTH); +final ColumnarArray getArray(int rowId) { + int index = rowId * accessor.OFFSET_WIDTH; + int start = offsets.getInt(index); + int end = offsets.getInt(index + accessor.OFFSET_WIDTH); --- End diff -- yes, the code is correct. `offsetBuffer` will be sized to at least `(numRows + 1) * accessor.OFFSET_WIDTH` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
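For readers less familiar with Arrow's layout, the offset arithmetic discussed above can be pictured with a plain-Python sketch (the values are made up; the real logic lives in the Java `ArrowColumnVector.getArray`):

```python
# With numRows = 4, the offset buffer holds numRows + 1 entries.
offsets = [0, 3, 3, 7, 9]
values = [10, 11, 12, 20, 21, 22, 23, 30, 31]

def get_array(row_id):
    # Mirrors reading the offsets at rowId and rowId + 1: the element range
    # for a row is [offsets[row_id], offsets[row_id + 1]).
    start, end = offsets[row_id], offsets[row_id + 1]
    return values[start:end]

print(get_array(0))  # [10, 11, 12]
print(get_array(1))  # []  (row 1 is an empty array)
print(get_array(3))  # [30, 31]
```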
[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20397#discussion_r164248501 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java --- @@ -30,21 +30,21 @@ @InterfaceStability.Evolving public interface SupportsScanColumnarBatch extends DataSourceV2Reader { @Override - default List createReadTasks() { + default List createDataReaderFactories() { --- End diff -- We shall create only one `DataReaderFactory`, and have that create multiple data readers. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164248098 --- Diff: python/pyspark/sql/window.py --- @@ -124,16 +124,19 @@ def rangeBetween(start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). """ -if start <= Window._PRECEDING_THRESHOLD: -start = Window.unboundedPreceding -if end >= Window._FOLLOWING_THRESHOLD: -end = Window.unboundedFollowing +if isinstance(start, int) and isinstance(end, int): --- End diff -- oh, thanks @HyukjinKwon , I'm not familiar with python :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20412 **[Test build #4078 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4078/testReport)** for PR 20412 at commit [`bd124b3`](https://github.com/apache/spark/commit/bd124b3c23a60200c83c25b6c786795a188bb9aa). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164245801 --- Diff: python/pyspark/sql/window.py --- @@ -124,16 +124,19 @@ def rangeBetween(start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). """ -if start <= Window._PRECEDING_THRESHOLD: -start = Window.unboundedPreceding -if end >= Window._FOLLOWING_THRESHOLD: -end = Window.unboundedFollowing +if isinstance(start, int) and isinstance(end, int): --- End diff -- FYI, Python 3 doesn't have `long` and it was merged to `int`. We should do `long` here and assign `int` to `long` in Python 3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
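A small sketch of the compatibility pattern suggested above (the alias trick is the usual idiom; the helper name is made up for illustration):

```python
import sys

if sys.version >= "3":
    long = int  # Python 3 folded `long` into `int`

def is_integral(value):
    # On Python 2, a literal such as 9223372036854775808 is a `long`, not an
    # `int`, so isinstance(value, int) alone would reject valid boundaries.
    return isinstance(value, (int, long))

print(is_integral(10))                   # True
print(is_integral(9223372036854775808))  # True on both Python 2 and 3
print(is_integral(1.5))                  # False
```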
[GitHub] spark pull request #20410: [SPARK-23234][ML][PYSPARK] Remove setting default...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/20410#discussion_r164243811 --- Diff: python/pyspark/ml/wrapper.py --- @@ -118,10 +118,9 @@ def _transfer_params_to_java(self): """ Transforms the embedded params to the companion Java object. """ -paramMap = self.extractParamMap() for param in self.params: -if param in paramMap: -pair = self._make_java_param_pair(param, paramMap[param]) +if param in self._paramMap: +pair = self._make_java_param_pair(param, self._paramMap[param]) --- End diff -- The intent would be clearer if you called ``` if self.isSet(param): pair = self._make_java_param_pair(param, self.getOrDefault(param)) ``` And to be 100% consistent with Scala, it might be a good idea to transfer default values anyway. That way, if the Python default were ever changed or differed from the Scala one, it wouldn't cause an issue that would be really hard to detect. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20409 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20398: [SPARK-23221][SS][TEST] Fix KafkaContinuousSourceStressF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20398 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20398: [SPARK-23221][SS][TEST] Fix KafkaContinuousSourceStressF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20398 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86723/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataF...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20393 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20398: [SPARK-23221][SS][TEST] Fix KafkaContinuousSourceStressF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20398 **[Test build #86723 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86723/testReport)** for PR 20398 at commit [`3093d91`](https://github.com/apache/spark/commit/3093d919ada739fca10b1fcc73e9b2c620eeeb16). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20393 I opened https://issues.apache.org/jira/browse/SPARK-23243 to track the RDD.repartition() patch. Thanks for all the discussion! @shivaram @mridulm @sameeragarwal @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20393 LGTM but we should get a broader consensus on this. In the meantime, I'm merging this patch to master/2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20409: [SPARK-23233][PYTHON] Reset the cache in asNondet...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20409#discussion_r164242518 --- Diff: python/pyspark/sql/udf.py --- @@ -188,6 +188,9 @@ def asNondeterministic(self): .. versionadded:: 2.3 """ +# Here, we explicitly clean the cache to create a JVM UDF instance +# with 'deterministic' updated. See SPARK-23233. +self._judf_placeholder = None --- End diff -- Do you mean the reproducer above? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
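For context, a rough sketch of the caching problem this line addresses (assuming an active `SparkSession` bound to `spark`; the authoritative reproducer is the one in the PR description):

```python
from pyspark.sql.functions import udf

f = udf(lambda x: x, "string")
spark.range(1).select(f("id")).collect()  # first use builds and caches the JVM UDF (deterministic)

f = f.asNondeterministic()
# Without resetting _judf_placeholder, the cached JVM UDF would be reused here
# and still report itself as deterministic.
spark.range(1).select(f("id")).collect()
```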
[GitHub] spark issue #20408: [SPARK-23189][Core][Web UI] Reflect stage level blacklis...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/20408 I'll take a look at the code when I have a moment, but from a UI perspective I have only one issue. Having the status of `Blacklisted in Stages: [...]` when an executor is Active could easily be misunderstood. I think it should still say `Active`, though `Active (Blacklisted in Stages: [...])` is a bit wordy, so I'm not sure of the best way to present it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20393 Updated the title. Does this PR sound good now? I'll open another one to address the RDD.repartition() issue (which will target 2.4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20393 Another (possibly cleaner) approach here would be to make the shuffle block fetch order deterministic but I agree that it might not be safe to include it in 2.3 this late. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20412 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20412 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86724/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20412 **[Test build #86724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86724/testReport)** for PR 20412 at commit [`bd124b3`](https://github.com/apache/spark/commit/bd124b3c23a60200c83c25b6c786795a188bb9aa). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `abstract class KafkaSourceSuiteBase extends KafkaSourceTest ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20412 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20396 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove setting defaults on Ja...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20410 I think that the problem is not SPARK-22797. The problem is that before this PR, the Python API considers all parameters with a default value as Defined but not Set, while the Scala/Java class representing them considers them all as Set. This has come up in this case, but it can cause other problems, both now and in the future, because it creates an inconsistency between the Python API and its representation in the JVM backend. Thus I do believe that this PR is needed, and it is not only a fix for the test failures. I think this is a first step; a second step would be to later drop all the `setDefault` calls in the Python API in favor of retrieving the defaults from the JVM backend. In this way, we would be sure there is no logical inconsistency between the API and the backend. Unfortunately, that second part is much bigger and has a large impact, so I think it would need a design doc or something similar. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
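To make the Defined-vs-Set distinction concrete, here is a small illustration (using `LogisticRegression` purely as an example of a Params holder; it shows the Python-side behaviour discussed above):

```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression()   # maxIter has a default of 100 and is never set by the user
param = lr.maxIter

print(lr.isDefined(param))     # True:  a default value exists
print(lr.isSet(param))         # False: the user did not set it explicitly
print(lr.getOrDefault(param))  # 100
# The inconsistency described above arises when such a default is transferred
# to the JVM object, which then reports the param as explicitly set.
```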
[GitHub] spark pull request #20332: [SPARK-23138][ML][DOC] Multiclass logistic regres...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20332#discussion_r164237753 --- Diff: docs/ml-classification-regression.md --- @@ -111,10 +110,9 @@ Continuing the earlier example: [`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html) provides a summary for a [`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html). -Currently, only binary classification is supported and the -summary must be explicitly cast to -[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html). -Support for multiclass model summaries will be added in the future. +In the case of binary classification, certain additional metrics are --- End diff -- Now `BinaryLogisticRegressionTrainingSummary` inherits `LogisticRegressionSummary`, so it inherits all the metrics in `LogisticRegressionSummary`. We'd better mention them in the doc. :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20403 **[Test build #86726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86726/testReport)** for PR 20403 at commit [`1f4d288`](https://github.com/apache/spark/commit/1f4d2884ba5b56e06427ce3d91cb6ac5f8f2b7b6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/302/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20208 Hi @rxin, @cloud-fan, @sameeragarwal, @HyukjinKwon. Could you give me your opinions on this PR? I know that Xiao Li is busy during this period, so I didn't ping him. For me, this PR is important. Sorry for bothering you all. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20403 this has been failing due to #19892 which was recently reverted --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20403 jenkins retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20412 **[Test build #86724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86724/testReport)** for PR 20412 at commit [`bd124b3`](https://github.com/apache/spark/commit/bd124b3c23a60200c83c25b6c786795a188bb9aa). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #86725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86725/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/301/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20407 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/300/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20412 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20407 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20412 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/299/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20412 LGTM. My bad, I should have caught this in the original PR that @jose-torres made. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20208 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20407 **[Test build #86722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86722/testReport)** for PR 20407 at commit [`074c342`](https://github.com/apache/spark/commit/074c34245d300901390d2d5ed74bb69e32539b8a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20398: [SPARK-23221][SS][TEST] Fix KafkaContinuousSourceStressF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20398 **[Test build #86723 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86723/testReport)** for PR 20398 at commit [`3093d91`](https://github.com/apache/spark/commit/3093d919ada739fca10b1fcc73e9b2c620eeeb16). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaS...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/20412 [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSuiteBase twice ## What changes were proposed in this pull request? KafkaSourceSuiteBase should be an abstract class; otherwise KafkaSourceSuiteBase itself will also run. ## How was this patch tested? Jenkins You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-23242 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20412.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20412 commit bd124b3c23a60200c83c25b6c786795a188bb9aa Author: Shixiong Zhu Date: 2018-01-26T22:04:12Z Don't run KafkaSourceSuiteBase --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20407 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20407 SPARK-23234 is reverted now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20393 @jiangxb1987 Btw, we could argue this is a correctness issue since we added repartition - so not necessarily blocker :-) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org