[GitHub] [spark] HyukjinKwon commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


HyukjinKwon commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652404068



##########
File path: python/docs/source/development/ps_contributing.rst
##########
@@ -1,192 +0,0 @@
-==================
-Contributing Guide
-==================
-
-.. contents:: Table of contents:
-   :depth: 1
-   :local:
-
-Types of Contributions
-======================
-
-The largest amount of work consists simply of implementing the pandas API 
using Spark's built-in functions, which is usually straightforward. But there 
are many different forms of contributions in addition to writing code:
-
-1. Use the project and provide feedback, by creating new tickets or commenting 
on existing relevant tickets.
-
-2. Review existing pull requests.
-
-3. Improve the project's documentation.
-
-4. Write blog posts or tutorial articles evangelizing pandas API on Spark and 
help new users learn pandas API on Spark.
-
-5. Give a talk about pandas API on Spark at your local meetup or a conference.
-
-
-Step-by-step Guide For Code Contributions
-=========================================
-
-1. Read and understand the `Design Principles `_ for the project. 
Contributions should follow these principles.
-
-2. Signaling your work: If you are working on something, comment on the 
relevant ticket that you are doing so to avoid multiple people taking on the 
same work at the same time. It is also a good practice to signal that your work 
has stalled or you have moved on and want somebody else to take over.
-
-3. Understand what the functionality is in pandas or in Spark.
-
-4. Implement the functionality, with test cases providing close to 100% 
statement coverage. Document the functionality.
-
-5. Run existing and new test cases to make sure they still pass. Also run the 
`dev/reformat` script to reformat Python files using `Black 
`_, and run the linter `dev/lint-python`.
-
-6. Build the docs (`make html` in the `docs` directory) and verify that the docs 
related to your change look OK.
-
-7. Submit a pull request, and be responsive to code review feedback from other 
community members.
-
-That's it. Your contribution, once merged, will be available in the next 
release.
-
-
-Environment Setup
-=================
-
-Conda
------
-
-If you are using Conda, the pandas API on Spark installation and development 
environment are as follows.
-
-.. code-block:: bash
-
-# Python 3.6+ is required
-conda create --name koalas-dev-env python=3.6
-conda activate koalas-dev-env
-conda install -c conda-forge pyspark=2.4
-pip install -r requirements-dev.txt
-pip install -e .  # installs koalas from current checkout
-
-Once set up, make sure you switch to `koalas-dev-env` before development:
-
-.. code-block:: bash
-
-conda activate koalas-dev-env
-
-pip
----
-
-With Python 3.6+, pip can be used as below to install and set up the 
development environment.
-
-.. code-block:: bash
-
-pip install pyspark==2.4
-pip install -r requirements-dev.txt
-pip install -e .  # installs koalas from current checkout
-
-Running Tests
-=============
-
-There is a script `./dev/pytest` which is exactly the same as `pytest` but with 
some default settings to run the tests easily.
-
-To run all the tests, similar to our CI pipeline:
-
-.. code-block:: bash
-
-# Run all unittest and doctest
-./dev/pytest
-
-To run a specific test file:
-
-.. code-block:: bash
-
-# Run unittest
-./dev/pytest -k test_dataframe.py
-
-# Run doctest
-./dev/pytest -k series.py --doctest-modules databricks
-
-To run a specific doctest/unittest:
-
-.. code-block:: bash
-
-# Run unittest
-./dev/pytest -k "DataFrameTest and test_Dataframe"
-
-# Run doctest
-./dev/pytest -k DataFrame.corr --doctest-modules databricks
-
-Note that `-k` is used with plain substrings above for simplicity, although it 
accepts a full boolean expression. You can use `--verbose` to see the collected 
test names you can filter on. See `pytest --help` for more details.
-
-
-Building Documentation

Review comment:
   Removed as it's a duplicate of 
https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-documentation-changes
 and https://github.com/apache/spark/blob/master/docs/README.md
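As an aside on the "Running Tests" snippet quoted above: `-k` accepts a full boolean expression over test ids, not just a substring. A minimal sketch via pytest's documented Python entry point (the expression shown is illustrative only):

```python
import pytest

# Equivalent in spirit to `./dev/pytest -k "DataFrameTest and not stats"`:
# select tests whose ids match "DataFrameTest" but not "stats".
pytest.main(["-k", "DataFrameTest and not stats", "--verbose"])
```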




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


HyukjinKwon commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652404324



##########
File path: python/docs/source/development/ps_contributing.rst
##########
@@ -1,192 +0,0 @@
-==================
-Contributing Guide
-==================
-
-.. contents:: Table of contents:
-   :depth: 1
-   :local:
-
-Types of Contributions
-======================
-
-The largest amount of work consists simply of implementing the pandas API 
using Spark's built-in functions, which is usually straightforward. But there 
are many different forms of contributions in addition to writing code:
-
-1. Use the project and provide feedback, by creating new tickets or commenting 
on existing relevant tickets.
-
-2. Review existing pull requests.
-
-3. Improve the project's documentation.
-
-4. Write blog posts or tutorial articles evangelizing pandas API on Spark and 
help new users learn pandas API on Spark.
-
-5. Give a talk about pandas API on Spark at your local meetup or a conference.
-
-
-Step-by-step Guide For Code Contributions
-=========================================
-
-1. Read and understand the `Design Principles `_ for the project. 
Contributions should follow these principles.
-
-2. Signaling your work: If you are working on something, comment on the 
relevant ticket that you are doing so to avoid multiple people taking on the 
same work at the same time. It is also a good practice to signal that your work 
has stalled or you have moved on and want somebody else to take over.
-
-3. Understand what the functionality is in pandas or in Spark.
-
-4. Implement the functionality, with test cases providing close to 100% 
statement coverage. Document the functionality.
-
-5. Run existing and new test cases to make sure they still pass. Also run the 
`dev/reformat` script to reformat Python files using `Black 
`_, and run the linter `dev/lint-python`.
-
-6. Build the docs (`make html` in the `docs` directory) and verify that the docs 
related to your change look OK.
-
-7. Submit a pull request, and be responsive to code review feedback from other 
community members.
-
-That's it. Your contribution, once merged, will be available in the next 
release.
-
-
-Environment Setup
-=================
-
-Conda
------
-
-If you are using Conda, the pandas API on Spark installation and development 
environment are as follows.
-
-.. code-block:: bash
-
-# Python 3.6+ is required
-conda create --name koalas-dev-env python=3.6
-conda activate koalas-dev-env
-conda install -c conda-forge pyspark=2.4
-pip install -r requirements-dev.txt
-pip install -e .  # installs koalas from current checkout
-
-Once set up, make sure you switch to `koalas-dev-env` before development:
-
-.. code-block:: bash
-
-conda activate koalas-dev-env
-
-pip
----
-
-With Python 3.6+, pip can be used as below to install and set up the 
development environment.
-
-.. code-block:: bash
-
-pip install pyspark==2.4
-pip install -r requirements-dev.txt
-pip install -e .  # installs koalas from current checkout
-
-Running Tests
-=============
-
-There is a script `./dev/pytest` which is exactly the same as `pytest` but with 
some default settings to run the tests easily.
-
-To run all the tests, similar to our CI pipeline:
-
-.. code-block:: bash
-
-# Run all unittest and doctest
-./dev/pytest
-
-To run a specific test file:
-
-.. code-block:: bash
-
-# Run unittest
-./dev/pytest -k test_dataframe.py
-
-# Run doctest
-./dev/pytest -k series.py --doctest-modules databricks
-
-To run a specific doctest/unittest:
-
-.. code-block:: bash
-
-# Run unittest
-./dev/pytest -k "DataFrameTest and test_Dataframe"
-
-# Run doctest
-./dev/pytest -k DataFrame.corr --doctest-modules databricks
-
-Note that `-k` is used with plain substrings above for simplicity, although it 
accepts a full boolean expression. You can use `--verbose` to see the collected 
test names you can filter on. See `pytest --help` for more details.
-
-
-Building Documentation
-======================
-
-To build documentation via Sphinx:
-
-.. code-block:: bash
-
- cd docs && make clean html
-
-It generates HTML files under the `docs/build/html` directory. Open 
`docs/build/html/index.html` to check that the documentation is built properly.
-
-
-Coding Conventions

Review comment:
   Removed; duplicate of 
https://spark.apache.org/docs/latest/api/python/development/contributing.html#code-and-docstring-guide
 and https://spark.apache.org/contributing.html

##########
File path: python/docs/source/development/ps_contributing.rst
##########
@@ -1,192 +0,0 @@
-==================
-Contributing Guide
-==================
-
-.. contents:: Table of contents:
-   :depth: 1
-   :local:
-
-Types of Contributions
-======================
-
-The largest amount of work consists simply of implementing the pandas API 
using Spark's built-in functions, which is usually straightforward. But there 
are many different forms of contributions in addition to writing c

[GitHub] [spark] HyukjinKwon commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


HyukjinKwon commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652404681



##########
File path: python/docs/source/development/ps_contributing.rst
##########
@@ -1,192 +0,0 @@
-==================
-Contributing Guide
-==================
-
-.. contents:: Table of contents:
-   :depth: 1
-   :local:
-
-Types of Contributions
-======================
-
-The largest amount of work consists simply of implementing the pandas API 
using Spark's built-in functions, which is usually straightforward. But there 
are many different forms of contributions in addition to writing code:
-
-1. Use the project and provide feedback, by creating new tickets or commenting 
on existing relevant tickets.
-
-2. Review existing pull requests.
-
-3. Improve the project's documentation.
-
-4. Write blog posts or tutorial articles evangelizing pandas API on Spark and 
help new users learn pandas API on Spark.
-
-5. Give a talk about pandas API on Spark at your local meetup or a conference.
-
-
-Step-by-step Guide For Code Contributions
-=========================================
-
-1. Read and understand the `Design Principles `_ for the project. 
Contributions should follow these principles.
-
-2. Signaling your work: If you are working on something, comment on the 
relevant ticket that you are doing so to avoid multiple people taking on the 
same work at the same time. It is also a good practice to signal that your work 
has stalled or you have moved on and want somebody else to take over.
-
-3. Understand what the functionality is in pandas or in Spark.
-
-4. Implement the functionality, with test cases providing close to 100% 
statement coverage. Document the functionality.
-
-5. Run existing and new test cases to make sure they still pass. Also run the 
`dev/reformat` script to reformat Python files using `Black 
`_, and run the linter `dev/lint-python`.
-
-6. Build the docs (`make html` in the `docs` directory) and verify that the docs 
related to your change look OK.
-
-7. Submit a pull request, and be responsive to code review feedback from other 
community members.
-
-That's it. Your contribution, once merged, will be available in the next 
release.
-
-
-Environment Setup
-=================
-
-Conda
------
-
-If you are using Conda, the pandas API on Spark installation and development 
environment are as follows.
-
-.. code-block:: bash
-
-# Python 3.6+ is required
-conda create --name koalas-dev-env python=3.6
-conda activate koalas-dev-env
-conda install -c conda-forge pyspark=2.4
-pip install -r requirements-dev.txt
-pip install -e .  # installs koalas from current checkout
-
-Once set up, make sure you switch to `koalas-dev-env` before development:
-
-.. code-block:: bash
-
-conda activate koalas-dev-env
-
-pip
----
-
-With Python 3.6+, pip can be used as below to install and set up the 
development environment.
-
-.. code-block:: bash
-
-pip install pyspark==2.4
-pip install -r requirements-dev.txt
-pip install -e .  # installs koalas from current checkout
-
-Running Tests
-=============
-
-There is a script `./dev/pytest` which is exactly the same as `pytest` but with 
some default settings to run the tests easily.
-
-To run all the tests, similar to our CI pipeline:
-
-.. code-block:: bash
-
-# Run all unittest and doctest
-./dev/pytest
-
-To run a specific test file:
-
-.. code-block:: bash
-
-# Run unittest
-./dev/pytest -k test_dataframe.py
-
-# Run doctest
-./dev/pytest -k series.py --doctest-modules databricks
-
-To run a specific doctest/unittest:
-
-.. code-block:: bash
-
-# Run unittest
-./dev/pytest -k "DataFrameTest and test_Dataframe"
-
-# Run doctest
-./dev/pytest -k DataFrame.corr --doctest-modules databricks
-
-Note that `-k` is used with plain substrings above for simplicity, although it 
accepts a full boolean expression. You can use `--verbose` to see the collected 
test names you can filter on. See `pytest --help` for more details.
-
-
-Building Documentation
-======================
-
-To build documentation via Sphinx:
-
-.. code-block:: bash
-
- cd docs && make clean html
-
-It generates HTML files under the `docs/build/html` directory. Open 
`docs/build/html/index.html` to check that the documentation is built properly.
-
-
-Coding Conventions
-==================
-
-We follow `PEP 8 `_ with one 
exception: lines can be up to 100 characters in length, not 79.
-
-Doctest Conventions
-===================
-
-When writing doctests, the doctests in pandas are usually converted into 
pandas API on Spark to make sure the same code works in pandas API on Spark.
-In general, doctests should be grouped logically, with groups separated by a newline.
-
-For instance, the first block holds the statements for preparation, the 
second block uses the function with a specific argument,
-and the third block uses another argument. As an example, please refer to 
`DataFrame.rsub 
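A minimal sketch of the doctest grouping described above, with one block for preparation and one block per argument; the values and output are illustrative, not taken from the real `DataFrame.rsub` docstring:

```python
def rsub_doc_sketch():
    """
    Examples
    --------
    The first block prepares the data.

    >>> import pyspark.pandas as ps
    >>> psser = ps.Series([1, 2, 3])

    The second block exercises the function with one argument.

    >>> psser.rsub(10)
    0    9
    1    8
    2    7
    dtype: int64
    """
```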


[GitHub] [spark] Yikun commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


Yikun commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652403201



##########
File path: python/docs/source/development/contributing.rst
##########
@@ -72,17 +72,86 @@ Preparing to Contribute Code Changes
 
 
 Before starting to work on code in PySpark, it is recommended to read `the 
general guidelines `_.
-There are a couple of additional notes to keep in mind when contributing to 
code in PySpark:
+Additionally, there are a couple of notes to keep in mind when 
contributing to code in PySpark:
+
+* **Be Pythonic.**
+* **APIs are matched with Scala and Java sides in general.**
+* **PySpark specific APIs can still be considered as long as they are Pythonic 
and do not conflict with other existing APIs, for example, decorator usage of 
UDFs.**
+* **If you extend or modify the public API, please adjust the corresponding type 
hints. See `Contributing and Maintaining Type Hints`_ for details.**
+
+If you are fixing the pandas API on Spark (``pyspark.pandas``) package, please 
consider the design principles below:
+
+* **Return pandas-on-Spark data structure for big data, and pandas data 
structure for small data**
+Often developers face the question whether a particular function should 
return a pandas-on-Spark DataFrame/Series, or a pandas DataFrame/Series. The 
principle is: if the returned object can be large, use a pandas-on-Spark 
DataFrame/Series. If the data is bound to be small, use a pandas 
DataFrame/Series. For example, ``DataFrame.dtypes`` returns a pandas Series, 
because the number of columns in a DataFrame is bounded and small, whereas 
``DataFrame.head()`` or ``Series.unique()`` returns a pandas-on-Spark 
DataFrame/Series, because the resulting object can be large.
+
+* **Provide discoverable APIs for common data science tasks**
+At the risk of overgeneralization, there are two API design approaches: 
the first focuses on providing APIs for common tasks; the second starts with 
abstractions, and enables users to accomplish their tasks by composing 
primitives. While the world is not black and white, pandas takes more of the 
former approach, while Spark has taken more of the latter.
+
+One example is value count (count by some key column), one of the most 
common operations in data science. pandas ``DataFrame.value_counts`` returns the 
result in sorted order, which in 90% of the cases is what users prefer when 
exploring data, whereas Spark's does not sort, which is more desirable when 
building data pipelines, as users can accomplish the pandas behavior by adding 
an explicit ``orderBy``.
+
+Similar to pandas, pandas API on Spark should also lean more towards the 
former, providing discoverable APIs for common data science tasks. In most 
cases, this principle is well taken care of by simply implementing pandas' 
APIs. However, there will be circumstances in which pandas' APIs don't address 
a specific need, e.g. plotting for big data.
+
+* **Guardrails to prevent users from shooting themselves in the foot**
+Certain operations in pandas are prohibitively expensive as data scales, 
and we don't want to give users the illusion that they can rely on such 
operations in pandas API on Spark. That is to say, methods implemented in 
pandas API on Spark should be safe to perform by default on large datasets. As 
a result, the following capabilities are not implemented in pandas API on Spark:
+
+1. Capabilities that are fundamentally not parallelizable: e.g. 
imperatively looping over each element
+2. Capabilities that require materializing the entire working set in a 
single node's memory. This is why we do not implement 
`pandas.DataFrame.to_xarray 
`_.
 Another example is that the ``_repr_html_`` call caps the total number of records 
shown to a maximum of 1000, to prevent users from blowing up their driver node 
simply by typing the name of the DataFrame in a notebook.
+
+A few exceptions, however, exist. One common pattern with "big data 
science" is that while the initial dataset is large, the working set becomes 
smaller as the analysis goes deeper. For example, data scientists often perform 
aggregation on datasets and want to then convert the aggregated dataset to some 
local data structure. To help data scientists, we offer the following:
+
+* :func:`DataFrame.to_pandas` that returns a pandas DataFrame, koalas only

Review comment:
   koalas only  --> pandas API on Spark only?
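As an illustration of the big-data/small-data return-type principle quoted above, a minimal sketch on a made-up frame:

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Bounded by the column count, so a plain pandas Series comes back.
print(type(psdf.dtypes))   # <class 'pandas.core.series.Series'>

# head() can be arbitrarily large, so it stays a pandas-on-Spark DataFrame.
print(type(psdf.head(2)))  # <class 'pyspark.pandas.frame.DataFrame'>
```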





[GitHub] [spark] HyukjinKwon commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


HyukjinKwon commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652405522



##########
File path: python/docs/source/development/ps_design.rst
##########
@@ -1,85 +0,0 @@
-=================
-Design Principles
-=================
-
-.. currentmodule:: pyspark.pandas
-
-This section outlines design principles guiding the pandas API on Spark.
-
-Be Pythonic
------------
Review comment:
   Partially moved and merged (especially the snake_case part).
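The merged notes mention decorator usage of UDFs as an example of a Pythonic, PySpark-specific API; a minimal sketch of that pattern (names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

@udf("int")  # the decorator form of a PySpark UDF, with its return type
def plus_one(v):
    return v + 1

spark.range(3).select(plus_one("id").alias("id_plus_one")).show()
```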







[GitHub] [spark] AmplabJenkins commented on pull request #32899: [SPARK-35652][SQL][3.0] joinWith on two table generated from same one

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #32899:
URL: https://github.com/apache/spark/pull/32899#issuecomment-862105274


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/44369/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32882: [WIP][SPARK-35724][SQL] Support traversal pruning in extendedResolutionRules and postHocResolutionRules

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #32882:
URL: https://github.com/apache/spark/pull/32882#issuecomment-862105283


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139848/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32914: [SPARK-35763][SS] Add a new copy method to StateStoreCustomMetric

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #32914:
URL: https://github.com/apache/spark/pull/32914#issuecomment-862105282


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/44367/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.logStageExe

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862105275









[GitHub] [spark] AmplabJenkins commented on pull request #32907: [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-862105276









[GitHub] [spark] AmplabJenkins commented on pull request #32922: [SPARK-35774][SQL] Parse any year-month interval types in SQL

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #32922:
URL: https://github.com/apache/spark/pull/32922#issuecomment-862105279


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/44373/
   





[GitHub] [spark] wangyum commented on pull request #32693: [SPARK-35556][SQL][TESTS] Avoid log NoSuchMethodError when running multiple Hive version related tests

2021-06-16 Thread GitBox


wangyum commented on pull request #32693:
URL: https://github.com/apache/spark/pull/32693#issuecomment-862105731


   It seems `TmpOutputFile` and `TmpErrOutputFile`  are generated by 
[SessionState.start(state)](https://github.com/apache/spark/blob/ebb4858f7185c6525adc4b23bc89f0a8262bf940/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L829).
   
   I think we can remove these lines:
   
https://github.com/apache/spark/blob/ebb4858f7185c6525adc4b23bc89f0a8262bf940/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L158-L175
   and add `state.close()` to `runHive`.





[GitHub] [spark] HyukjinKwon commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


HyukjinKwon commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652406420



##########
File path: python/docs/source/development/ps_design.rst
##########
@@ -1,85 +0,0 @@
-=================
-Design Principles
-=================
-
-.. currentmodule:: pyspark.pandas
-
-This section outlines design principles guiding the pandas API on Spark.
-
-Be Pythonic
------------
-
-Pandas API on Spark targets Python data scientists. We want to stick to the 
convention that users are already familiar with as much as possible. Here are 
some examples:
-
-- Function names and parameters use snake_case, rather than CamelCase. This is 
different from PySpark's design. For example, pandas API on Spark has 
`to_pandas()`, whereas PySpark has `toPandas()` for converting a DataFrame into 
a pandas DataFrame. In limited cases, to maintain compatibility with Spark, we 
also provide Spark's variant as an alias.
-
-- Pandas API on Spark respects to the largest extent the conventions of the 
Python numerical ecosystem, and allows the use of NumPy types, etc. that can be 
supported by Spark.
-
-- pandas-on-Spark docs' style and infrastructure simply follow the rest of the 
PyData projects'.
-
-Unify small data (pandas) API and big data (Spark) API, but pandas first

Review comment:
   This is removed. Now pandas-on-Spark specific APIs are placed under 
`DataFrame.pandas_on_spark`, and PySpark APIs are placed under `DataFrame.spark` 
for now. `DataFrame` itself mainly holds pandas APIs only.
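The snake_case point from the quoted "Be Pythonic" section, in a minimal sketch (round-trips on a toy frame):

```python
import pyspark.pandas as ps

psdf = ps.range(3)        # a small pandas-on-Spark DataFrame

pdf = psdf.to_pandas()    # pandas API on Spark: snake_case
sdf = psdf.to_spark()     # drop down to a PySpark DataFrame
pdf2 = sdf.toPandas()     # PySpark keeps its camelCase spelling
```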







[GitHub] [spark] HyukjinKwon commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


HyukjinKwon commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652406557



##########
File path: python/docs/source/development/ps_design.rst
##########
@@ -1,85 +0,0 @@
-=================
-Design Principles
-=================
-
-.. currentmodule:: pyspark.pandas
-
-This section outlines design principles guiding the pandas API on Spark.
-
-Be Pythonic
------------
-
-Pandas API on Spark targets Python data scientists. We want to stick to the 
convention that users are already familiar with as much as possible. Here are 
some examples:
-
-- Function names and parameters use snake_case, rather than CamelCase. This is 
different from PySpark's design. For example, pandas API on Spark has 
`to_pandas()`, whereas PySpark has `toPandas()` for converting a DataFrame into 
a pandas DataFrame. In limited cases, to maintain compatibility with Spark, we 
also provide Spark's variant as an alias.
-
-- Pandas API on Spark respects to the largest extent the conventions of the 
Python numerical ecosystem, and allows the use of NumPy types, etc. that can be 
supported by Spark.
-
-- pandas-on-Spark docs' style and infrastructure simply follow the rest of the 
PyData projects'.
-
-Unify small data (pandas) API and big data (Spark) API, but pandas first
-------------------------------------------------------------------------
-
-The pandas-on-Spark DataFrame is meant to provide the best of pandas and Spark 
under a single API, with easy and clear conversions between each API when 
necessary. When Spark and pandas have similar APIs with subtle differences, the 
principle is to honor the contract of the pandas API first.
-
-There are different classes of functions:
-
- 1. Functions that are found in both Spark and pandas under the same name 
(`count`, `dtypes`, `head`). The return type is the same as in 
pandas (and not Spark's).
-
- 2. Functions that are found in Spark but that have a clear equivalent in 
pandas, e.g. `alias` and `rename`. These functions will be implemented as the 
alias of the pandas function, but should be marked that they are aliases of the 
same functions. They are provided so that existing users of PySpark can get the 
benefits of pandas API on Spark without having to adapt their code.
- 
- 3. Functions that are only found in pandas. When these functions are 
appropriate for distributed datasets, they should become available in pandas 
API on Spark.
- 
- 4. Functions that are only found in Spark that are essential to controlling 
the distributed nature of the computations, e.g. `cache`. These functions 
should be available in pandas API on Spark.
-
-We are still debating whether data transformation functions only available in 
Spark should be added to pandas API on Spark, e.g. `select`. We would love to 
hear your feedback on that.
-
-Return pandas-on-Spark data structure for big data, and pandas data structure 
for small data

Review comment:
   Moved and merged.
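A quick sketch of the pandas-contract-first rule for same-name functions such as `count` (toy data; the commented output is what pandas semantics imply):

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2], "b": [3, None]})

# pandas semantics win: count() is per-column non-null counts (a Series),
# not Spark's single row count.
print(psdf.count())
# a    2
# b    1
# dtype: int64
```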







[GitHub] [spark] HyukjinKwon commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


HyukjinKwon commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652406643



##########
File path: python/docs/source/development/ps_design.rst
##########
@@ -1,85 +0,0 @@
-=================
-Design Principles
-=================
-
-.. currentmodule:: pyspark.pandas
-
-This section outlines design principles guiding the pandas API on Spark.
-
-Be Pythonic
------------
-
-Pandas API on Spark targets Python data scientists. We want to stick to the 
convention that users are already familiar with as much as possible. Here are 
some examples:
-
-- Function names and parameters use snake_case, rather than CamelCase. This is 
different from PySpark's design. For example, pandas API on Spark has 
`to_pandas()`, whereas PySpark has `toPandas()` for converting a DataFrame into 
a pandas DataFrame. In limited cases, to maintain compatibility with Spark, we 
also provide Spark's variant as an alias.
-
-- Pandas API on Spark respects to the largest extent the conventions of the 
Python numerical ecosystem, and allows the use of NumPy types, etc. that can be 
supported by Spark.
-
-- pandas-on-Spark docs' style and infrastructure simply follow the rest of the 
PyData projects'.
-
-Unify small data (pandas) API and big data (Spark) API, but pandas first
-------------------------------------------------------------------------
-
-The pandas-on-Spark DataFrame is meant to provide the best of pandas and Spark 
under a single API, with easy and clear conversions between each API when 
necessary. When Spark and pandas have similar APIs with subtle differences, the 
principle is to honor the contract of the pandas API first.
-
-There are different classes of functions:
-
- 1. Functions that are found in both Spark and pandas under the same name 
(`count`, `dtypes`, `head`). The return type is the same as in 
pandas (and not Spark's).
-
- 2. Functions that are found in Spark but that have a clear equivalent in 
pandas, e.g. `alias` and `rename`. These functions will be implemented as the 
alias of the pandas function, but should be marked that they are aliases of the 
same functions. They are provided so that existing users of PySpark can get the 
benefits of pandas API on Spark without having to adapt their code.
- 
- 3. Functions that are only found in pandas. When these functions are 
appropriate for distributed datasets, they should become available in pandas 
API on Spark.
- 
- 4. Functions that are only found in Spark that are essential to controlling 
the distributed nature of the computations, e.g. `cache`. These functions 
should be available in pandas API on Spark.
-
-We are still debating whether data transformation functions only available in 
Spark should be added to pandas API on Spark, e.g. `select`. We would love to 
hear your feedback on that.
-
-Return pandas-on-Spark data structure for big data, and pandas data structure 
for small data
---------------------------------------------------------------------------------------------
-
-Often developers face the question whether a particular function should return 
a pandas-on-Spark DataFrame/Series, or a pandas DataFrame/Series. The principle 
is: if the returned object can be large, use a pandas-on-Spark 
DataFrame/Series. If the data is bound to be small, use a pandas 
DataFrame/Series. For example, `DataFrame.dtypes` returns a pandas Series, 
because the number of columns in a DataFrame is bounded and small, whereas 
`DataFrame.head()` or `Series.unique()` returns a pandas-on-Spark 
DataFrame/Series, because the resulting object can be large.
-
-Provide discoverable APIs for common data science tasks

Review comment:
   Moved and merged







[GitHub] [spark] HyukjinKwon commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


HyukjinKwon commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652406807



##########
File path: python/docs/source/development/ps_design.rst
##########
@@ -1,85 +0,0 @@
-=================
-Design Principles
-=================
-
-.. currentmodule:: pyspark.pandas
-
-This section outlines design principles guiding the pandas API on Spark.
-
-Be Pythonic
------------
-
-Pandas API on Spark targets Python data scientists. We want to stick to the 
convention that users are already familiar with as much as possible. Here are 
some examples:
-
-- Function names and parameters use snake_case, rather than CamelCase. This is 
different from PySpark's design. For example, pandas API on Spark has 
`to_pandas()`, whereas PySpark has `toPandas()` for converting a DataFrame into 
a pandas DataFrame. In limited cases, to maintain compatibility with Spark, we 
also provide Spark's variant as an alias.
-
-- Pandas API on Spark respects to the largest extent the conventions of the 
Python numerical ecosystem, and allows the use of NumPy types, etc. that can be 
supported by Spark.
-
-- pandas-on-Spark docs' style and infrastructure simply follow the rest of the 
PyData projects'.
-
-Unify small data (pandas) API and big data (Spark) API, but pandas first
-------------------------------------------------------------------------
-
-The pandas-on-Spark DataFrame is meant to provide the best of pandas and Spark 
under a single API, with easy and clear conversions between each API when 
necessary. When Spark and pandas have similar APIs with subtle differences, the 
principle is to honor the contract of the pandas API first.
-
-There are different classes of functions:
-
- 1. Functions that are found in both Spark and pandas under the same name 
(`count`, `dtypes`, `head`). The return type is the same as in 
pandas (and not Spark's).
-
- 2. Functions that are found in Spark but that have a clear equivalent in 
pandas, e.g. `alias` and `rename`. These functions will be implemented as the 
alias of the pandas function, but should be marked that they are aliases of the 
same functions. They are provided so that existing users of PySpark can get the 
benefits of pandas API on Spark without having to adapt their code.
- 
- 3. Functions that are only found in pandas. When these functions are 
appropriate for distributed datasets, they should become available in pandas 
API on Spark.
- 
- 4. Functions that are only found in Spark that are essential to controlling 
the distributed nature of the computations, e.g. `cache`. These functions 
should be available in pandas API on Spark.
-
-We are still debating whether data transformation functions only available in 
Spark should be added to pandas API on Spark, e.g. `select`. We would love to 
hear your feedback on that.
-
-Return pandas-on-Spark data structure for big data, and pandas data structure 
for small data
---------------------------------------------------------------------------------------------
-
-Often developers face the question whether a particular function should return 
a pandas-on-Spark DataFrame/Series, or a pandas DataFrame/Series. The principle 
is: if the returned object can be large, use a pandas-on-Spark 
DataFrame/Series. If the data is bound to be small, use a pandas 
DataFrame/Series. For example, `DataFrame.dtypes` returns a pandas Series, 
because the number of columns in a DataFrame is bounded and small, whereas 
`DataFrame.head()` or `Series.unique()` returns a pandas-on-Spark 
DataFrame/Series, because the resulting object can be large.
-
-Provide discoverable APIs for common data science tasks
--------------------------------------------------------
-
-At the risk of overgeneralization, there are two API design approaches: the 
first focuses on providing APIs for common tasks; the second starts with 
abstractions, and enables users to accomplish their tasks by composing 
primitives. While the world is not black and white, pandas takes more of the 
former approach, while Spark has taken more of the latter.
-
-One example is value count (count by some key column), one of the most common 
operations in data science. pandas `DataFrame.value_counts` returns the result 
in sorted order, which in 90% of the cases is what users prefer when exploring 
data, whereas Spark's does not sort, which is more desirable when building data 
pipelines, as users can accomplish the pandas behavior by adding an explicit 
`orderBy`.
-
-Similar to pandas, pandas API on Spark should also lean more towards the 
former, providing discoverable APIs for common data science tasks. In most 
cases, this principle is well taken care of by simply implementing pandas' 
APIs. However, there will be circumstances in which pandas' APIs don't address 
a specific need, e.g. plotting for big data.
-
-Provide well documented APIs, with examples

Review comment:
   Removed as it's a duplicate of https://spark.apache.org/contri
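The value-count example quoted above, in a minimal sketch (toy series; the explicit ordering is the pipeline-style spelling the text mentions):

```python
import pyspark.pandas as ps

psser = ps.Series(["a", "b", "a", "a", "b"])

# pandas contract: value_counts() comes back sorted by count.
print(psser.value_counts())

# Spark-style equivalent: no implicit sort, so order explicitly.
sdf = psser.to_frame("v").to_spark()
sdf.groupBy("v").count().orderBy("count", ascending=False).show()
```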

[GitHub] [spark] HyukjinKwon commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


HyukjinKwon commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652407133



##########
File path: python/docs/source/development/ps_design.rst
##########
@@ -1,85 +0,0 @@
-=================
-Design Principles
-=================
-
-.. currentmodule:: pyspark.pandas
-
-This section outlines design principles guiding the pandas API on Spark.
-
-Be Pythonic
------------
-
-Pandas API on Spark targets Python data scientists. We want to stick to the 
convention that users are already familiar with as much as possible. Here are 
some examples:
-
-- Function names and parameters use snake_case, rather than CamelCase. This is 
different from PySpark's design. For example, pandas API on Spark has 
`to_pandas()`, whereas PySpark has `toPandas()` for converting a DataFrame into 
a pandas DataFrame. In limited cases, to maintain compatibility with Spark, we 
also provide Spark's variant as an alias.
-
-- Pandas API on Spark respects to the largest extent the conventions of the 
Python numerical ecosystem, and allows the use of NumPy types, etc. that can be 
supported by Spark.
-
-- pandas-on-Spark docs' style and infrastructure simply follow the rest of the 
PyData projects'.
-
-Unify small data (pandas) API and big data (Spark) API, but pandas first
-------------------------------------------------------------------------
-
-The pandas-on-Spark DataFrame is meant to provide the best of pandas and Spark 
under a single API, with easy and clear conversions between each API when 
necessary. When Spark and pandas have similar APIs with subtle differences, the 
principle is to honor the contract of the pandas API first.
-
-There are different classes of functions:
-
- 1. Functions that are found in both Spark and pandas under the same name 
(`count`, `dtypes`, `head`). The return type is the same as in 
pandas (and not Spark's).
-
- 2. Functions that are found in Spark but that have a clear equivalent in 
pandas, e.g. `alias` and `rename`. These functions will be implemented as the 
alias of the pandas function, but should be marked that they are aliases of the 
same functions. They are provided so that existing users of PySpark can get the 
benefits of pandas API on Spark without having to adapt their code.
- 
- 3. Functions that are only found in pandas. When these functions are 
appropriate for distributed datasets, they should become available in pandas 
API on Spark.
- 
- 4. Functions that are only found in Spark that are essential to controlling 
the distributed nature of the computations, e.g. `cache`. These functions 
should be available in pandas API on Spark.
-
-We are still debating whether data transformation functions only available in 
Spark should be added to pandas API on Spark, e.g. `select`. We would love to 
hear your feedback on that.
-
-Return pandas-on-Spark data structure for big data, and pandas data structure 
for small data
---------------------------------------------------------------------------------------------
-
-Often developers face the question whether a particular function should return 
a pandas-on-Spark DataFrame/Series, or a pandas DataFrame/Series. The principle 
is: if the returned object can be large, use a pandas-on-Spark 
DataFrame/Series. If the data is bound to be small, use a pandas 
DataFrame/Series. For example, `DataFrame.dtypes` returns a pandas Series, 
because the number of columns in a DataFrame is bounded and small, whereas 
`DataFrame.head()` or `Series.unique()` returns a pandas-on-Spark 
DataFrame/Series, because the resulting object can be large.
-
-Provide discoverable APIs for common data science tasks
--------------------------------------------------------
-
-At the risk of overgeneralization, there are two API design approaches: the 
first focuses on providing APIs for common tasks; the second starts with 
abstractions, and enables users to accomplish their tasks by composing 
primitives. While the world is not black and white, pandas takes more of the 
former approach, while Spark has taken more of the latter.
-
-One example is value count (count by some key column), one of the most common 
operations in data science. pandas `DataFrame.value_counts` returns the result 
in sorted order, which in 90% of the cases is what users prefer when exploring 
data, whereas Spark's does not sort, which is more desirable when building data 
pipelines, as users can accomplish the pandas behavior by adding an explicit 
`orderBy`.
-
-Similar to pandas, pandas API on Spark should also lean more towards the 
former, providing discoverable APIs for common data science tasks. In most 
cases, this principle is well taken care of by simply implementing pandas' 
APIs. However, there will be circumstances in which pandas' APIs don't address 
a specific need, e.g. plotting for big data.
-
-Provide well documented APIs, with examples
--------------------------------------------
-
-All functions and parameters should 

[GitHub] [spark] HyukjinKwon commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


HyukjinKwon commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652407018



##########
File path: python/docs/source/development/ps_design.rst
##########
@@ -1,85 +0,0 @@
-=================
-Design Principles
-=================
-
-.. currentmodule:: pyspark.pandas
-
-This section outlines design principles guiding the pandas API on Spark.
-
-Be Pythonic
------------
-
-Pandas API on Spark targets Python data scientists. We want to stick to the 
convention that users are already familiar with as much as possible. Here are 
some examples:
-
-- Function names and parameters use snake_case, rather than CamelCase. This is 
different from PySpark's design. For example, pandas API on Spark has 
`to_pandas()`, whereas PySpark has `toPandas()` for converting a DataFrame into 
a pandas DataFrame. In limited cases, to maintain compatibility with Spark, we 
also provide Spark's variant as an alias.
-
-- Pandas API on Spark respects to the largest extent the conventions of the 
Python numerical ecosystem, and allows the use of NumPy types, etc. that can be 
supported by Spark.
-
-- pandas-on-Spark docs' style and infrastructure simply follow the rest of the 
PyData projects'.
-
-Unify small data (pandas) API and big data (Spark) API, but pandas first
-------------------------------------------------------------------------
-
-The pandas-on-Spark DataFrame is meant to provide the best of pandas and Spark 
under a single API, with easy and clear conversions between each API when 
necessary. When Spark and pandas have similar APIs with subtle differences, the 
principle is to honor the contract of the pandas API first.
-
-There are different classes of functions:
-
- 1. Functions that are found in both Spark and pandas under the same name 
(`count`, `dtypes`, `head`). The return type is the same as in 
pandas (and not Spark's).
-
- 2. Functions that are found in Spark but that have a clear equivalent in 
pandas, e.g. `alias` and `rename`. These functions will be implemented as the 
alias of the pandas function, but should be marked that they are aliases of the 
same functions. They are provided so that existing users of PySpark can get the 
benefits of pandas API on Spark without having to adapt their code.
- 
- 3. Functions that are only found in pandas. When these functions are 
appropriate for distributed datasets, they should become available in pandas 
API on Spark.
- 
- 4. Functions that are only found in Spark that are essential to controlling 
the distributed nature of the computations, e.g. `cache`. These functions 
should be available in pandas API on Spark.
-
-We are still debating whether data transformation functions only available in 
Spark should be added to pandas API on Spark, e.g. `select`. We would love to 
hear your feedback on that.
-
-Return pandas-on-Spark data structure for big data, and pandas data structure 
for small data
---------------------------------------------------------------------------------------------
-
-Often developers face the question whether a particular function should return 
a pandas-on-Spark DataFrame/Series, or a pandas DataFrame/Series. The principle 
is: if the returned object can be large, use a pandas-on-Spark 
DataFrame/Series. If the data is bound to be small, use a pandas 
DataFrame/Series. For example, `DataFrame.dtypes` returns a pandas Series, 
because the number of columns in a DataFrame is bounded and small, whereas 
`DataFrame.head()` or `Series.unique()` returns a pandas-on-Spark 
DataFrame/Series, because the resulting object can be large.
-
-Provide discoverable APIs for common data science tasks
--------------------------------------------------------
-
-At the risk of overgeneralization, there are two API design approaches: the 
first focuses on providing APIs for common tasks; the second starts with 
abstractions, and enables users to accomplish their tasks by composing 
primitives. While the world is not black and white, pandas takes more of the 
former approach, while Spark has taken more of the latter.
-
-One example is value count (count by some key column), one of the most common 
operations in data science. pandas `DataFrame.value_counts` returns the result 
in sorted order, which in 90% of the cases is what users prefer when exploring 
data, whereas Spark's does not sort, which is more desirable when building data 
pipelines, as users can accomplish the pandas behavior by adding an explicit 
`orderBy`.
-
-Similar to pandas, pandas API on Spark should also lean more towards the 
former, providing discoverable APIs for common data science tasks. In most 
cases, this principle is well taken care of by simply implementing pandas' 
APIs. However, there will be circumstances in which pandas' APIs don't address 
a specific need, e.g. plotting for big data.
-
-Provide well documented APIs, with examples
--------------------------------------------
-
-All functions and parameters should 

[GitHub] [spark] SparkQA commented on pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


SparkQA commented on pull request #32926:
URL: https://github.com/apache/spark/pull/32926#issuecomment-862107624


   **[Test build #139852 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139852/testReport)**
 for PR 32926 at commit 
[`03ccfe4`](https://github.com/apache/spark/commit/03ccfe48a403c6ade7b1d7d3dabd80c686f52f13).





[GitHub] [spark] SparkQA commented on pull request #32907: [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters

2021-06-16 Thread GitBox


SparkQA commented on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-862107746


   **[Test build #139853 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139853/testReport)**
 for PR 32907 at commit 
[`e2dfbce`](https://github.com/apache/spark/commit/e2dfbceb6835c0debacb8baed7387b63c9b3ccb8).





[GitHub] [spark] SparkQA commented on pull request #32867: [SPARK-35721][PYTHON] Path level discover for python unittests

2021-06-16 Thread GitBox


SparkQA commented on pull request #32867:
URL: https://github.com/apache/spark/pull/32867#issuecomment-862107867









[GitHub] [spark] AmplabJenkins commented on pull request #32867: [SPARK-35721][PYTHON] Path level discover for python unittests

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #32867:
URL: https://github.com/apache/spark/pull/32867#issuecomment-862107926


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139854/
   





[GitHub] [spark] SparkQA commented on pull request #32787: [SPARK-35618][SQL] Resolve star expressions in subqueries using outer query plans

2021-06-16 Thread GitBox


SparkQA commented on pull request #32787:
URL: https://github.com/apache/spark/pull/32787#issuecomment-862107995


   **[Test build #139855 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139855/testReport)**
 for PR 32787 at commit 
[`e0460c5`](https://github.com/apache/spark/commit/e0460c52c8d0d99eb06a618228ff7cd51b9c97ab).





[GitHub] [spark] HyukjinKwon commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


HyukjinKwon commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652408183



##########
File path: python/docs/source/development/contributing.rst
##########
@@ -72,17 +72,86 @@ Preparing to Contribute Code Changes
 
 
 Before starting to work on code in PySpark, it is recommended to read `the 
general guidelines `_.
-There are a couple of additional notes to keep in mind when contributing to 
code in PySpark:
+Additionally, there are a couple of notes to keep in mind when 
contributing to code in PySpark:
+
+* **Be Pythonic.**
+* **APIs are matched with Scala and Java sides in general.**
+* **PySpark specific APIs can still be considered as long as they are Pythonic 
and do not conflict with other existing APIs, for example, decorator usage of 
UDFs.**
+* **If you extend or modify the public API, please adjust the corresponding type 
hints. See `Contributing and Maintaining Type Hints`_ for details.**
+
+If you are fixing the pandas API on Spark (``pyspark.pandas``) package, please 
consider the design principles below:
+
+* **Return pandas-on-Spark data structure for big data, and pandas data 
structure for small data**
+Often developers face the question whether a particular function should 
return a pandas-on-Spark DataFrame/Series, or a pandas DataFrame/Series. The 
principle is: if the returned object can be large, use a pandas-on-Spark 
DataFrame/Series. If the data is bound to be small, use a pandas 
DataFrame/Series. For example, ``DataFrame.dtypes`` returns a pandas Series, 
because the number of columns in a DataFrame is bounded and small, whereas 
``DataFrame.head()`` or ``Series.unique()`` returns a pandas-on-Spark 
DataFrame/Series, because the resulting object can be large.
+
+* **Provide discoverable APIs for common data science tasks**
+At the risk of overgeneralization, there are two API design approaches: 
the first focuses on providing APIs for common tasks; the second starts with 
abstractions, and enables users to accomplish their tasks by composing 
primitives. While the world is not black and white, pandas takes more of the 
former approach, while Spark has taken more of the latter.
+
+One example is value count (count by some key column), one of the most 
common operations in data science. pandas ``DataFrame.value_counts`` returns 
the result in sorted order, which in 90% of the cases is what users prefer 
when exploring data. Spark's equivalent does not sort, which is more desirable 
when building data pipelines, as users can accomplish the pandas behavior by 
adding an explicit ``orderBy``, as the sketch below shows.
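
A minimal sketch of the two styles (assuming a running SparkSession; the data 
and the column name ``v`` are made up):

.. code-block:: python

    import pyspark.pandas as ps

    psser = ps.Series(["a", "b", "a", "a"])

    # pandas-style: sorted by count, convenient for interactive exploration.
    psser.value_counts()

    # Spark-style: no implicit sort; users opt in with an explicit orderBy.
    sdf = psser.to_frame(name="v").to_spark()
    sdf.groupBy("v").count().orderBy("count", ascending=False).show()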
+
+Similar to pandas, pandas API on Spark should also lean more towards the 
former, providing discoverable APIs for common data science tasks. In most 
cases, this principle is well taken care of by simply implementing pandas' 
APIs. However, there will be circumstances in which pandas' APIs don't address 
a specific need, e.g. plotting for big data.
+
+* **Guardrails to prevent users from shooting themselves in the foot**
+Certain operations in pandas are prohibitively expensive as data scales, 
and we don't want to give users the illusion that they can rely on such 
operations in pandas API on Spark. That is to say, methods implemented in 
pandas API on Spark should be safe to perform by default on large datasets. As 
a result, the following capabilities are not implemented in pandas API on Spark:
+
+1. Capabilities that are fundamentally not parallelizable: e.g. 
imperatively looping over each element
+2. Capabilities that require materializing the entire working set in a 
single node's memory. This is why we do not implement 
`pandas.DataFrame.to_xarray 
`_.
 Another example is that the ``_repr_html_`` call caps the total number of 
records shown at a maximum of 1000, to prevent users from blowing up their 
driver node simply by typing the name of a DataFrame in a notebook. This cap 
is configurable, as the sketch below shows.
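
A quick sketch of adjusting the cap (the option name ``display.max_rows`` is 
taken from the pandas-on-Spark options mechanism and should be verified 
against the installed version):

.. code-block:: python

    import pyspark.pandas as ps

    ps.get_option("display.max_rows")       # 1000 by default
    ps.set_option("display.max_rows", 100)  # tighten the guardrail further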
+
+A few exceptions, however, exist. One common pattern with "big data 
science" is that while the initial dataset is large, the working set becomes 
smaller as the analysis goes deeper. For example, data scientists often perform 
aggregation on datasets and then want to convert the aggregated dataset to some 
local data structure. To help data scientists, we offer the following:
+
+* :func:`DataFrame.to_pandas` that returns a pandas DataFrame, koalas only

Review comment:
   ```suggestion
   * :func:`DataFrame.to_pandas` that returns a pandas DataFrame 
(pandas-on-Spark only)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



--

[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

2021-06-16 Thread GitBox


SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862108939


   **[Test build #139856 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139856/testReport)**
 for PR 31179 at commit 
[`4995113`](https://github.com/apache/spark/commit/499511384b2d75ff5b2bf59116d7e29226dc4112).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28885: [WIP][SPARK-29375][SPARK-28940][SPARK-32041][SQL] Whole plan exchange and subquery reuse

2021-06-16 Thread GitBox


SparkQA commented on pull request #28885:
URL: https://github.com/apache/spark/pull/28885#issuecomment-862109061


   **[Test build #139857 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139857/testReport)**
 for PR 28885 at commit 
[`bf29f1a`](https://github.com/apache/spark/commit/bf29f1a764927b9bf8006d8c950885f6eea24ddd).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


HyukjinKwon commented on pull request #32926:
URL: https://github.com/apache/spark/pull/32926#issuecomment-862109356


   cc @ueshin @itholic @xinrong-databricks FYI
   @rxin FYI for the changes in the design principles.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


SparkQA commented on pull request #32926:
URL: https://github.com/apache/spark/pull/32926#issuecomment-862110490


   **[Test build #139858 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139858/testReport)**
 for PR 32926 at commit 
[`74dbce4`](https://github.com/apache/spark/commit/74dbce48db1ee466b8c1363990646eb8a9258a68).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32907: [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters

2021-06-16 Thread GitBox


SparkQA commented on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-862111625


   **[Test build #139853 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139853/testReport)**
 for PR 32907 at commit 
[`e2dfbce`](https://github.com/apache/spark/commit/e2dfbceb6835c0debacb8baed7387b63c9b3ccb8).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #32907: [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-862111669


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139853/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32919: [SPARK-35378][SQL][FOLLOWUP] Restore the command execution name for DataFrameWriterV2

2021-06-16 Thread GitBox


SparkQA commented on pull request #32919:
URL: https://github.com/apache/spark/pull/32919#issuecomment-862113434


   **[Test build #139859 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139859/testReport)**
 for PR 32919 at commit 
[`adc141d`](https://github.com/apache/spark/commit/adc141d2e36104ce4d79b8ab291d16aa2e5ba0a1).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #32907: [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters

2021-06-16 Thread GitBox


SparkQA removed a comment on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-862107746


   **[Test build #139853 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139853/testReport)**
 for PR 32907 at commit 
[`e2dfbce`](https://github.com/apache/spark/commit/e2dfbceb6835c0debacb8baed7387b63c9b3ccb8).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.log

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862105275






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32882: [WIP][SPARK-35724][SQL] Support traversal pruning in extendedResolutionRules and postHocResolutionRules

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #32882:
URL: https://github.com/apache/spark/pull/32882#issuecomment-862105283


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139848/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32922: [SPARK-35774][SQL] Parse any year-month interval types in SQL

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #32922:
URL: https://github.com/apache/spark/pull/32922#issuecomment-862105279


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/44373/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32899: [SPARK-35652][SQL][3.0] joinWith on two table generated from same one

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #32899:
URL: https://github.com/apache/spark/pull/32899#issuecomment-862105274


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/44369/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32867: [SPARK-35721][PYTHON] Path level discover for python unittests

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #32867:
URL: https://github.com/apache/spark/pull/32867#issuecomment-862107926


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139854/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32914: [SPARK-35763][SS] Add a new copy method to StateStoreCustomMetric

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #32914:
URL: https://github.com/apache/spark/pull/32914#issuecomment-862105282


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/44367/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

2021-06-16 Thread GitBox


SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862114525


   **[Test build #139860 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139860/testReport)**
 for PR 31179 at commit 
[`7fdf7d0`](https://github.com/apache/spark/commit/7fdf7d0ef79d78bb015eb92cc78bc0f7df607208).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #32867: [SPARK-35721][PYTHON] Path level discover for python unittests

2021-06-16 Thread GitBox


SparkQA removed a comment on pull request #32867:
URL: https://github.com/apache/spark/pull/32867#issuecomment-862107867


   **[Test build #139854 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139854/testReport)**
 for PR 32867 at commit 
[`eb835c5`](https://github.com/apache/spark/commit/eb835c5d73b4f844cdd6dff1fd37a2317079c5e6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32907: [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-862105276






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #32883: [SPARK-35725][SQL] Support repartition expand partitions in AQE

2021-06-16 Thread GitBox


cloud-fan commented on a change in pull request #32883:
URL: https://github.com/apache/spark/pull/32883#discussion_r652414680



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ExpandShufflePartitions.scala
##
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.adaptive
+
+import org.apache.spark.sql.catalyst.plans.physical.SinglePartition
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.exchange.{EnsureRequirements, 
REPARTITION_BY_COL, REPARTITION_BY_NONE, ShuffleExchangeLike, ShuffleOrigin}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A rule to expand the shuffle partitions based on the map output statistics, 
which can
+ * avoid data skew that hurts performance.
+ *
+ * We use ADVISORY_PARTITION_SIZE_IN_BYTES to decide if a partition should be 
expanded.
+ * Let's say we have 3 maps with 3 shuffle partitions, and assume r1 has a 
data skew issue.
+ * The map side looks like:
+ *   m0:[b0, b1, b2], m1:[b0, b1, b2], m2:[b0, b1, b2]
+ * and the reduce side looks like:
+ *  (without this rule) r1[m0-b1, m1-b1, m2-b1]
+ *  /  \
+ *   r0:[m0-b0, m1-b0, m2-b0], r1:[m0-b1], r2:[m1-b1], r3:[m2-b1], r4[m0-b2, 
m1-b2, m2-b2]
+ */
+object ExpandShufflePartitions extends CustomShuffleReaderRule {
+  override def supportedShuffleOrigins: Seq[ShuffleOrigin] =
+Seq(REPARTITION_BY_COL, REPARTITION_BY_NONE)

Review comment:
   I think it only makes sense if the `repartition($"a")` is added by the 
framework to optimize table insertion, not added by users.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862115053


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139856/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

2021-06-16 Thread GitBox


SparkQA commented on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862115008


   **[Test build #139856 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139856/testReport)**
 for PR 31179 at commit 
[`4995113`](https://github.com/apache/spark/commit/499511384b2d75ff5b2bf59116d7e29226dc4112).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862115053


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139856/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #31179: [SPARK-34113][SQL] Use metric data update metadata statistic's size and rowCount

2021-06-16 Thread GitBox


SparkQA removed a comment on pull request #31179:
URL: https://github.com/apache/spark/pull/31179#issuecomment-862108939


   **[Test build #139856 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139856/testReport)**
 for PR 31179 at commit 
[`4995113`](https://github.com/apache/spark/commit/499511384b2d75ff5b2bf59116d7e29226dc4112).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #32883: [SPARK-35725][SQL] Support repartition expand partitions in AQE

2021-06-16 Thread GitBox


cloud-fan commented on a change in pull request #32883:
URL: https://github.com/apache/spark/pull/32883#discussion_r652413457



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ExpandShufflePartitions.scala
##
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.adaptive
+
+import org.apache.spark.sql.catalyst.plans.physical.SinglePartition
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.exchange.{EnsureRequirements, 
REPARTITION_BY_COL, REPARTITION_BY_NONE, ShuffleExchangeLike, ShuffleOrigin}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A rule to expand the shuffle partitions based on the map output statistics, 
which can
+ * avoid data skew that hurts performance.
+ *
+ * We use ADVISORY_PARTITION_SIZE_IN_BYTES to decide if a partition should be 
expanded.
+ * Let's say we have 3 maps with 3 shuffle partitions, and assume r1 has a 
data skew issue.
+ * The map side looks like:
+ *   m0:[b0, b1, b2], m1:[b0, b1, b2], m2:[b0, b1, b2]
+ * and the reduce side looks like:
+ *  (without this rule) r1[m0-b1, m1-b1, m2-b1]
+ *  /  \
+ *   r0:[m0-b0, m1-b0, m2-b0], r1:[m0-b1], r2:[m1-b1], r3:[m2-b1], r4[m0-b2, 
m1-b2, m2-b2]
+ */
+object ExpandShufflePartitions extends CustomShuffleReaderRule {
+  override def supportedShuffleOrigins: Seq[ShuffleOrigin] =
+Seq(REPARTITION_BY_COL, REPARTITION_BY_NONE)

Review comment:
   Why can we accept `REPARTITION_BY_COL`? If people do 
`df.repartition($"a")`, we should make sure the output is hash-partitioned by 
column `a`, shouldn't we? Even if it's the last operator, as in 
`df.repartition($"a").collect`.
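   
   To spell out the concern, a minimal Scala sketch (`df` stands for any 
DataFrame with a column `a`):
   
   ```scala
   // A user-specified repartition promises hash partitioning on "a",
   // even when the shuffle is the last step before an action.
   val out = df.repartition($"a")
   out.collect()  // expanding partitions here would break that promise
   ```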




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #32883: [SPARK-35725][SQL] Support repartition expand partitions in AQE

2021-06-16 Thread GitBox


cloud-fan commented on a change in pull request #32883:
URL: https://github.com/apache/spark/pull/32883#discussion_r652413457



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ExpandShufflePartitions.scala
##
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.adaptive
+
+import org.apache.spark.sql.catalyst.plans.physical.SinglePartition
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.exchange.{EnsureRequirements, 
REPARTITION_BY_COL, REPARTITION_BY_NONE, ShuffleExchangeLike, ShuffleOrigin}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A rule to expand the shuffle partitions based on the map output statistics, 
which can
+ * avoid data skew that hurts performance.
+ *
+ * We use ADVISORY_PARTITION_SIZE_IN_BYTES to decide if a partition should be 
expanded.
+ * Let's say we have 3 maps with 3 shuffle partitions, and assume r1 has a 
data skew issue.
+ * The map side looks like:
+ *   m0:[b0, b1, b2], m1:[b0, b1, b2], m2:[b0, b1, b2]
+ * and the reduce side looks like:
+ *  (without this rule) r1[m0-b1, m1-b1, m2-b1]
+ *  /  \
+ *   r0:[m0-b0, m1-b0, m2-b0], r1:[m0-b1], r2:[m1-b1], r3:[m2-b1], r4[m0-b2, 
m1-b2, m2-b2]
+ */
+object ExpandShufflePartitions extends CustomShuffleReaderRule {
+  override def supportedShuffleOrigins: Seq[ShuffleOrigin] =
+Seq(REPARTITION_BY_COL, REPARTITION_BY_NONE)

Review comment:
   Why can we accept `REPARTITION_BY_COL`? If people do 
`df.repartition($"a")`, we should make sure the output is hash-partitioned by 
column `a`, shouldn't we?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #32883: [SPARK-35725][SQL] Support repartition expand partitions in AQE

2021-06-16 Thread GitBox


cloud-fan commented on a change in pull request #32883:
URL: https://github.com/apache/spark/pull/32883#discussion_r652419451



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ExpandShufflePartitions.scala
##
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.adaptive
+
+import org.apache.spark.sql.catalyst.plans.physical.SinglePartition
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.exchange.{EnsureRequirements, 
REPARTITION_BY_COL, REPARTITION_BY_NONE, ShuffleExchangeLike, ShuffleOrigin}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A rule to expand the shuffle partitions based on the map output statistics, 
which can
+ * avoid data skew that hurts performance.
+ *
+ * We use ADVISORY_PARTITION_SIZE_IN_BYTES to decide if a partition should be 
expanded.
+ * Let's say we have 3 maps with 3 shuffle partitions, and assume r1 has a 
data skew issue.
+ * The map side looks like:
+ *   m0:[b0, b1, b2], m1:[b0, b1, b2], m2:[b0, b1, b2]
+ * and the reduce side looks like:
+ *  (without this rule) r1[m0-b1, m1-b1, m2-b1]
+ *  /  \
+ *   r0:[m0-b0, m1-b0, m2-b0], r1:[m0-b1], r2:[m1-b1], r3:[m2-b1], r4[m0-b2, 
m1-b2, m2-b2]
+ */
+object ExpandShufflePartitions extends CustomShuffleReaderRule {
+  override def supportedShuffleOrigins: Seq[ShuffleOrigin] =
+Seq(REPARTITION_BY_COL, REPARTITION_BY_NONE)

Review comment:
   This makes me think that we may need a new operator to do the repartition 
for partitioned table insertion (non-partitioned tables can use the existing 
operator, thanks to 
https://github.com/apache/spark/commit/ce1636948b1fb8cfb8cc921896dc003949da1085),
 and assign it a new shuffle origin. 
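   
   A hypothetical sketch of that direction (the origin name and wiring below 
are made up for illustration, not actual Spark API):
   
   ```scala
   // Hypothetical: a dedicated origin for framework-inserted repartitions,
   // so AQE can freely expand them without touching user-requested ones.
   case object REPARTITION_BY_TABLE_INSERTION extends ShuffleOrigin
   
   object ExpandShufflePartitions extends CustomShuffleReaderRule {
     override def supportedShuffleOrigins: Seq[ShuffleOrigin] =
       Seq(REPARTITION_BY_TABLE_INSERTION, REPARTITION_BY_NONE)
   }
   ```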




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #32883: [SPARK-35725][SQL] Support repartition expand partitions in AQE

2021-06-16 Thread GitBox


cloud-fan commented on a change in pull request #32883:
URL: https://github.com/apache/spark/pull/32883#discussion_r652419865



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ExpandShufflePartitions.scala
##
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.adaptive
+
+import org.apache.spark.sql.catalyst.plans.physical.SinglePartition
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.exchange.{EnsureRequirements, 
REPARTITION_BY_COL, REPARTITION_BY_NONE, ShuffleExchangeLike, ShuffleOrigin}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A rule to expand the shuffle partitions based on the map output statistics, 
which can
+ * avoid data skew that hurts performance.
+ *
+ * We use ADVISORY_PARTITION_SIZE_IN_BYTES to decide if a partition should be 
expanded.
+ * Let's say we have 3 maps with 3 shuffle partitions, and assume r1 has a 
data skew issue.
+ * The map side looks like:
+ *   m0:[b0, b1, b2], m1:[b0, b1, b2], m2:[b0, b1, b2]
+ * and the reduce side looks like:
+ *  (without this rule) r1[m0-b1, m1-b1, m2-b1]
+ *  /  \
+ *   r0:[m0-b0, m1-b0, m2-b0], r1:[m0-b1], r2:[m1-b1], r3:[m2-b1], r4[m0-b2, 
m1-b2, m2-b2]
+ */
+object ExpandShufflePartitions extends CustomShuffleReaderRule {
+  override def supportedShuffleOrigins: Seq[ShuffleOrigin] =
+Seq(REPARTITION_BY_COL, REPARTITION_BY_NONE)

Review comment:
   cc @wangyum 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun commented on a change in pull request #32867: [SPARK-35721][PYTHON] Path level discover for python unittests

2021-06-16 Thread GitBox


Yikun commented on a change in pull request #32867:
URL: https://github.com/apache/spark/pull/32867#discussion_r652421042



##
File path: dev/sparktestsupport/modules.py
##
@@ -16,13 +16,65 @@
 #
 
 from functools import total_ordering
+from importlib import import_module
+import inspect
 import itertools
 import os
+from pkgutil import iter_modules
 import re
+import unittest
+
+from sparktestsupport import SPARK_HOME
+
 
 all_modules = []
 
 
+def _contain_unittests_class(module_name):
+"""
+Check if the module with the given module_name has classes derived from 
unittest.TestCase.
+
+For example, it returns True for pyspark.tests.test_appsubmit, because 
SparkSubmitTests,
+defined under that module, inherits from unittest.TestCase.
+
+:param module_name: the complete name of the module to be checked.
+:return: True if the module contains unittest classes, otherwise False.
+ A ``ModuleNotFoundError`` is raised if the module is not found.
+"""
+_module = import_module(module_name)

Review comment:
   ```Python
   Traceback (most recent call last):
 File "./dev/run-tests.py", line 32, in 
   import sparktestsupport.modules as modules
 File "/home/runner/work/spark/spark/dev/sparktestsupport/modules.py", line 
425, in 
   pyspark_core = Module(
 File "/home/runner/work/spark/spark/dev/sparktestsupport/modules.py", line 
122, in __init__
   discovered_goals = _discover_python_unittests(python_test_paths)
 File "/home/runner/work/spark/spark/dev/sparktestsupport/modules.py", line 
73, in _discover_python_unittests
   if _contain_unittests_class(module.name):
 File "/home/runner/work/spark/spark/dev/sparktestsupport/modules.py", line 
46, in _contain_unittests_class
   _module = import_module(module_name)
 File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
   return _bootstrap._gcd_import(name[level:], package, level)
   ModuleNotFoundError: No module named 'pyspark'
   ```
   
   It should be changed to be path-based.
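   
   A minimal sketch of the path-based alternative (mirroring the glob approach 
used elsewhere in this PR; `SPARK_HOME` is assumed to point at the repo root):
   
   ```python
   import glob
   import os
   
   def discover_test_files(spark_home, paths):
       # Glob for test files instead of importing them, so discovery
       # works even before pyspark itself is importable.
       pyspark_path = os.path.join(spark_home, "python")
       files = []
       for path in paths:
           files.extend(glob.glob(os.path.join(pyspark_path, path, "test_*.py")))
       return files
   ```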




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


SparkQA commented on pull request #32926:
URL: https://github.com/apache/spark/pull/32926#issuecomment-862133377


   **[Test build #139858 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139858/testReport)**
 for PR 32926 at commit 
[`74dbce4`](https://github.com/apache/spark/commit/74dbce48db1ee466b8c1363990646eb8a9258a68).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


SparkQA removed a comment on pull request #32926:
URL: https://github.com/apache/spark/pull/32926#issuecomment-862110490


   **[Test build #139858 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139858/testReport)**
 for PR 32926 at commit 
[`74dbce4`](https://github.com/apache/spark/commit/74dbce48db1ee466b8c1363990646eb8a9258a68).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


SparkQA commented on pull request #32926:
URL: https://github.com/apache/spark/pull/32926#issuecomment-862133790


   **[Test build #139852 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139852/testReport)**
 for PR 32926 at commit 
[`03ccfe4`](https://github.com/apache/spark/commit/03ccfe48a403c6ade7b1d7d3dabd80c686f52f13).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


SparkQA removed a comment on pull request #32926:
URL: https://github.com/apache/spark/pull/32926#issuecomment-862107624


   **[Test build #139852 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139852/testReport)**
 for PR 32926 at commit 
[`03ccfe4`](https://github.com/apache/spark/commit/03ccfe48a403c6ade7b1d7d3dabd80c686f52f13).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32049: [SPARK-34952][SQL] Aggregate (Min/Max/Count) push down for Parquet

2021-06-16 Thread GitBox


SparkQA commented on pull request #32049:
URL: https://github.com/apache/spark/pull/32049#issuecomment-862136599


   **[Test build #139832 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139832/testReport)**
 for PR 32049 at commit 
[`a5833ef`](https://github.com/apache/spark/commit/a5833ef7f551980ef48229932d9427a9e00af444).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #32049: [SPARK-34952][SQL] Aggregate (Min/Max/Count) push down for Parquet

2021-06-16 Thread GitBox


SparkQA removed a comment on pull request #32049:
URL: https://github.com/apache/spark/pull/32049#issuecomment-862001794


   **[Test build #139832 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139832/testReport)**
 for PR 32049 at commit 
[`a5833ef`](https://github.com/apache/spark/commit/a5833ef7f551980ef48229932d9427a9e00af444).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32882: [WIP][SPARK-35724][SQL] Support traversal pruning in extendedResolutionRules and postHocResolutionRules

2021-06-16 Thread GitBox


SparkQA commented on pull request #32882:
URL: https://github.com/apache/spark/pull/32882#issuecomment-862139902


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44376/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.logStageExecutorM

2021-06-16 Thread GitBox


SparkQA commented on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862141748


   **[Test build #139842 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139842/testReport)**
 for PR 31992 at commit 
[`6c81e2d`](https://github.com/apache/spark/commit/6c81e2dd74b7f49b3ba30b8618d1a502db1246dc).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32924: [SPARK-35771][SQL] Format year-month intervals using type fields

2021-06-16 Thread GitBox


SparkQA commented on pull request #32924:
URL: https://github.com/apache/spark/pull/32924#issuecomment-862141991


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44374/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.logStageE

2021-06-16 Thread GitBox


SparkQA removed a comment on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862041476


   **[Test build #139842 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139842/testReport)**
 for PR 31992 at commit 
[`6c81e2d`](https://github.com/apache/spark/commit/6c81e2dd74b7f49b3ba30b8618d1a502db1246dc).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.logStageExecutorM

2021-06-16 Thread GitBox


SparkQA commented on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862142922


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44379/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes

2021-06-16 Thread GitBox


SparkQA commented on pull request #32031:
URL: https://github.com/apache/spark/pull/32031#issuecomment-862143106


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44380/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32907: [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters

2021-06-16 Thread GitBox


SparkQA commented on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-862143456


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44375/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32693: [SPARK-35556][SQL][TESTS] Avoid log NoSuchMethodError when running multiple Hive version related tests

2021-06-16 Thread GitBox


SparkQA commented on pull request #32693:
URL: https://github.com/apache/spark/pull/32693#issuecomment-862144019


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44377/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] kudhru commented on pull request #32907: [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters

2021-06-16 Thread GitBox


kudhru commented on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-862144672


   Could anyone tell me how to fix the failing **MiMa** tests?
   ```
   [error] spark-sketch: Failed binary compatibility check against 
org.apache.spark:spark-sketch_2.12:3.0.0! Found 1 potential problems (filtered 
1)
   [error]  * abstract method 
intersectInPlace(org.apache.spark.util.sketch.BloomFilter)org.apache.spark.util.sketch.BloomFilter
 in class org.apache.spark.util.sketch.BloomFilter is present only in current 
version
   [error]filter with: ProblemFilters.exclude[ReversedMissingMethodProblem]
   ```
   
https://github.com/kudhru/spark/runs/2836577393?check_suite_focus=true#step:9:3306
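   
   A hedged sketch of the usual fix: add the filter named by the error message 
to `project/MimaExcludes.scala` (path and entry style assumed from Spark's 
existing exclusion list):
   
   ```scala
   // Exclude the intentionally added abstract method from the binary
   // compatibility check against the previous release.
   ProblemFilters.exclude[ReversedMissingMethodProblem](
     "org.apache.spark.util.sketch.BloomFilter.intersectInPlace")
   ```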
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #32693: [SPARK-35556][SQL][TESTS] Avoid log NoSuchMethodError when running multiple Hive version related tests

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #32693:
URL: https://github.com/apache/spark/pull/32693#issuecomment-862144814


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/44377/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #32049: [SPARK-34952][SQL] Aggregate (Min/Max/Count) push down for Parquet

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #32049:
URL: https://github.com/apache/spark/pull/32049#issuecomment-862144815


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139832/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #32926:
URL: https://github.com/apache/spark/pull/32926#issuecomment-862144813






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.logStageExe

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862144812


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139842/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


SparkQA commented on pull request #32926:
URL: https://github.com/apache/spark/pull/32926#issuecomment-862145664


   **[Test build #139861 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139861/testReport)**
 for PR 32926 at commit 
[`004c6b2`](https://github.com/apache/spark/commit/004c6b205396899a240ee440ff0be79ea7618918).





[GitHub] [spark] MaxGekk commented on pull request #32924: [SPARK-35771][SQL] Format year-month intervals using type fields

2021-06-16 Thread GitBox


MaxGekk commented on pull request #32924:
URL: https://github.com/apache/spark/pull/32924#issuecomment-862148170


   +1, LGTM. Merging to master.
   Thank you, @sarutak.





[GitHub] [spark] MaxGekk closed pull request #32924: [SPARK-35771][SQL] Format year-month intervals using type fields

2021-06-16 Thread GitBox


MaxGekk closed pull request #32924:
URL: https://github.com/apache/spark/pull/32924


   





[GitHub] [spark] SparkQA commented on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.logStageExecutorM

2021-06-16 Thread GitBox


SparkQA commented on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862150590


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44378/
   





[GitHub] [spark] Yikun commented on a change in pull request #32867: [SPARK-35721][PYTHON] Path level discover for python unittests

2021-06-16 Thread GitBox


Yikun commented on a change in pull request #32867:
URL: https://github.com/apache/spark/pull/32867#discussion_r652400691



##
File path: dev/sparktestsupport/modules.py
##
@@ -19,10 +19,28 @@
 import itertools
 import os
 import re
+import glob
+
+from sparktestsupport import SPARK_HOME
 
 all_modules = []
 
 
+def _discover_python_unittests(paths):
+    if not paths:
+        return set([])
+    tests = set([])
+    pyspark_path = os.path.join(SPARK_HOME, "python")
+    for path in paths:
+        # Discover the test*.py in every path
+        files = glob.glob(os.path.join(pyspark_path, path, "test_*.py"))
+        for f in files:
+            # Convert 'pyspark_path/pyspark/tests/test_abc.py' to 'pyspark.tests.test_abc'
+            file2module = os.path.relpath(f, pyspark_path)[:-3].replace("/", ".")

Review comment:
   > I wonder if we can import pyspark and go through the sub packages manually instead of going through the files ..
   
   Yep, it's more accurate. I changed the rule from "test_* discovery" to "unittest module discovery", so it will only discover modules that inherit from unittest.TestCase. See my latest change: https://github.com/apache/spark/pull/32867/commits/9f4388a6ea8cb1f038b00a7f72ce92dd6bbb7845
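
[Editor's note] For readers following this thread, a minimal sketch of what inheritance-based discovery looks like — illustrative only, with a hypothetical helper name, not the actual patch:

```python
import importlib
import inspect
import pkgutil
import unittest

def discover_unittest_modules(package_name):
    """Walk a package and keep only modules that define at least one
    unittest.TestCase subclass, instead of matching file names.
    Note this imports each module, so import side effects will run."""
    found = set()
    package = importlib.import_module(package_name)
    for info in pkgutil.walk_packages(package.__path__, package_name + "."):
        module = importlib.import_module(info.name)
        for _, cls in inspect.getmembers(module, inspect.isclass):
            if issubclass(cls, unittest.TestCase) and cls is not unittest.TestCase:
                found.add(info.name)
                break
    return found

# e.g. discover_unittest_modules("pyspark.tests")
```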







[GitHub] [spark] AmplabJenkins removed a comment on pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #32926:
URL: https://github.com/apache/spark/pull/32926#issuecomment-862144810









[GitHub] [spark] AmplabJenkins removed a comment on pull request #32693: [SPARK-35556][SQL][TESTS] Avoid log NoSuchMethodError when running multiple Hive version related tests

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #32693:
URL: https://github.com/apache/spark/pull/32693#issuecomment-862144814


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/44377/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.log

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862144812


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139842/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32049: [SPARK-34952][SQL] Aggregate (Min/Max/Count) push down for Parquet

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #32049:
URL: https://github.com/apache/spark/pull/32049#issuecomment-862144815


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139832/
   





[GitHub] [spark] SparkQA commented on pull request #32867: [SPARK-35721][PYTHON] Path level discover for python unittests

2021-06-16 Thread GitBox


SparkQA commented on pull request #32867:
URL: https://github.com/apache/spark/pull/32867#issuecomment-862154577









[GitHub] [spark] AmplabJenkins commented on pull request #32867: [SPARK-35721][PYTHON] Path level discover for python unittests

2021-06-16 Thread GitBox


AmplabJenkins commented on pull request #32867:
URL: https://github.com/apache/spark/pull/32867#issuecomment-862154622


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139862/
   





[GitHub] [spark] zhengruifeng opened a new pull request #32927: [SPARK-35678][ML][FOLLOWUP] use softmax in NB

2021-06-16 Thread GitBox


zhengruifeng opened a new pull request #32927:
URL: https://github.com/apache/spark/pull/32927


   ### What changes were proposed in this pull request?
   Use the newly implemented softmax function in Naive Bayes.
   
   
   ### Why are the changes needed?
   To simplify the implementation.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing test suite.
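
[Editor's note] For context, "softmax" here means normalizing raw per-class log scores into probabilities. A minimal, numerically stable sketch — illustrative only, not the Spark ML implementation:

```python
import numpy as np

def softmax(log_scores):
    # Subtract the max before exponentiating so very negative
    # log scores do not underflow to zero.
    shifted = log_scores - np.max(log_scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

# e.g. raw Naive Bayes class scores -> posterior probabilities
print(softmax(np.array([-10.2, -11.5, -9.8])))  # sums to 1.0
```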
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32867: [SPARK-35721][PYTHON] Path level discover for python unittests

2021-06-16 Thread GitBox


AmplabJenkins removed a comment on pull request #32867:
URL: https://github.com/apache/spark/pull/32867#issuecomment-862154622


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139862/
   





[GitHub] [spark] ulysses-you commented on a change in pull request #32883: [SPARK-35725][SQL] Support repartition expand partitions in AQE

2021-06-16 Thread GitBox


ulysses-you commented on a change in pull request #32883:
URL: https://github.com/apache/spark/pull/32883#discussion_r652459149



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ExpandShufflePartitions.scala
##
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.adaptive
+
+import org.apache.spark.sql.catalyst.plans.physical.SinglePartition
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.exchange.{EnsureRequirements, REPARTITION_BY_COL, REPARTITION_BY_NONE, ShuffleExchangeLike, ShuffleOrigin}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A rule to expand the shuffle partitions based on the map output statistics, which can
+ * avoid data skew that hurts performance.
+ *
+ * We use the ADVISORY_PARTITION_SIZE_IN_BYTES size to decide if a partition should be expanded.
+ * Let's say we have 3 maps with 3 shuffle partitions, and assuming r1 has a data skew issue,
+ * the map side looks like:
+ *   m0:[b0, b1, b2], m1:[b0, b1, b2], m2:[b0, b1, b2]
+ * and the reduce side looks like:
+ *  (without this rule) r1[m0-b1, m1-b1, m2-b1]
+ *  /  \
+ *   r0:[m0-b0, m1-b0, m2-b0], r1:[m0-b1], r2:[m1-b1], r3:[m2-b1], r4[m0-b2, m1-b2, m2-b2]
+ */
+object ExpandShufflePartitions extends CustomShuffleReaderRule {
+  override def supportedShuffleOrigins: Seq[ShuffleOrigin] =
+Seq(REPARTITION_BY_COL, REPARTITION_BY_NONE)

Review comment:
   > If people do df.repartition($"a"), we should make sure the output is hash partitioned by column a, isn't it
   
   Yea, we should guarantee this.
   
   > I think it only makes sense if the repartition($"a") is added by the framework to optimize table insertion, not added by users.
   
   Yea, we can use a new operator and shuffle origin to distinguish whether it is added by the user or by the framework, and then only optimize the operator added by the framework.
   
   The original idea of this PR was to add a config that lets the user decide whether `repartition($"a")` can expand partitions, which breaks the semantics, so that it can be used easily in SQL queries. After some thought, maybe it's better to add a new hint that supports expanding partitions?
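
[Editor's note] The size-based splitting described in the rule's doc comment can be sketched roughly as follows — plain Python, illustrative only, not the actual AQE implementation:

```python
def expand_partition(map_block_sizes, advisory_size):
    """Group one skewed reducer's per-map blocks so each group's total
    size stays under the advisory partition size, e.g. turning
    r1:[m0-b1, m1-b1, m2-b1] into r1:[m0-b1], r2:[m1-b1], r3:[m2-b1]."""
    groups, current, current_size = [], [], 0
    for map_id, size in enumerate(map_block_sizes):
        # Close the current group once adding this block would exceed the target.
        if current and current_size + size > advisory_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(map_id)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Three 64MB blocks against a 64MB advisory size -> three reader partitions
print(expand_partition([64 << 20, 64 << 20, 64 << 20], 64 << 20))  # [[0], [1], [2]]
```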







[GitHub] [spark] SparkQA removed a comment on pull request #32867: [SPARK-35721][PYTHON] Path level discover for python unittests

2021-06-16 Thread GitBox


SparkQA removed a comment on pull request #32867:
URL: https://github.com/apache/spark/pull/32867#issuecomment-862154577


   **[Test build #139862 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139862/testReport)** for PR 32867 at commit [`9f4388a`](https://github.com/apache/spark/commit/9f4388a6ea8cb1f038b00a7f72ce92dd6bbb7845).





[GitHub] [spark] SparkQA commented on pull request #32927: [SPARK-35678][ML][FOLLOWUP] use softmax in NB

2021-06-16 Thread GitBox


SparkQA commented on pull request #32927:
URL: https://github.com/apache/spark/pull/32927#issuecomment-862157119


   **[Test build #139863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139863/testReport)** for PR 32927 at commit [`cbcb584`](https://github.com/apache/spark/commit/cbcb5845300b578c69a98752a60a87d57869cb2e).





[GitHub] [spark] SparkQA commented on pull request #32693: [SPARK-35556][SQL][TESTS] Avoid log NoSuchMethodError when running multiple Hive version related tests

2021-06-16 Thread GitBox


SparkQA commented on pull request #32693:
URL: https://github.com/apache/spark/pull/32693#issuecomment-862158728


   **[Test build #139849 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139849/testReport)** for PR 32693 at commit [`01579a5`](https://github.com/apache/spark/commit/01579a5bc14a0777ecc2b0cd6c01bdae61709f02).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class MakeYMInterval(years: Expression, months: Expression)`
  * `case class DayTimeIntervalType(startField: Byte, endField: Byte) extends AtomicType`
  * `case class YearMonthIntervalType(startField: Byte, endField: Byte) extends AtomicType`





[GitHub] [spark] SparkQA removed a comment on pull request #32693: [SPARK-35556][SQL][TESTS] Avoid log NoSuchMethodError when running multiple Hive version related tests

2021-06-16 Thread GitBox


SparkQA removed a comment on pull request #32693:
URL: https://github.com/apache/spark/pull/32693#issuecomment-862070391


   **[Test build #139849 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139849/testReport)** for PR 32693 at commit [`01579a5`](https://github.com/apache/spark/commit/01579a5bc14a0777ecc2b0cd6c01bdae61709f02).





[GitHub] [spark] SparkQA commented on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.logStageExecutorM

2021-06-16 Thread GitBox


SparkQA commented on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862158957


   **[Test build #139850 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139850/testReport)** for PR 31992 at commit [`87a079e`](https://github.com/apache/spark/commit/87a079e5fa159dad343adcf8ac3f158ff4870f6b).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA removed a comment on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.logStageE

2021-06-16 Thread GitBox


SparkQA removed a comment on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862073263


   **[Test build #139850 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139850/testReport)** for PR 31992 at commit [`87a079e`](https://github.com/apache/spark/commit/87a079e5fa159dad343adcf8ac3f158ff4870f6b).





[GitHub] [spark] q2w commented on pull request #32902: [SPARK-35754][CORE] Add config to put migrating blocks on disk only

2021-06-16 Thread GitBox


q2w commented on pull request #32902:
URL: https://github.com/apache/spark/pull/32902#issuecomment-862163632


   @Ngone51 Please have a look.





[GitHub] [spark] SparkQA commented on pull request #32882: [WIP][SPARK-35724][SQL] Support traversal pruning in extendedResolutionRules and postHocResolutionRules

2021-06-16 Thread GitBox


SparkQA commented on pull request #32882:
URL: https://github.com/apache/spark/pull/32882#issuecomment-862163712


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44376/
   





[GitHub] [spark] SparkQA commented on pull request #32924: [SPARK-35771][SQL] Format year-month intervals using type fields

2021-06-16 Thread GitBox


SparkQA commented on pull request #32924:
URL: https://github.com/apache/spark/pull/32924#issuecomment-862164583


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44374/
   





[GitHub] [spark] SparkQA commented on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.logStageExecutorM

2021-06-16 Thread GitBox


SparkQA commented on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862166549


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44379/
   





[GitHub] [spark] AngersZhuuuu commented on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.logStageExec

2021-06-16 Thread GitBox


AngersZhuuuu commented on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862166587


   retest this please





[GitHub] [spark] SparkQA commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes

2021-06-16 Thread GitBox


SparkQA commented on pull request #32031:
URL: https://github.com/apache/spark/pull/32031#issuecomment-862167911


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44380/
   





[GitHub] [spark] SparkQA commented on pull request #32907: [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters

2021-06-16 Thread GitBox


SparkQA commented on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-862170065


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44375/
   





[GitHub] [spark] SparkQA commented on pull request #31992: [SPARK-34898][CORE] We should log SparkListenerExecutorMetricsUpdateEvent of `driver` appropriately when `spark.eventLog.logStageExecutorM

2021-06-16 Thread GitBox


SparkQA commented on pull request #31992:
URL: https://github.com/apache/spark/pull/31992#issuecomment-862173600


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44378/
   





[GitHub] [spark] SparkQA commented on pull request #32787: [SPARK-35618][SQL] Resolve star expressions in subqueries using outer query plans

2021-06-16 Thread GitBox


SparkQA commented on pull request #32787:
URL: https://github.com/apache/spark/pull/32787#issuecomment-862174258


   **[Test build #139855 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139855/testReport)** for PR 32787 at commit [`e0460c5`](https://github.com/apache/spark/commit/e0460c52c8d0d99eb06a618228ff7cd51b9c97ab).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] cloud-fan commented on a change in pull request #32883: [SPARK-35725][SQL] Support repartition expand partitions in AQE

2021-06-16 Thread GitBox


cloud-fan commented on a change in pull request #32883:
URL: https://github.com/apache/spark/pull/32883#discussion_r652481286



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ExpandShufflePartitions.scala
##
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.adaptive
+
+import org.apache.spark.sql.catalyst.plans.physical.SinglePartition
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.exchange.{EnsureRequirements, REPARTITION_BY_COL, REPARTITION_BY_NONE, ShuffleExchangeLike, ShuffleOrigin}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A rule to expand the shuffle partitions based on the map output statistics, which can
+ * avoid data skew that hurts performance.
+ *
+ * We use the ADVISORY_PARTITION_SIZE_IN_BYTES size to decide if a partition should be expanded.
+ * Let's say we have 3 maps with 3 shuffle partitions, and assuming r1 has a data skew issue,
+ * the map side looks like:
+ *   m0:[b0, b1, b2], m1:[b0, b1, b2], m2:[b0, b1, b2]
+ * and the reduce side looks like:
+ *  (without this rule) r1[m0-b1, m1-b1, m2-b1]
+ *  /  \
+ *   r0:[m0-b0, m1-b0, m2-b0], r1:[m0-b1], r2:[m1-b1], r3:[m2-b1], r4[m0-b2, m1-b2, m2-b2]
+ */
+object ExpandShufflePartitions extends CustomShuffleReaderRule {
+  override def supportedShuffleOrigins: Seq[ShuffleOrigin] =
+Seq(REPARTITION_BY_COL, REPARTITION_BY_NONE)

Review comment:
   We can start with the new operator first, and think of the user-facing API later. Maybe we don't need a user-facing API and the new operator can only be used by the table insertion optimizer rule.






