[GitHub] [spark] HyukjinKwon commented on a change in pull request #32835: [SPARK-35591][PYTHON][DOCS] Rename "Koalas" to "pandas API on Spark" in the documents

GitBox Wed, 09 Jun 2021 18:46:45 -0700


HyukjinKwon commented on a change in pull request #32835:
URL: https://github.com/apache/spark/pull/32835#discussion_r648791956




##########
File path: python/docs/source/development/ps_design.rst
##########
@@ -46,40 +46,40 @@ At the risk of overgeneralization, there are two API design approaches: the firs
 
 One example is value count (count by some key column), one of the most common 
operations in data science. pandas `DataFrame.value_count` returns the result 
in sorted order, which in 90% of the cases is what users prefer when exploring 
data, whereas Spark's does not sort, which is more desirable when building data 
pipelines, as users can accomplish the pandas behavior by adding an explicit 
`orderBy`.
 
-Similar to pandas, Koalas should also lean more towards the former, providing 
discoverable APIs for common data science tasks. In most cases, this principle 
is well taken care of by simply implementing pandas' APIs. However, there will 
be circumstances in which pandas' APIs don't address a specific need, e.g. 
plotting for big data.
+Similar to pandas, pandas APIs on Spark should also lean more towards the 
former, providing discoverable APIs for common data science tasks. In most 
cases, this principle is well taken care of by simply implementing pandas' 
APIs. However, there will be circumstances in which pandas' APIs don't address 
a specific need, e.g. plotting for big data.
 
 Provide well documented APIs, with examples
 -------------------------------------------
 
 All functions and parameters should be documented. Most functions should be 
documented with examples, because those are the easiest to understand than a 
blob of text explaining what the function does.
 
-A recommended way to add documentation is to start with the docstring of the 
corresponding function in PySpark or pandas, and adapt it for Koalas. If you 
are adding a new function, also add it to the API reference doc index page in 
`docs/source/reference` directory. The examples in docstring also improve our 
test coverage.
+A recommended way to add documentation is to start with the docstring of the 
corresponding function in PySpark or pandas, and adapt it for pandas APIs on 
Spark. If you are adding a new function, also add it to the API reference doc 
index page in `docs/source/reference` directory. The examples in docstring also 
improve our test coverage.
 
 Guardrails to prevent users from shooting themselves in the foot
 ----------------------------------------------------------------
 
-Certain operations in pandas are prohibitively expensive as data scales, and 
we don't want to give users the illusion that they can rely on such operations 
in Koalas. That is to say, methods implemented in Koalas should be safe to 
perform by default on large datasets. As a result, the following capabilities 
are not implemented in Koalas:
+Certain operations in pandas are prohibitively expensive as data scales, and 
we don't want to give users the illusion that they can rely on such operations 
in pandas APIs on Spark. That is to say, methods implemented in pandas APIs on 
Spark should be safe to perform by default on large datasets. As a result, the 
following capabilities are not implemented in pandas APIs on Spark:
 
 1. Capabilities that are fundamentally not parallelizable: e.g. imperatively 
looping over each element
 2. Capabilities that require materializing the entire working set in a single 
node's memory. This is why we do not implement `pandas.DataFrame.to_xarray 
<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_xarray.html>`_.
 Another example is the `_repr_html_` call caps the total number of records 
shown to a maximum of 1000, to prevent users from blowing up their driver node 
simply by typing the name of the DataFrame in a notebook.
 
 A few exceptions, however, exist. One common pattern with "big data science" 
is that while the initial dataset is large, the working set becomes smaller as 
the analysis goes deeper. For example, data scientists often perform 
aggregation on datasets and want to then convert the aggregated dataset to some 
local data structure. To help data scientists, we offer the following:
 
 - :func:`DataFrame.to_pandas`: returns a pandas DataFrame, koalas only
-- :func:`DataFrame.to_numpy`: returns a numpy array, works with both pandas 
and Koalas
+- :func:`DataFrame.to_numpy`: returns a numpy array, works with both pandas 
and pandas APIs on Spark
 
 Note that it is clear from the names that these functions return some local 
data structure that would require materializing data in a single node's memory. 
For these functions, we also explicitly document them with a warning note that 
the resulting data structure must be small.
 
 Be a lean API layer and move fast
 ---------------------------------
 
-Koalas is designed as an API overlay layer on top of Spark. The project should 
be lightweight, and most functions should be implemented as wrappers
-around Spark or pandas - the Koalas library is designed to be used only in the 
Spark's driver side in general.
-Koalas does not accept heavyweight implementations, e.g. execution engine 
changes.
+Pandas APIs on Spark is designed as an API overlay layer on top of Spark. The 
project should be lightweight, and most functions should be implemented as 
wrappers

Review comment:
       ```suggestion
   Pandas APIs on Spark are designed as an API overlay layer on top of Spark. 
The project should be lightweight, and most functions should be implemented as 
wrappers
   ```




