This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new d7dd59a  [SPARK-26224][SQL][PYTHON][R][FOLLOW-UP] Add notes about many projects in withColumn at SparkR and PySpark as well
d7dd59a is described below

commit d7dd59a6b4b734284b157af1191af337ceb1d256
Author: Hyukjin Kwon <gurwls...@apache.org>
AuthorDate: Wed Apr 3 08:30:24 2019 +0900

    [SPARK-26224][SQL][PYTHON][R][FOLLOW-UP] Add notes about many projects in withColumn at SparkR and PySpark as well
    
    ## What changes were proposed in this pull request?
    
    This is a followup of https://github.com/apache/spark/pull/23285. This PR adds the notes into PySpark and SparkR documentation as well.
    
    While I am here, I revised the doc a bit to make it sound more neutral.
    
    ## How was this patch tested?
    
    Manually built the doc and verified.
    
    Closes #24272 from HyukjinKwon/SPARK-26224.
    
    Authored-by: Hyukjin Kwon <gurwls...@apache.org>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 R/pkg/R/DataFrame.R                                        | 5 +++++
 python/pyspark/sql/dataframe.py                            | 5 +++++
 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala | 8 ++++----
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 014ba28..774a2b2 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -2143,6 +2143,11 @@ setMethod("selectExpr",
 #' Return a new SparkDataFrame by adding a column or replacing the existing column
 #' that has the same name.
 #'
+#' Note: This method introduces a projection internally. Therefore, calling it multiple times,
+#' for instance, via loops in order to add multiple columns can generate big plans which
+#' can cause performance issues and even \code{StackOverflowException}. To avoid this,
+#' use \code{select} with the multiple columns at once.
+#'
 #' @param x a SparkDataFrame.
 #' @param colName a column name.
 #' @param col a Column expression (which must refer only to this SparkDataFrame), or an atomic
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 58d74f5..659cbc4 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1974,6 +1974,11 @@ class DataFrame(object):
         :param colName: string, name of the new column.
         :param col: a :class:`Column` expression for the new column.
 
+        .. note:: This method introduces a projection internally. Therefore, calling it multiple
+            times, for instance, via loops in order to add multiple columns can generate big
+            plans which can cause performance issues and even `StackOverflowException`.
+            To avoid this, use :func:`select` with the multiple columns at once.
+
         >>> df.withColumn('age2', df.age + 2).collect()
         [Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
 
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 477e873..05015f8 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -2151,10 +2151,10 @@ class Dataset[T] private[sql](
    * `column`'s expression must only refer to attributes supplied by this Dataset. It is an
    * error to add a column that refers to some other Dataset.
    *
-   * Please notice that this method introduces a `Project`. This means that using it in loops in
-   * order to add several columns can generate very big plans which can cause huge performance
-   * issues and even `StackOverflowException`s. A much better alternative use `select` with the
-   * list of columns to add.
+   * @note this method introduces a projection internally. Therefore, calling it multiple times,
+   * for instance, via loops in order to add multiple columns can generate big plans which
+   * can cause performance issues and even `StackOverflowException`. To avoid this,
+   * use `select` with the multiple columns at once.
    *
    * @group untypedrel
    * @since 2.0.0
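
The note added in this commit can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: `Plan`, `with_column`, and `select_all_plus` are hypothetical names invented here, and the model assumes nothing beyond what the note states, namely that each `withColumn` call adds one projection on top of the existing plan, while a single `select` adds one projection no matter how many columns it introduces.

```python
# Toy model only: these classes and functions are hypothetical and are NOT Spark's API.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Plan:
    """A node in a simplified logical plan: a column list over an optional child."""
    columns: Tuple[str, ...]
    child: Optional["Plan"] = None

    def depth(self) -> int:
        # Walk the chain of children iteratively and count the nodes.
        node, count = self, 0
        while node is not None:
            count += 1
            node = node.child
        return count


def with_column(plan: Plan, name: str) -> Plan:
    """Models adding one column: wraps the plan in one more projection."""
    return Plan(plan.columns + (name,), child=plan)


def select_all_plus(plan: Plan, *names: str) -> Plan:
    """Models one select of all existing columns plus new ones: a single projection."""
    return Plan(plan.columns + tuple(names), child=plan)


base = Plan(("age", "name"))
new_names = [f"age{i}" for i in range(1, 101)]

# Loop style: one extra projection per added column.
looped = base
for n in new_names:
    looped = with_column(looped, n)

# Single-select style: one projection for all added columns.
at_once = select_all_plus(base, *new_names)

print(looped.depth())   # 101
print(at_once.depth())  # 2
```

Both plans in the model end up with the same columns; only the nesting depth differs. In PySpark the single-select form would typically be written as `df.select("*", *new_cols)`, where `new_cols` is a list of aliased `Column` expressions such as `(df.age + 2).alias("age2")`.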


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
