Hi all,

*Background:*

Currently, there is a withColumns
<https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402>[1]
method that helps users/devs add or replace multiple columns at once.
However, this method is private and not exposed as a public API, which
means users cannot call it directly, and it is not supported in the
PySpark API either.

As a DataFrame user, I can only call withColumn() multiple times:

df.withColumn("key1", col("key1"))
  .withColumn("key2", col("key2"))
  .withColumn("key3", col("key3"))

rather than:

df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), col("key3")])

Multiple calls add cost in both developer experience and performance.
This is especially true in PySpark, where each withColumn call incurs a
separate py4j round trip.
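
For context, PySpark's withColumn is essentially a thin wrapper that
forwards to the JVM-side Dataset.withColumn through py4j (simplified
from pyspark/sql/dataframe.py), so chaining N calls means N round trips:

def withColumn(self, colName, col):
    # Each call crosses the py4j bridge once to reach the JVM Dataset.
    assert isinstance(col, Column), "col should be Column"
    return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)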

As @Hyukjin mentioned
<https://github.com/apache/spark/pull/32276#issuecomment-824461143>,
there were some previous discussions in SPARK-12225
<https://issues.apache.org/jira/browse/SPARK-12225> [2].

[1]
https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
[2] https://issues.apache.org/jira/browse/SPARK-12225

*Potential solutions:*
There appear to be two potential solutions if we want to support this:

1. Introduce a *withColumns* API for Scala/Python.
A separate public withColumns API would be added to the Scala and
Python APIs; see the sketch after this list.

2. Make withColumn accept either a *single col* or a *list of cols*.
I did an experimental PySpark implementation of this in
https://github.com/apache/spark/pull/32276
However, as Maciej pointed out
<https://github.com/apache/spark/pull/32276#pullrequestreview-641280217>,
overloading withColumn this way introduces some naming confusion.
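
For illustration, here is a rough sketch of what option 1 could look
like from the user's side. The helper below only emulates the semantics
by folding over the existing withColumn (the name with_columns and the
parallel-list signature are assumptions for this example, not a settled
design); a real implementation would forward the whole batch to the JVM
in a single py4j call, mirroring the private Scala
withColumns(colNames, cols):

from pyspark.sql import Column, DataFrame
from pyspark.sql.functions import col

def with_columns(df: DataFrame, col_names, cols) -> DataFrame:
    # Emulation only: adds/replaces several columns in one call from the
    # caller's point of view, though it still chains withColumn underneath.
    assert len(col_names) == len(cols), \
        "col_names and cols must have the same length"
    assert all(isinstance(c, Column) for c in cols), \
        "cols should all be Columns"
    for name, c in zip(col_names, cols):
        df = df.withColumn(name, c)
    return df

# Usage, matching the call shape proposed above:
df2 = with_columns(df, ["key1", "key2", "key3"],
                   [col("key1"), col("key2"), col("key3")])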


Thanks for reading; feel free to reply if you have any other concerns
or suggestions!


Regards,
Yikun
