[ https://issues.apache.org/jira/browse/SPARK-36858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428596#comment-17428596 ]
Armand BERGES commented on SPARK-36858:
---------------------------------------

Honestly, I feel a little dumb for not having thought of it earlier... I fixed our implementation and it is much better! :) To be clear, our method looks like this:
{code:java}
def withColumns(df: DataFrame,
                cols: Seq[String],
                columnTransform: String => Column,
                nameTransform: String => String = identity): DataFrame = {
  // See https://issues.apache.org/jira/browse/SPARK-36858
  cols.foldLeft(df) { (acc, colName) =>
    acc.withColumn(nameTransform(colName), columnTransform(colName))
  }
}
{code}
I think the method signature could easily be improved, and we could discuss it. Based on your comment, this ticket could probably be changed into "Add a question to some tutorial" to keep newcomers from falling into the trap I mentioned. Of course, if Spark implemented this method with a nice API, it would be even easier to avoid this trap :)

> Spark API to apply same function to multiple columns
> ----------------------------------------------------
>
>                 Key: SPARK-36858
>                 URL: https://issues.apache.org/jira/browse/SPARK-36858
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.4.8, 3.1.2
>            Reporter: Armand BERGES
>            Priority: Minor
>
> Hi,
>
> My team and I regularly need to apply the same function to multiple columns at once.
> For example, we want to remove all non-alphanumeric characters from every column of our dataframes.
> When we first hit this use case, some people on my team were using this kind of code:
> {code:java}
> val colListToClean = .... // Generate some list, could be very long.
> val dfToClean: DataFrame = ... // This is the dataframe we want to clean
> def cleanFunction(colName: String): Column = ... // Write some function to manipulate a column based on its name.
> val dfCleaned = colListToClean.foldLeft(dfToClean)((df, colName) =>
>   df.withColumn(colName, cleanFunction(colName)))
> {code}
> When applied to a large set of columns, this kind of code overloaded our driver (because a new DataFrame is generated for each column to clean).
> Based on this issue, we developed some code to add two functions:
>  * One to apply the same function to multiple columns
>  * One to rename multiple columns based on a Map.
>
> I wonder if you have ever asked your team to add this kind of API? If you did, did you run into any issues with the implementation? If you didn't, is this an idea you could add to Spark?
>
> Best regards,
>
> LvffY

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
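For readers hitting the same driver overload: one way to avoid building a new logical plan per column is to rewrite every targeted column inside a single `select`. The sketch below is an assumption about one possible shape for such a helper, not the ticket's or the commenter's actual code (`withColumnsViaSelect` is a hypothetical name), and it presumes a Spark 2.4/3.x `DataFrame` is in scope:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Sketch (hypothetical helper): instead of chaining withColumn once per
// column -- each call produces a new DataFrame and logical plan, which is
// what strains the driver on wide column lists -- build one projection that
// rewrites the targeted columns and passes the rest through unchanged.
def withColumnsViaSelect(df: DataFrame,
                         cols: Seq[String],
                         columnTransform: String => Column,
                         nameTransform: String => String = identity): DataFrame = {
  val targets = cols.toSet
  val projection: Array[Column] = df.columns.map { c =>
    if (targets(c)) columnTransform(c).as(nameTransform(c)) // rewrite + rename
    else col(c)                                             // pass through as-is
  }
  df.select(projection: _*) // a single plan, regardless of how many columns
}
```

Mapping over `df.columns` preserves the original column order; a variant that only selects the transformed columns would drop the others.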