[jira] [Commented] (SPARK-36858) Spark API to apply same function to multiple columns
[ https://issues.apache.org/jira/browse/SPARK-36858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17428596#comment-17428596 ] Armand BERGES commented on SPARK-36858: ---

Honestly, I feel a little dumb not to have thought of it earlier... I fixed our implementation and it is much better! :) To be clear, our method looks like this:

{code:java}
def withColumns(df: DataFrame,
                cols: Seq[String],
                columnTransform: String => Column,
                nameTransform: String => String = identity): DataFrame = {
  // See https://issues.apache.org/jira/browse/SPARK-36858
  var result = df
  cols.foreach { colName =>
    result = result.withColumn(nameTransform(colName), columnTransform(colName))
  }
  result
}
{code}

I think the method signature could easily be improved, and we could discuss it. Based on your comment, this ticket could probably be changed into "add a note to some tutorial" so that newcomers don't fall into the trap I mention. Of course, if Spark implemented this method with a nice API, the trap would be even easier to avoid :)

> Spark API to apply same function to multiple columns
>
> Key: SPARK-36858
> URL: https://issues.apache.org/jira/browse/SPARK-36858
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 2.4.8, 3.1.2
> Reporter: Armand BERGES
> Priority: Minor
>
> Hi,
> My team and I regularly need to apply the same function to multiple columns at once.
> For example, we want to remove all non-alphanumeric characters from each column of our dataframes.
> When we first hit this use case, some people in my team were using this kind of code:
> {code:java}
> val colListToClean = ... // Generate some list, could be very long.
> val dfToClean: DataFrame = ... // This is the dataframe we want to clean.
> def cleanFunction(colName: String): Column = ... // Some function that manipulates a column based on its name.
> val dfCleaned = colListToClean.foldLeft(dfToClean)((df, colName) =>
>   df.withColumn(colName, cleanFunction(colName)))
> {code}
> When applied to a large set of columns, this kind of code overloaded our driver (because a new DataFrame is generated for each column to clean).
> Based on this issue, we developed some code to add two functions:
> * One to apply the same function to multiple columns
> * One to rename multiple columns based on a Map.
>
> I wonder if you ever asked your team to add this kind of API? If you did, did you run into any implementation issues? If you didn't, is this an idea you could add to Spark?
>
> Best regards,
>
> LvffY

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
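The foldLeft pattern in the helper above can be exercised without a Spark cluster. Here is a plain-Scala sketch of the same idea, where a `Map[String, String]` stands in for a DataFrame (keys are column names, values are cell contents) and the transform works on values rather than `Column` expressions; all names here are illustrative, not part of any Spark API:

```scala
object FoldLeftSketch {
  type Frame = Map[String, String]

  // Analogue of the withColumns helper discussed above: thread the
  // accumulator through foldLeft instead of mutating a var.
  def withColumns(frame: Frame,
                  cols: Seq[String],
                  columnTransform: String => String,
                  nameTransform: String => String = identity): Frame =
    cols.foldLeft(frame) { (acc, colName) =>
      acc + (nameTransform(colName) -> columnTransform(acc(colName)))
    }

  def main(args: Array[String]): Unit = {
    val frame: Frame = Map("city" -> "Pa!ris", "country" -> "Fr@ance")
    // Strip non-alphanumeric characters from every "column".
    val cleaned = withColumns(frame, Seq("city", "country"),
      _.replaceAll("[^a-zA-Z0-9]", ""))
    println(cleaned("city"))     // Paris
    println(cleaned("country"))  // France
  }
}
```

The accumulator threading is the whole trick: each step receives the frame produced by the previous step, so no `var` is needed.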
[jira] [Commented] (SPARK-36858) Spark API to apply same function to multiple columns
[ https://issues.apache.org/jira/browse/SPARK-36858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425297#comment-17425297 ] Hyukjin Kwon commented on SPARK-36858: ---

You could use a var, e.g.:

{code}
var df = ...
colListToClean.foreach { c =>
  df = df.withColumn(c, func(...))
}
{code}

Actually, what you did with foldLeft makes sense too. What API do you have in mind for this?
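Both spellings — the `var` + `foreach` loop suggested here and the reporter's `foldLeft` — perform the same left-to-right accumulation. A plain-Scala sketch (a `String` accumulator stands in for the growing DataFrame plan; names are illustrative):

```scala
object LoopVsFold {
  // The var suggestion: mutate a local accumulator inside foreach.
  def viaForeach[A, B](init: B, xs: Seq[A])(f: (B, A) => B): B = {
    var acc = init
    xs.foreach(x => acc = f(acc, x))
    acc
  }

  // The reporter's variant: the same accumulation via foldLeft.
  def viaFoldLeft[A, B](init: B, xs: Seq[A])(f: (B, A) => B): B =
    xs.foldLeft(init)(f)

  def main(args: Array[String]): Unit = {
    val cols = Seq("a", "b", "c")
    // Each step wraps the "plan" once, mimicking df.withColumn.
    val step = (plan: String, col: String) => s"withColumn($col, $plan)"
    println(viaForeach("df", cols)(step))
    println(viaForeach("df", cols)(step) == viaFoldLeft("df", cols)(step))  // true
  }
}
```

Note that both variants add one layer to the accumulator per column, which is why the plan-growth concern raised later in the thread applies equally to either spelling.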
[jira] [Commented] (SPARK-36858) Spark API to apply same function to multiple columns
[ https://issues.apache.org/jira/browse/SPARK-36858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17424988#comment-17424988 ] Armand BERGES commented on SPARK-36858: ---

[~hyukjin.kwon] How would you do this? From my point of view, calling `df.withColumn` in a for loop ends up building the same execution plan (and so probably hits the same problem in the end, no?)
[jira] [Commented] (SPARK-36858) Spark API to apply same function to multiple columns
[ https://issues.apache.org/jira/browse/SPARK-36858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421935#comment-17421935 ] Hyukjin Kwon commented on SPARK-36858: ---

Can't we simply do this in a for loop?