No, you don't need 30 dataframes and self-joins. Convert the list of columns to a list of corr expressions, then pass that list to the agg function.
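A minimal sketch of that approach, assuming a DataFrame with the groupid / datacolN / corr_col layout described in the thread below (the toy data and the corr_* alias names are illustrative, not from the thread):

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.corr

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy frame with the column layout from the original question.
val df = Seq(
  ("00001", 1, 2, 5), ("00001", 2, 3, 6),
  ("00002", 4, 2, 7), ("00002", 8, 9, 2)
).toDF("groupid", "datacol1", "datacol2", "corr_col")

// Convert the list of column names into a list of corr() expressions...
val dataCols = Seq("datacol1", "datacol2") // extend to datacol30 as needed
val corrCols: Seq[Column] = dataCols.map(c => corr(c, "corr_col").as(s"corr_$c"))

// ...and splat it into agg, whose signature is agg(Column, Column*).
val result = df.groupBy("groupid").agg(corrCols.head, corrCols.tail: _*)
```

The `corrCols.head, corrCols.tail: _*` split is needed because `agg` requires at least one Column before the varargs.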
From: "ckgppl_...@sina.cn" <ckgppl_...@sina.cn>
Reply-To: "ckgppl_...@sina.cn" <ckgppl_...@sina.cn>
Date: Wednesday, March 16, 2022 at 8:16 AM
To: Enrico Minack <i...@enrico.minack.dev>, Sean Owen <sro...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: [EXTERNAL] Re: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

Thanks, Enrico. I just found that I need to group the data frame and then calculate the correlation, so I get a list of dataframes, not columns. So I used the following solution:

1. Use the following code to create a mutable data frame df_all, using the first datacol to calculate the correlation:
   df.groupby("groupid").agg(functions.corr("datacol1", "corr_col"))
2. Iterate over all remaining datacol columns, creating a temp data frame in each iteration. In each iteration, join df_all with the temp data frame on the groupid column, then drop the duplicated groupid column.
3. After the iteration, I get the dataframe which contains all the correlation data. I need to verify the data to make sure it is valid.

Liang

----- Original Message -----
From: Enrico Minack <i...@enrico.minack.dev>
To: ckgppl_...@sina.cn, Sean Owen <sro...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 19:53

If you have a list of Columns called `columns`, you can pass them to the `agg` method as:

agg(columns.head, columns.tail: _*)

Enrico

On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:

Thanks, Sean. I modified the code and have generated a list of columns. I am now working on converting the list of columns into a new data frame. It seems that there is no direct API to do this.
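Liang's three steps above can be sketched roughly as follows, assuming the df layout from the original question (toy data and corr_* aliases are illustrative). Joining on `Seq("groupid")` keeps a single groupid column, so there is no duplicate to drop afterwards:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.corr

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("00001", 1, 2, 5), ("00001", 2, 3, 6),
  ("00002", 4, 2, 7), ("00002", 8, 9, 2)
).toDF("groupid", "datacol1", "datacol2", "corr_col")

// Step 1: seed df_all with the correlation for the first data column.
var dfAll: DataFrame = df.groupBy("groupid")
  .agg(corr("datacol1", "corr_col").as("corr_datacol1"))

// Step 2: for each remaining data column, compute its per-group correlation
// in a temp frame and join it back on groupid. The Seq-based join keeps a
// single groupid column.
for (c <- Seq("datacol2")) { // extend to all remaining columns
  val tmp = df.groupBy("groupid").agg(corr(c, "corr_col").as(s"corr_$c"))
  dfAll = dfAll.join(tmp, Seq("groupid"))
}

// Step 3: dfAll now has one row per groupid with all correlation columns.
```

Note that each iteration adds a join to the plan, which is why the varargs-agg approach at the top of the thread is preferable for many columns.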
----- Original Message -----
From: Sean Owen <sro...@gmail.com>
To: ckgppl_...@sina.cn
Cc: user <user@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, adding a new corr column to a list each time.

On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:

Hi all,

I am stuck on a correlation calculation problem. I have a dataframe like below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_col
00001    1         2         3         4         5
00001    2         3         4         6         5
00002    4         2         1         7         5
00002    8         9         3         2         5
00003    7         1         2         3         5
00003    3         5         3         1         5

I want to calculate the correlation between each datacol column and the corr_col column for each groupid, so I used the following Spark Scala API code:

df.groupby("groupid").agg(functions.corr("datacol1", "corr_col"), functions.corr("datacol2", "corr_col"), functions.corr("datacol3", "corr_col"), functions.corr("datacol*", "corr_col"))

This is very inefficient: if I have 30 datacol columns, I need to write functions.corr 30 times. From what I have searched, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter. Is there any Spark Scala API that can do this job efficiently?

Thanks
Liang
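Sean's loop suggestion can be sketched as follows: build the corr expressions in a loop over the column names, then make a single agg call (toy data and corr_* aliases are illustrative, not from the thread):

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.corr
import scala.collection.mutable.ListBuffer

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("00001", 1, 2, 3, 5), ("00001", 2, 3, 4, 6),
  ("00002", 4, 2, 1, 7), ("00002", 8, 9, 3, 2)
).toDF("groupid", "datacol1", "datacol2", "datacol3", "corr_col")

// Loop over the data columns once, adding a new corr column to a list each
// time, instead of spelling out functions.corr 30 times.
val corrExprs = ListBuffer[Column]()
for (c <- Seq("datacol1", "datacol2", "datacol3")) {
  corrExprs += corr(c, "corr_col").as(s"corr_$c")
}

// A single agg call then produces all correlations per groupid.
val result = df.groupBy("groupid").agg(corrExprs.head, corrExprs.tail: _*)
```

So while functions.corr itself takes only a single column pair, nothing stops you from assembling as many corr expressions as you like before calling agg once.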