No, you don't need 30 dataframes and self-joins. Convert the list of columns to a list of corr expressions, then pass that list to the agg function.
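A minimal sketch of that approach, assuming a DataFrame with the groupid / datacolN / corr_col layout described in the thread below (the toy data and the corr_* alias names are illustrative, not from the thread):

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.corr

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy frame with the column layout from the original question.
val df = Seq(
  ("00001", 1, 2, 5), ("00001", 2, 3, 6),
  ("00002", 4, 2, 7), ("00002", 8, 9, 2)
).toDF("groupid", "datacol1", "datacol2", "corr_col")

// Convert the list of column names into a list of corr() expressions...
val dataCols = Seq("datacol1", "datacol2") // extend to datacol30 as needed
val corrCols: Seq[Column] = dataCols.map(c => corr(c, "corr_col").as(s"corr_$c"))

// ...and splat it into agg, whose signature is agg(Column, Column*).
val result = df.groupBy("groupid").agg(corrCols.head, corrCols.tail: _*)
```

The `corrCols.head, corrCols.tail: _*` split is needed because `agg` requires at least one Column before the varargs.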
From: "ckgppl_...@sina.cn" <ckgppl_...@sina.cn>
Reply-To: "ckgppl_...@sina.cn" <ckgppl_...@sina.cn>
Date: Wednesday, March 16, 2022 at 8:16 AM
To: Enrico Minack <i...@enrico.minack.dev>, Sean Owen <sro...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: [EXTERNAL] Re: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

Thanks, Enrico. I just found that I need to group the data frame and then calculate the correlation, so I get a list of dataframes, not columns. So I used the following solution:

1. Use the following code to create a mutable data frame df_all, using the first datacol to calculate the correlation:
   df.groupby("groupid").agg(functions.corr("datacol1", "corr_col"))
2. Iterate over all remaining datacol columns, creating a temp data frame in each iteration. In each iteration, join df_all with the temp data frame on the groupid column, then drop the duplicated groupid column.
3. After the iteration, I get the dataframe which contains all the correlation data. I need to verify the data to make sure it is valid.

Liang

----- Original Message -----
From: Enrico Minack <i...@enrico.minack.dev>
To: ckgppl_...@sina.cn, Sean Owen <sro...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 19:53

If you have a list of Columns called `columns`, you can pass them to the `agg` method as:

agg(columns.head, columns.tail: _*)

Enrico

On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:

Thanks, Sean. I modified the code and have generated a list of columns. I am now working on converting the list of columns into a new data frame. It seems that there is no direct API to do this.
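Liang's three steps above can be sketched roughly as follows, assuming the df layout from the original question (toy data and corr_* aliases are illustrative). Joining on `Seq("groupid")` keeps a single groupid column, so there is no duplicate to drop afterwards:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.corr

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("00001", 1, 2, 5), ("00001", 2, 3, 6),
  ("00002", 4, 2, 7), ("00002", 8, 9, 2)
).toDF("groupid", "datacol1", "datacol2", "corr_col")

// Step 1: seed df_all with the correlation for the first data column.
var dfAll: DataFrame = df.groupBy("groupid")
  .agg(corr("datacol1", "corr_col").as("corr_datacol1"))

// Step 2: for each remaining data column, compute its per-group correlation
// in a temp frame and join it back on groupid. The Seq-based join keeps a
// single groupid column.
for (c <- Seq("datacol2")) { // extend to all remaining columns
  val tmp = df.groupBy("groupid").agg(corr(c, "corr_col").as(s"corr_$c"))
  dfAll = dfAll.join(tmp, Seq("groupid"))
}

// Step 3: dfAll now has one row per groupid with all correlation columns.
```

Note that each iteration adds a join to the plan, which is why the varargs-agg approach at the top of the thread is preferable for many columns.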
----- Original Message -----
From: Sean Owen <sro...@gmail.com>
To: ckgppl_...@sina.cn
Cc: user <user@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, adding a new corr column to a list each time.

On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:

Hi all,

I am stuck on a correlation calculation problem. I have a dataframe like below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_col
00001    1         2         3         4         5
00001    2         3         4         6         5
00002    4         2         1         7         5
00002    8         9         3         2         5
00003    7         1         2         3         5
00003    3         5         3         1         5

I want to calculate the correlation between each datacol column and the corr_col column for each groupid, so I used the following Spark Scala API code:

df.groupby("groupid").agg(functions.corr("datacol1", "corr_col"), functions.corr("datacol2", "corr_col"), functions.corr("datacol3", "corr_col"), functions.corr("datacol*", "corr_col"))

This is very inefficient: if I have 30 datacol columns, I need to write functions.corr 30 times. From what I have searched, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter. Is there any Spark Scala API that can do this job efficiently?

Thanks
Liang
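Sean's loop suggestion can be sketched as follows: build the corr expressions in a loop over the column names, then make a single agg call (toy data and corr_* aliases are illustrative, not from the thread):

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.corr
import scala.collection.mutable.ListBuffer

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("00001", 1, 2, 3, 5), ("00001", 2, 3, 4, 6),
  ("00002", 4, 2, 1, 7), ("00002", 8, 9, 3, 2)
).toDF("groupid", "datacol1", "datacol2", "datacol3", "corr_col")

// Loop over the data columns once, adding a new corr column to a list each
// time, instead of spelling out functions.corr 30 times.
val corrExprs = ListBuffer[Column]()
for (c <- Seq("datacol1", "datacol2", "datacol3")) {
  corrExprs += corr(c, "corr_col").as(s"corr_$c")
}

// A single agg call then produces all correlations per groupid.
val result = df.groupBy("groupid").agg(corrExprs.head, corrExprs.tail: _*)
```

So while functions.corr itself takes only a single column pair, nothing stops you from assembling as many corr expressions as you like before calling agg once.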