Thanks, Sean. I modified the code and generated a list of columns. I am now working on converting that list of columns into a new data frame. It seems that there is no direct API to do this.

----- Original Message -----
From: Sean Owen <sro...@gmail.com>
To: ckgppl_...@sina.cn
Cc: user <user@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55
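For the conversion question above, one common workaround (a sketch, not a dedicated API: it assumes the list is a `Seq[Column]` derived from an existing DataFrame `df`) is to project the columns with `select` and varargs:

```scala
import org.apache.spark.sql.{Column, DataFrame}

// There is no direct "list of Columns -> DataFrame" API; instead,
// splat the Seq[Column] into select to build the new DataFrame.
def columnsToDf(df: DataFrame, cols: Seq[Column]): DataFrame =
  df.select(cols: _*)
```

This works because `Dataset.select` accepts `Column*`, so any collection of `Column` expressions can be passed with the `: _*` varargs adapter.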
Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, which appends a new corr column to a list each time.

On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:

Hi all,

I am stuck on a correlation calculation problem. I have a dataframe like below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_col
00001    1         2         3         4         5
00001    2         3         4         6         5
00002    4         2         1         7         5
00002    8         9         3         2         5
00003    7         1         2         3         5
00003    3         5         3         1         5

I want to calculate the correlation between all datacol columns and the corr_col column for each groupid. So I used the following Spark Scala API code:

df.groupBy("groupid").agg(
  functions.corr("datacol1", "corr_col"),
  functions.corr("datacol2", "corr_col"),
  functions.corr("datacol3", "corr_col"),
  functions.corr("datacol*", "corr_col"))

This is very inefficient: if I have 30 datacol columns, I need to write functions.corr 30 times. From what I have found, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter. Is there any Spark Scala API that can do this job efficiently?

Thanks
Liang
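Sean's loop suggestion can be sketched as follows (a sketch under assumptions: the grouping column is named "groupid", the target column name is passed in, and `corrByGroup` and the `corr_` alias prefix are hypothetical names, not part of any Spark API). The key is that `agg` takes `(Column, Column*)`, so a list of `corr` expressions built in a loop can be splatted into a single call:

```scala
import org.apache.spark.sql.{DataFrame, functions}

// Build one corr(datacolN, corr_col) aggregate expression per data column,
// then splat the whole list into agg via varargs -- no need to write the
// functions.corr call out 30 times by hand.
def corrByGroup(df: DataFrame, corrCol: String): DataFrame = {
  val dataCols  = df.columns.filter(c => c != "groupid" && c != corrCol)
  val corrExprs = dataCols.map(c => functions.corr(c, corrCol).alias(s"corr_$c"))
  df.groupBy("groupid").agg(corrExprs.head, corrExprs.tail: _*)
}
```

With the sample data above, `corrByGroup(df, "corr_col")` would produce one row per groupid with a `corr_datacol1`, `corr_datacol2`, ... column for each data column.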