Thanks, Sean. I modified the code and generated a list of columns. I am now working on converting that list of columns into a new data frame. It seems that there is no direct API to do this.

----- Original Message -----
From: Sean Owen <sro...@gmail.com>
To: ckgppl_...@sina.cn
Cc: user <user@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55
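For the conversion question above, one common workaround (a sketch, not a dedicated API: it assumes the list is a `Seq[Column]` derived from an existing DataFrame `df`) is to project the columns with `select` and varargs:

```scala
import org.apache.spark.sql.{Column, DataFrame}

// There is no direct "list of Columns -> DataFrame" API; instead,
// splat the Seq[Column] into select to build the new DataFrame.
def columnsToDf(df: DataFrame, cols: Seq[Column]): DataFrame =
  df.select(cols: _*)
```

This works because `Dataset.select` accepts `Column*`, so any collection of `Column` expressions can be passed with the `: _*` varargs adapter.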
Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, which appends a new corr column to a list each time.

On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:

Hi all,

I am stuck on a correlation calculation problem. I have a dataframe like below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_col
00001    1         2         3         4         5
00001    2         3         4         6         5
00002    4         2         1         7         5
00002    8         9         3         2         5
00003    7         1         2         3         5
00003    3         5         3         1         5

I want to calculate the correlation between all datacol columns and the corr_col column for each groupid. So I used the following Spark Scala API code:

df.groupBy("groupid").agg(
  functions.corr("datacol1", "corr_col"),
  functions.corr("datacol2", "corr_col"),
  functions.corr("datacol3", "corr_col"),
  functions.corr("datacol*", "corr_col"))

This is very inefficient: if I have 30 datacol columns, I need to write functions.corr 30 times. From what I have found, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter. Is there any Spark Scala API that can do this job efficiently?

Thanks
Liang
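Sean's loop suggestion can be sketched as follows (a sketch under assumptions: the grouping column is named "groupid", the target column name is passed in, and `corrByGroup` and the `corr_` alias prefix are hypothetical names, not part of any Spark API). The key is that `agg` takes `(Column, Column*)`, so a list of `corr` expressions built in a loop can be splatted into a single call:

```scala
import org.apache.spark.sql.{DataFrame, functions}

// Build one corr(datacolN, corr_col) aggregate expression per data column,
// then splat the whole list into agg via varargs -- no need to write the
// functions.corr call out 30 times by hand.
def corrByGroup(df: DataFrame, corrCol: String): DataFrame = {
  val dataCols  = df.columns.filter(c => c != "groupid" && c != corrCol)
  val corrExprs = dataCols.map(c => functions.corr(c, corrCol).alias(s"corr_$c"))
  df.groupBy("groupid").agg(corrExprs.head, corrExprs.tail: _*)
}
```

With the sample data above, `corrByGroup(df, "corr_col")` would produce one row per groupid with a `corr_datacol1`, `corr_datacol2`, ... column for each data column.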