Are you just trying to avoid writing the function call 30 times? If so, build the list of correlation expressions in a loop over the columns instead, adding one corr column per data column, and pass that list to a single agg call.
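A minimal sketch of that approach (assuming Spark 3.x; the `dataCols` sequence and the column names follow the question below, and would be extended to all 30 columns):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.corr

// Names of the data columns to correlate against corr_col.
val dataCols: Seq[String] = Seq("datacol1", "datacol2", "datacol3")

// One corr(...) expression per column, aliased so the output columns are readable.
val corrExprs: Seq[Column] =
  dataCols.map(c => corr(c, "corr_col").alias(s"corr_$c"))

// agg takes (Column, Column*), so split the list into head and tail.
val result = df.groupBy("groupid").agg(corrExprs.head, corrExprs.tail: _*)
```

This way the number of columns only changes the contents of `dataCols`, not the amount of code.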
On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:
> Hi all,
>
> I am stuck at a correlation calculation problem. I have a dataframe like
> below:
>
> groupid  datacol1  datacol2  datacol3  datacol*  corr_col
> 00001    1         2         3         4         5
> 00001    2         3         4         6         5
> 00002    4         2         1         7         5
> 00002    8         9         3         2         5
> 00003    7         1         2         3         5
> 00003    3         5         3         1         5
>
> I want to calculate the correlation between all datacol columns and the
> corr_col column, grouped by groupid.
> So I used the following Spark Scala API code:
>
> df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"),
>   functions.corr("datacol2","corr_col"),
>   functions.corr("datacol3","corr_col"),
>   functions.corr("datacol*","corr_col"))
>
> This is very inefficient. If I have 30 data_col columns, I need to write
> functions.corr 30 times to calculate the correlations.
>
> From what I have found, functions.corr doesn't accept a List/Array
> parameter, and df.agg doesn't accept a function as a parameter.
> Is there any Spark Scala API code that can do this job efficiently?
>
> Thanks
>
> Liang