Are you just trying to avoid writing the function call 30 times? If so, build the list of correlation expressions in a loop over the columns instead, adding one corr column per data column, and pass that list to a single agg call.
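A minimal sketch of that approach (assuming Spark 3.x; the `dataCols` sequence and the column names follow the question below, and would be extended to all 30 columns):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.corr

// Names of the data columns to correlate against corr_col.
val dataCols: Seq[String] = Seq("datacol1", "datacol2", "datacol3")

// One corr(...) expression per column, aliased so the output columns are readable.
val corrExprs: Seq[Column] =
  dataCols.map(c => corr(c, "corr_col").alias(s"corr_$c"))

// agg takes (Column, Column*), so split the list into head and tail.
val result = df.groupBy("groupid").agg(corrExprs.head, corrExprs.tail: _*)
```

This way the number of columns only changes the contents of `dataCols`, not the amount of code.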
On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:
> Hi all,
>
> I am stuck at a correlation calculation problem. I have a dataframe like
> below:
>
> groupid  datacol1  datacol2  datacol3  datacol*  corr_col
> 00001    1         2         3         4         5
> 00001    2         3         4         6         5
> 00002    4         2         1         7         5
> 00002    8         9         3         2         5
> 00003    7         1         2         3         5
> 00003    3         5         3         1         5
>
> I want to calculate the correlation between all datacol columns and the
> corr_col column, grouped by groupid.
> So I used the following Spark Scala API code:
>
> df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"),
>   functions.corr("datacol2","corr_col"),
>   functions.corr("datacol3","corr_col"),
>   functions.corr("datacol*","corr_col"))
>
> This is very inefficient. If I have 30 data_col columns, I need to write
> functions.corr 30 times to calculate the correlations.
>
> From what I have found, functions.corr doesn't accept a List/Array
> parameter, and df.agg doesn't accept a function as a parameter.
> Is there any Spark Scala API code that can do this job efficiently?
>
> Thanks
>
> Liang