Reply: Re: Reply: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

ckgppl_yan Wed, 16 Mar 2022 05:15:21 -0700

Thanks, Enrico. I just found that I need to group the data frame first and then calculate the correlation, so I end up with a list of data frames, not a list of columns. I used the following solution.

First, create a mutable data frame df_all, using the first datacol to calculate the correlation:

    df.groupBy("groupid").agg(functions.corr("datacol1", "corr_col"))

Then iterate over all remaining datacol columns and create a temp data frame in each iteration. In that iteration, join df_all with the temp data frame on the groupid column, then drop the duplicated groupid column.

After the iteration, I get a data frame that contains all the correlation data.
I still need to verify the data to make sure it is valid.
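The join-based procedure described above might look roughly like the following sketch. The object name `CorrByJoin`, the helper `corrByJoin`, the `corr_<name>` aliases, and the sample rows in `main` are all hypothetical, and a `foldLeft` stands in for the mutable df_all. It assumes Spark is on the classpath and that `dataCols` is non-empty.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.corr

object CorrByJoin {
  // Builds one aggregated data frame per data column and joins them on the group column.
  // Assumes dataCols is non-empty; names and aliases are illustrative only.
  def corrByJoin(df: DataFrame, groupCol: String, dataCols: Seq[String], target: String): DataFrame = {
    // One small data frame: groupCol plus the correlation of a single column with target.
    def one(c: String): DataFrame =
      df.groupBy(groupCol).agg(corr(c, target).alias(s"corr_$c"))

    // Start from the first column (the "df_all" of the message) and join in the rest.
    // Joining on Seq(groupCol) keeps a single copy of the group column,
    // so no separate drop of a duplicated column is needed.
    dataCols.tail.foldLeft(one(dataCols.head)) { (dfAll, c) =>
      dfAll.join(one(c), Seq(groupCol))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("corr-by-join").getOrCreate()
    import spark.implicits._

    // Hypothetical rows, not the data from the thread.
    val df = Seq(
      ("a", 1.0, 2.0, 1.0), ("a", 2.0, 1.0, 3.0), ("a", 4.0, 0.5, 4.0),
      ("b", 1.0, 5.0, 2.0), ("b", 3.0, 4.0, 6.0), ("b", 5.0, 9.0, 7.0)
    ).toDF("groupid", "datacol1", "datacol2", "corr_col")

    corrByJoin(df, "groupid", Seq("datacol1", "datacol2"), "corr_col").show()
    spark.stop()
  }
}
```

This runs one aggregation per column plus a join for each, so it is likely to do more work than passing all corr columns to a single `agg` call.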
Liang

----- Original message -----
From: Enrico Minack <i...@enrico.minack.dev>
To: ckgppl_...@sina.cn, Sean Owen <sro...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Reply: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 16 Mar 2022, 19:53


    If you have a list of Columns called `columns`, you can pass them to the `agg` method as:

      agg(columns.head, columns.tail: _*)
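As a fuller sketch of that call: `agg` on a grouped data frame takes a first Column followed by varargs, which is why the list is split into `head` and `tail`. The object name `CorrPerGroup`, the helper `corrByGroup`, the `corr_<name>` aliases, and the sample rows are hypothetical; the list must be non-empty, since `head` fails on an empty one.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.corr

object CorrPerGroup {
  // agg has the shape agg(expr: Column, exprs: Column*), so a Seq[Column]
  // is passed as its head plus the tail expanded as varargs.
  // Assumes columns is non-empty.
  def corrByGroup(df: DataFrame, groupCol: String, columns: Seq[Column]): DataFrame =
    df.groupBy(groupCol).agg(columns.head, columns.tail: _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("corr-per-group").getOrCreate()
    import spark.implicits._

    // Hypothetical rows, not the data from the thread.
    val df = Seq(
      ("00001", 1.0, 2.0, 3.0, 1.5), ("00001", 2.0, 3.0, 4.0, 2.5), ("00001", 4.0, 1.0, 6.0, 2.0),
      ("00002", 4.0, 2.0, 1.0, 7.0), ("00002", 8.0, 9.0, 3.0, 2.0), ("00002", 5.0, 6.0, 2.0, 4.0)
    ).toDF("groupid", "datacol1", "datacol2", "datacol3", "corr_col")

    // One corr column per data column, each paired with corr_col.
    val columns: Seq[Column] =
      Seq("datacol1", "datacol2", "datacol3").map(c => corr(c, "corr_col").alias(s"corr_$c"))

    // Result: one row per groupid, one correlation column per data column.
    corrByGroup(df, "groupid", columns).show()
    spark.stop()
  }
}
```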

    Enrico

    On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:

      Thanks, Sean. I modified the code and have generated a list of columns.
      I am now working on converting the list of columns into a new data frame. It seems that there is no direct API to do this.

        ----- Original message -----
          From: Sean Owen <sro...@gmail.com>
          To: ckgppl_...@sina.cn
          Cc: user <user@spark.apache.org>
          Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
          Date: 16 Mar 2022, 11:55

          Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, which adds a new corr column to a list on each iteration.
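One way to write such a loop is sketched below. It assumes the data columns can be picked out of the schema by a shared name prefix such as "datacol"; that selection rule, the object name `BuildCorrList`, and the `corr_<name>` aliases are illustrative assumptions, not something stated in the thread.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.corr

object BuildCorrList {
  // Picks the data columns out of the full column list by prefix (an assumed naming rule).
  def dataColumnNames(allColumns: Seq[String], prefix: String): Seq[String] =
    allColumns.filter(_.startsWith(prefix))

  // Loops over the selected names and yields one corr(...) Column per name,
  // each paired with the target column.
  def corrList(allColumns: Seq[String], prefix: String, target: String): Seq[Column] =
    for (name <- dataColumnNames(allColumns, prefix))
      yield corr(name, target).alias(s"corr_$name")
}
```

With a data frame `df`, the input would typically be `df.columns.toSeq`, for example `BuildCorrList.corrList(df.columns.toSeq, "datacol", "corr_col")`.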

            

            
              On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:

                Hi all,

                I am stuck on a correlation calculation problem. I have a data frame like the one below:
                  
                    
                      
                groupid  datacol1  datacol2  datacol3  datacol*  corr_col
                00001    1         2         3         4         5
                00001    2         3         4         6         5
                00002    4         2         1         7         5
                00002    8         9         3         2         5
                00003    7         1         2         3         5
                00003    3         5         3         1         5
                      
                    
                  
                I want to calculate the correlation between each datacol column and the corr_col column, separately for each groupid.
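Spark's `functions.corr` returns the Pearson correlation coefficient. For spot-checking a single group by hand, a plain-Scala version of that formula could look like the sketch below; the object name `PearsonCheck` and the helper `pearson` are hypothetical. Note that in the sample table corr_col is constant (5) within every group, so the denominator is zero and the coefficient is undefined for that data; how Spark reports that case is not covered here.

```scala
object PearsonCheck {
  // Pearson coefficient: sum((x - meanX) * (y - meanY)) /
  //                      sqrt(sum((x - meanX)^2) * sum((y - meanY)^2))
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.nonEmpty, "inputs must be non-empty and of equal length")
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    val dx = xs.map(_ - meanX)
    val dy = ys.map(_ - meanY)
    val cov = dx.zip(dy).map { case (a, b) => a * b }.sum
    val varX = dx.map(d => d * d).sum
    val varY = dy.map(d => d * d).sum
    // A constant input makes the denominator 0.0, giving NaN in Double arithmetic.
    cov / math.sqrt(varX * varY)
  }

  def main(args: Array[String]): Unit = {
    // Perfectly linear hypothetical values.
    println(pearson(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0))) // prints 1.0
    // datacol1 and corr_col for groupid 00001 from the table: corr_col is constant.
    println(pearson(Seq(1.0, 2.0), Seq(5.0, 5.0))) // prints NaN
  }
}
```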
                
                So I used the following Spark Scala API code:

                df.groupBy("groupid").agg(
                  functions.corr("datacol1", "corr_col"),
                  functions.corr("datacol2", "corr_col"),
                  functions.corr("datacol3", "corr_col"),
                  functions.corr("datacol*", "corr_col"))
                
                  

                    
                This is very inefficient. If I have 30 datacol columns, I need to write functions.corr 30 times to calculate the correlations.
                I have searched, and it seems that functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter.
                Is there any Spark Scala API code that can do this job efficiently?
                  

                  
                  Thanks
                
                

                
                Liang
