Thanks, Enrico.

I just found that I need to group the data frame first and then calculate the
correlation, so I end up with a list of data frames, not a list of columns. I
used the following solution:

1. Use the following code to create a mutable data frame df_all, computing the
   correlation for the first datacol:

   df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"))

2. Iterate over all remaining datacol columns, creating a temporary data frame
   in each iteration. In each iteration, join df_all with the temporary data
   frame on the groupid column, then drop the duplicated groupid column.

3. After the iteration, df_all contains all the correlation data.

I still need to verify the data to make sure it is valid.
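
The iterative join described above could look roughly like the sketch below. The names `CorrByJoin`, `corrByJoin`, and the `corr_<column>` aliases are illustrative and do not come from the thread. The sketch joins on `Seq(groupCol)`, which is an assumption about how the join was written; with that form Spark keeps a single groupid column, so no separate drop is needed.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.corr

object CorrByJoin {
  // Hypothetical helper: one grouped correlation per data column,
  // combined by joining on the group column.
  def corrByJoin(df: DataFrame, groupCol: String,
                 dataCols: Seq[String], target: String): DataFrame = {
    require(dataCols.nonEmpty, "need at least one data column")

    // Temporary data frame for one data column:
    // the group column plus a single correlation column.
    def one(c: String): DataFrame =
      df.groupBy(groupCol).agg(corr(c, target).alias(s"corr_$c"))

    // Start from the first column, then join in the remaining ones.
    // Joining on Seq(groupCol) yields one groupCol column in the result.
    dataCols.tail.foldLeft(one(dataCols.head)) { (dfAll, c) =>
      dfAll.join(one(c), Seq(groupCol))
    }
  }
}
```

This runs one aggregation and one join per data column. Passing all the corr columns to a single `agg` call, as suggested earlier in the thread, avoids the joins.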
Liang

----- Original Message -----
From: Enrico Minack <i...@enrico.minack.dev>
To: ckgppl_...@sina.cn, Sean Owen <sro...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Re: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 19:53

If you have a list of Columns called `columns`, you can pass them to the
`agg` method as:

    agg(columns.head, columns.tail: _*)

Enrico
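
As a minimal sketch of the head/tail pattern in context: the grouped `agg` method has the signature `agg(expr: Column, exprs: Column*)`, so the first aggregate is passed on its own and the rest are expanded as varargs. The wrapper names `AggAll` and `aggAll` are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame}

object AggAll {
  // Hypothetical helper: apply every aggregate in `columns` in one agg call.
  def aggAll(df: DataFrame, groupCol: String, columns: Seq[Column]): DataFrame = {
    // head and tail would fail on an empty list, so check up front.
    require(columns.nonEmpty, "agg needs at least one aggregate column")
    // agg(expr: Column, exprs: Column*) takes the first aggregate separately,
    // so split the list and expand the tail as varargs with `: _*`.
    df.groupBy(groupCol).agg(columns.head, columns.tail: _*)
  }
}
```

The result has one row per group, with the group column followed by one column per aggregate in `columns`.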
    

    
    

    
    On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:

    
    
      
      Thanks, Sean. I modified the code and have generated a list of
      columns. I am now working on converting the list of columns into a
      new data frame. It seems that there is no direct API to do this.
      

      
      
        ----- Original Message -----
        From: Sean Owen <sro...@gmail.com>
        To: ckgppl_...@sina.cn
        Cc: user <user@spark.apache.org>
        Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
        Date: 2022-03-16 11:55

        
        

        
          Are you just trying to avoid writing the function call 30
            times? Just put this in a loop over all the columns instead,
            which adds a new corr col every time to a list. 
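
A loop of this kind might be sketched as below, using `map` to build one `corr` column per data column. The names `CorrColumns` and `corrColumns` and the `corr_<column>` aliases are illustrative, not part of the thread.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.corr

object CorrColumns {
  // Hypothetical helper: one aliased corr(...) aggregate per data column.
  def corrColumns(dataCols: Seq[String], target: String): Seq[Column] =
    dataCols.map(c => corr(c, target).alias(s"corr_$c"))
}
```

Assuming the data columns all share the `datacol` prefix, the input list could be taken from the data frame itself, for example `df.columns.filter(_.startsWith("datacol")).toSeq`.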

            

            
              On Tue, Mar 15, 2022, 10:30 PM
                <ckgppl_...@sina.cn>
                wrote:

              
              
                Hi all,
                

                
                
                  I am stuck on a correlation calculation problem. I have a
                  dataframe like the one below:
                  
                    
                      
                  groupid  datacol1  datacol2  datacol3  datacol*  corr_col
                  00001    1         2         3         4         5
                  00001    2         3         4         6         5
                  00002    4         2         1         7         5
                  00002    8         9         3         2         5
                  00003    7         1         2         3         5
                  00003    3         5         3         1         5
                        
                      
                    
                  
                  I want to calculate the correlation between each datacol
                  column and the corr_col column, per groupid.

                So I used the following Spark Scala API code:

df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"),functions.corr("datacol2","corr_col"),functions.corr("datacol3","corr_col"),functions.corr("datacol*","corr_col"))
                
                  

                    
                  This is very inefficient. If I have 30 datacol columns, I
                  need to write functions.corr 30 times to calculate the
                  correlations.
                  I have searched, and it seems that functions.corr doesn't
                  accept a List/Array parameter, and df.agg doesn't accept a
                  function as a parameter.
                  Is there any Spark Scala API code that can do this job
                  efficiently?
                  

                  
                  Thanks
                
                

                
                Liang
              
            
          
        
      
    
    

    
