Re: Build SPARK from source with SBT failed

2023-03-07 Thread ckgppl_yan
No. I haven't installed the Apple Developer Tools. I have installed Zulu OpenJDK
11.0.17 manually. So do I need to install the Apple Developer Tools?
- Original Message -
From: Sean Owen
To: ckgppl_...@sina.cn
Cc: user
Subject: Re: Build SPARK from source with SBT failed
Date: 2023-03-07 20:58

This says you don't have the java compiler installed. Did you install the Apple 
Developer Tools package?
On Tue, Mar 7, 2023 at 1:42 AM  wrote:
Hello,
I have tried to build the SPARK source code with SBT in my local dev environment
(macOS 13.2.1), but it reported the following error:
[error] java.io.IOException: Cannot run program
"/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/bin/javac" (in
directory "/Users/username/spark-remotemaster"): error=2, No such file or
directory
[error] at 
java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
[error] at 
java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
[error] at 
scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:75)
[error] at 
scala.sys.process.ProcessBuilderImpl$AbstractBuilder.run(ProcessBuilderImpl.scala:106)
I need to export JAVA_HOME to make it run successfully, but if I use Maven I
don't need to export JAVA_HOME. I have also tried to build SPARK with SBT in an
Ubuntu x86_64 environment, and it reported a similar error.

The official SPARK documentation doesn't mention the need to export JAVA_HOME.
So I think this is a bug which needs a documentation or script change. Please
correct me if I am wrong.
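
(For illustration only: one way to pin the JDK for sbt without exporting
JAVA_HOME is sbt's standard javaHome setting. This is a local sketch, not a
proposed change to the Spark build scripts, and the path below merely assumes a
default Zulu 11 install:)

  // build.sbt fragment (sketch): selects the Java installation sbt uses for
  // compiling and forking. Adjust the path to the local Zulu 11 installation.
  javaHome := Some(file("/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home"))
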
Thanks
Liang


Re: Spark got incorrect scala version while using spark 3.2.1 and spark 3.2.2

2022-08-26 Thread ckgppl_yan
Oh, I got it. I thought SPARK would pick up the local Scala version.
- Original Message -
From: Sean Owen
To: ckgppl_...@sina.cn
Cc: user
Subject: Re: Spark got incorrect scala version while using spark 3.2.1 and spark 3.2.2
Date: 2022-08-26 21:08

Spark is built with and ships with a copy of Scala. It doesn't use your local 
version.
On Fri, Aug 26, 2022 at 2:55 AM  wrote:
Hi all,
I found a strange thing. I have run the SPARK 3.2.1 prebuilt package in local
mode. My OS Scala version is 2.13.7, but when I run spark-submit and check the
Spark UI, the web page shows that my Scala version is 2.13.5. spark-shell also
shows that my Scala version is 2.13.5. Then I tried SPARK 3.2.2, and it also
shows 2.13.5. I checked the code; it seems that SparkEnv gets the Scala version
from "scala.util.Properties.versionString". Not sure why it shows a different
Scala version. Is it a bug or not?
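
(A minimal check of what that property reports, for illustration; run inside
spark-shell:)

  // Prints the version of the scala-library on the JVM classpath, e.g.
  // "version 2.13.5" inside spark-shell, regardless of the Scala version
  // installed on the OS.
  println(scala.util.Properties.versionString)
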
Thanks
Liang

Spark got incorrect scala version while using spark 3.2.1 and spark 3.2.2

2022-08-26 Thread ckgppl_yan
Hi all,
I found a strange thing. I have run the SPARK 3.2.1 prebuilt package in local
mode. My OS Scala version is 2.13.7, but when I run spark-submit and check the
Spark UI, the web page shows that my Scala version is 2.13.5. spark-shell also
shows that my Scala version is 2.13.5. Then I tried SPARK 3.2.2, and it also
shows 2.13.5. I checked the code; it seems that SparkEnv gets the Scala version
from "scala.util.Properties.versionString". Not sure why it shows a different
Scala version. Is it a bug or not?
Thanks
Liang

Re: Re: Re: calculate correlation_between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame

2022-03-16 Thread ckgppl_yan
Thanks, Jayesh and all. I finally got the correlation data frame using agg with
a list of functions. I think the functions which generate a Column deserve a
more detailed description in the documentation.
Liang
- Original Message -
From: "Lalwani, Jayesh"
To: "ckgppl_...@sina.cn", Enrico Minack, Sean Owen
Cc: user
Subject: Re: Re: Re: calculate
correlation_between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame
Date: 2022-03-16 20:49


No, You don’t need 30 dataframes and self joins. Convert a list of columns to a 
list of functions, and then pass the list of functions to the agg function
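
(A minimal sketch of that approach, for illustration; the column names are
assumed from the example further down the thread:)

  import org.apache.spark.sql.functions.corr

  // Build one corr(...) aggregate Column per data column, then pass them all to agg.
  val dataCols = Seq("datacol1", "datacol2", "datacol3")   // extend to all 30 columns
  val aggs     = dataCols.map(c => corr(c, "corr_col").alias(s"corr_$c"))
  val result   = df.groupBy("groupid").agg(aggs.head, aggs.tail: _*)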
 
 

From: "ckgppl_...@sina.cn" 

Reply-To: "ckgppl_...@sina.cn" 

Date: Wednesday, March 16, 2022 at 8:16 AM

To: Enrico Minack , Sean Owen 

Cc: user 

Subject: [EXTERNAL] 回复:Re:
回复:Re: calculate correlation 
between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame


 






Thanks, Enrico.

I just found that I need to group the data frame and then calculate the
correlation, so I will get a list of dataframes, not columns.

So I used the following solution (a code sketch follows below):

1. Use the following code to create a mutable data frame df_all. I used the
   first datacol to calculate the correlation:
   df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"))
2. Iterate over all remaining datacol columns, creating a temp data frame in
   each iteration. In each iteration, join df_all with the temp data frame on
   the groupid column, then drop the duplicated groupid column.
3. After the iteration, I get the dataframe which contains all the correlation
   data.

I need to verify the data to make sure it is valid.

Liang
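
(A rough sketch of the iterative-join approach described above, for
illustration only; the column names are assumed from the example further down
the thread, and the single agg call suggested earlier in the thread avoids
these repeated joins:)

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.corr

  val dataCols = Seq("datacol1", "datacol2", "datacol3")   // all data columns

  // Start from the first correlation, then join one more correlation column per iteration.
  var dfAll: DataFrame =
    df.groupBy("groupid").agg(corr(dataCols.head, "corr_col").alias(s"corr_${dataCols.head}"))
  for (c <- dataCols.tail) {
    val tmp = df.groupBy("groupid").agg(corr(c, "corr_col").alias(s"corr_$c"))
    dfAll = dfAll.join(tmp, Seq("groupid"))   // joining on the name avoids a duplicated groupid column
  }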



- Original Message -
From: Enrico Minack
To: ckgppl_...@sina.cn, Sean Owen
Cc: user
Subject: Re: Re: calculate correlation between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame
Date: 2022-03-16 19:53

 

If you have a list of Columns called `columns`, you can pass them to the `agg`
method as:

  agg(columns.head, columns.tail: _*)

Enrico

On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:

Thanks, Sean. I modified the code and have generated a list of columns.
I am working on converting the list of columns into a new data frame. It seems
that there is no direct API to do this.


 



- Original Message -
From: Sean Owen
To: ckgppl_...@sina.cn
Cc: user
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

 


Are you just trying to avoid writing the function call 30 times? Just put this
in a loop over all the columns instead, which adds a new corr col every time to
a list.

On Tue, Mar 15, 2022, 10:30 PM  wrote:

Hi all,

I am stuck at a correlation calculation problem. I have a dataframe like below:





groupid  datacol1  datacol2  datacol3  datacol*  corr_co
1        1         2         3         4         5
1        2         3         4         6         5
2        4         2         1         7         5
2        8         9         3         2         5
3        7         1         2         3         5
3        3         5         3         1         5

I want to calculate the correlation between all datacol columns and the
corr_col column for each groupid.

So I used the following Spark Scala API code:

df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"),functions.corr("datacol2","corr_col"),functions.corr("datacol3","corr_col"),functions.corr("datacol*","corr_col"))

This is very inefficient. If I have 30 datacol columns, I need to write
functions.corr 30 times to calculate the correlations.

I have searched, and it seems that functions.corr doesn't accept a List/Array
parameter, and df.agg doesn't accept a function as a parameter.

Is there any Spark Scala API code that can do this job efficiently?


 


Thanks

Liang


Re: Re: calculate correlation between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame

2022-03-16 Thread ckgppl_yan
Thanks, Enrico. I just found that I need to group the data frame and then
calculate the correlation, so I will get a list of dataframes, not columns. So
I used the following solution:
1. Use the following code to create a mutable data frame df_all, using the
   first datacol to calculate the correlation:
   df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"))
2. Iterate over all remaining datacol columns, creating a temp data frame in
   each iteration. In each iteration, join df_all with the temp data frame on
   the groupid column, then drop the duplicated groupid column.
3. After the iteration, I get the dataframe which contains all the correlation
   data.
I need to verify the data to make sure it is valid.
Liang
- Original Message -
From: Enrico Minack
To: ckgppl_...@sina.cn, Sean Owen
Cc: user
Subject: Re: Re: calculate correlation between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame
Date: 2022-03-16 19:53

If you have a list of Columns called `columns`, you can pass them to the `agg`
method as:

  agg(columns.head, columns.tail: _*)

Enrico

On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:

Thanks, Sean. I modified the code and have generated a list of columns.
I am working on converting the list of columns into a new data frame. It seems
that there is no direct API to do this.

- Original Message -
From: Sean Owen
To: ckgppl_...@sina.cn
Cc: user
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55





Are you just trying to avoid writing the function call 30 times? Just put this
in a loop over all the columns instead, which adds a new corr col every time to
a list.

On Tue, Mar 15, 2022, 10:30 PM  wrote:

Hi all,

I am stuck at a correlation calculation problem. I have a dataframe like below:
  

  
groupid  datacol1  datacol2  datacol3  datacol*  corr_co
1        1         2         3         4         5
1        2         3         4         6         5
2        4         2         1         7         5
2        8         9         3         2         5
3        7         1         2         3         5
3        3         5         3         1         5
  
I want to calculate the correlation between all datacol columns and the
corr_col column for each groupid.

So I used the following Spark Scala API code:

df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"),functions.corr("datacol2","corr_col"),functions.corr("datacol3","corr_col"),functions.corr("datacol*","corr_col"))

This is very inefficient. If I have 30 datacol columns, I need to write
functions.corr 30 times to calculate the correlations.
I have searched, and it seems that functions.corr doesn't accept a List/Array
parameter, and df.agg doesn't accept a function as a parameter.
Is there any Spark Scala API code that can do this job efficiently?

Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

2022-03-16 Thread ckgppl_yan
Thanks, Sean. I modified the code and have generated a list of columns. I am
working on converting the list of columns into a new data frame. It seems that
there is no direct API to do this.
- Original Message -
From: Sean Owen
To: ckgppl_...@sina.cn
Cc: user
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put this 
in a loop over all the columns instead, which adds a new corr col every time to 
a list. 
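
(A minimal sketch of that loop, for illustration; the column names are assumed
from the question below:)

  import scala.collection.mutable.ListBuffer
  import org.apache.spark.sql.Column
  import org.apache.spark.sql.functions.corr

  // Collect one corr(...) Column per data column, then hand the whole list to agg.
  val corrCols = ListBuffer[Column]()
  for (c <- Seq("datacol1", "datacol2", "datacol3"))   // extend to all 30 columns
    corrCols += corr(c, "corr_col").alias(s"corr_$c")
  val result = df.groupBy("groupid").agg(corrCols.head, corrCols.tail: _*)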

On Tue, Mar 15, 2022, 10:30 PM   wrote:
Hi all,
I am stuck at a correlation calculation problem. I have a dataframe like below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_co
1        1         2         3         4         5
1        2         3         4         6         5
2        4         2         1         7         5
2        8         9         3         2         5
3        7         1         2         3         5
3        3         5         3         1         5

I want to calculate the correlation between all datacol columns and the
corr_col column for each groupid. So I used the following Spark Scala API code:
df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"),functions.corr("datacol2","corr_col"),functions.corr("datacol3","corr_col"),functions.corr("datacol*","corr_col"))
This is very inefficient. If I have 30 datacol columns, I need to write
functions.corr 30 times to calculate the correlations. I have searched, and it
seems that functions.corr doesn't accept a List/Array parameter, and df.agg
doesn't accept a function as a parameter. Is there any Spark Scala API code
that can do this job efficiently?
Thanks
Liang

calculate correlation between multiple columns and one specific column after groupby the spark data frame

2022-03-15 Thread ckgppl_yan
Hi all,
I am stuck at a correlation calculation problem. I have a dataframe like below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_co
1        1         2         3         4         5
1        2         3         4         6         5
2        4         2         1         7         5
2        8         9         3         2         5
3        7         1         2         3         5
3        3         5         3         1         5

I want to calculate the correlation between all datacol columns and the
corr_col column for each groupid. So I used the following Spark Scala API code:
df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"),functions.corr("datacol2","corr_col"),functions.corr("datacol3","corr_col"),functions.corr("datacol*","corr_col"))
This is very inefficient. If I have 30 datacol columns, I need to write
functions.corr 30 times to calculate the correlations. I have searched, and it
seems that functions.corr doesn't accept a List/Array parameter, and df.agg
doesn't accept a function as a parameter. Is there any Spark Scala API code
that can do this job efficiently?
Thanks
Liang