Re: Build Spark from source with SBT failed
No, I haven't installed the Apple Developer Tools. I have installed Zulu OpenJDK 11.0.17 manually. So do I need to install the Apple Developer Tools?

- Original Message -
From: Sean Owen
To: ckgppl_...@sina.cn
Cc: user
Subject: Re: Build Spark from source with SBT failed
Date: 2023-03-07 20:58

This says you don't have the Java compiler installed. Did you install the Apple Developer Tools package?

On Tue, Mar 7, 2023 at 1:42 AM wrote:
Hello,

I tried to build the Spark source code with SBT in my local dev environment (macOS 13.2.1), but it reported the following error:

[error] java.io.IOException: Cannot run program "/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/bin/javac" (in directory "/Users/username/spark-remotemaster"): error=2, No such file or directory
[error] at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
[error] at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
[error] at scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:75)
[error] at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.run(ProcessBuilderImpl.scala:106)

I need to export JAVA_HOME to make the build succeed, but if I use Maven I don't need to export JAVA_HOME. I have also tried to build Spark with SBT in an Ubuntu x86_64 environment, and it reported a similar error. The official Spark documentation doesn't mention exporting JAVA_HOME, so I think this is a bug that needs a documentation or script change. Please correct me if I am wrong.

Thanks
Liang
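[Editor's note] For anyone debugging the same failure, here is a minimal, hypothetical Scala sketch (not sbt's or Spark's actual resolution logic) of how a build tool can end up probing a bad javac path: with JAVA_HOME unset it may fall back on the running JVM's java.home property, and "<java.home>/bin/javac" does not exist in every JDK layout.

    import java.io.File

    // Hypothetical diagnostic: print the javac path that a
    // JAVA_HOME-or-fallback lookup would resolve to, and whether
    // that file actually exists on disk.
    object JavacCheck {
      def main(args: Array[String]): Unit = {
        val home  = sys.env.getOrElse("JAVA_HOME", sys.props("java.home"))
        val javac = new File(new File(home, "bin"), "javac")
        println(s"Resolved compiler: $javac (exists: ${javac.exists})")
      }
    }

Running it once with JAVA_HOME exported and once without shows which path the failing build was probing.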
Re: Spark got incorrect Scala version while using Spark 3.2.1 and Spark 3.2.2
Oh, I got it. I thought Spark would pick up the local Scala version.

- Original Message -
From: Sean Owen
To: ckgppl_...@sina.cn
Cc: user
Subject: Re: Spark got incorrect Scala version while using Spark 3.2.1 and Spark 3.2.2
Date: 2022-08-26 21:08

Spark is built with and ships with a copy of Scala. It doesn't use your local version.

On Fri, Aug 26, 2022 at 2:55 AM wrote:
Hi all,

I found a strange thing. I ran the Spark 3.2.1 prebuilt distribution in local mode. My OS Scala version is 2.13.7, but when I ran spark-submit and checked the Spark UI, the web page showed Scala version 2.13.5. I used spark-shell, and it also showed Scala version 2.13.5. Then I tried Spark 3.2.2, and it also showed Scala version 2.13.5. I checked the code; it seems that SparkEnv gets the Scala version from "scala.util.Properties.versionString". Not sure why it shows a different Scala version. Is it a bug or not?

Thanks
Liang
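[Editor's note] A one-line sketch makes the behaviour visible: scala.util.Properties.versionString (the same property SparkEnv reads) reports the version of the scala-library jar on the current classpath, so it answers differently in the OS Scala REPL and in spark-shell:

    // In the OS Scala REPL (scala-library 2.13.7 on the classpath):
    scala> println(scala.util.Properties.versionString)
    version 2.13.7

    // In spark-shell, which runs on Spark's bundled scala-library:
    scala> println(scala.util.Properties.versionString)
    version 2.13.5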
Spark got incorrect Scala version while using Spark 3.2.1 and Spark 3.2.2
Hi all,

I found a strange thing. I ran the Spark 3.2.1 prebuilt distribution in local mode. My OS Scala version is 2.13.7, but when I ran spark-submit and checked the Spark UI, the web page showed Scala version 2.13.5. I used spark-shell, and it also showed Scala version 2.13.5. Then I tried Spark 3.2.2, and it also showed Scala version 2.13.5. I checked the code; it seems that SparkEnv gets the Scala version from "scala.util.Properties.versionString". Not sure why it shows a different Scala version. Is it a bug or not?

Thanks
Liang
Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Thanks, Jayesh and all. I finally got the correlation data frame using agg with a list of functions. I think the approach of passing a list of column-generating functions to agg deserves a more detailed description.

Liang

- Original Message -
From: "Lalwani, Jayesh"
To: "ckgppl_...@sina.cn", Enrico Minack, Sean Owen
Cc: user
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 20:49

No, you don't need 30 dataframes and self joins. Convert a list of columns to a list of functions, and then pass the list of functions to the agg function.

From: "ckgppl_...@sina.cn"
Reply-To: "ckgppl_...@sina.cn"
Date: Wednesday, March 16, 2022 at 8:16 AM
To: Enrico Minack, Sean Owen
Cc: user
Subject: [EXTERNAL] Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

Thanks, Enrico. I just found that I need to group the data frame and then calculate the correlation, so I will get a list of dataframes, not columns. So I used the following solution:
1. Use the following code to create a mutable data frame df_all, using the first datacol to calculate correlation: df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"))
2. Iterate over all remaining datacol columns, creating a temp data frame in each iteration; join df_all with the temp data frame on the groupid column, then drop the duplicated groupid column.
3. After the iteration, I get the dataframe which contains all the correlation data. I need to verify the data to make sure it is valid.

Liang

- Original Message -
From: Enrico Minack
To: ckgppl_...@sina.cn, Sean Owen
Cc: user
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 19:53

If you have a list of Columns called `columns`, you can pass them to the `agg` method as:

agg(columns.head, columns.tail: _*)

Enrico

Am 16.03.22 um 08:02 schrieb ckgppl_...@sina.cn:
Thanks, Sean. I modified the code and generated a list of columns. I am working on converting the list of columns into a new data frame. It seems that there is no direct API to do this.

- Original Message -
From: Sean Owen
To: ckgppl_...@sina.cn
Cc: user
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, adding a new corr column to a list each time.

On Tue, Mar 15, 2022, 10:30 PM wrote:
Hi all,

I am stuck on a correlation calculation problem. I have a dataframe like below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_col
1        1         2         3         4         5
1        2         3         4         6         5
2        4         2         1         7         5
2        8         9         3         2         5
3        7         1         2         3         5
3        3         5         3         1         5

I want to calculate the correlation between all the datacol columns and the corr_col column for each groupid. So I used the following Spark Scala API code:

df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"), functions.corr("datacol2","corr_col"), functions.corr("datacol3","corr_col"), functions.corr("datacol*","corr_col"))

This is very inefficient: with 30 datacol columns, I would have to write functions.corr 30 times. From what I have found, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter.

So is there any Spark Scala API code that can do this job efficiently?

Thanks
Liang
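[Editor's note] A minimal Scala sketch of the suggestion from Jayesh and Enrico (the helper name corrByGroup and the corr_ alias prefix are illustrative; column names follow the example dataframe): build one corr expression per data column and pass the whole list to agg via head/tail varargs, so the aggregation happens in a single pass with no joins.

    import org.apache.spark.sql.{Column, DataFrame, functions}

    // Build one corr(...) expression per data column, then hand the whole
    // list to agg in a single grouped aggregation.
    def corrByGroup(df: DataFrame, dataCols: Seq[String]): DataFrame = {
      val corrCols: Seq[Column] =
        dataCols.map(c => functions.corr(c, "corr_col").alias(s"corr_$c"))
      df.groupBy("groupid").agg(corrCols.head, corrCols.tail: _*)
    }

    // Usage: corrByGroup(df, Seq("datacol1", "datacol2", "datacol3"))
    // yields one row per groupid with a corr_datacolN column per input.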
Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Thanks, Enrico. I just found that I need to group the data frame and then calculate the correlation, so I will get a list of dataframes, not columns. So I used the following solution:
1. Use the following code to create a mutable data frame df_all, using the first datacol to calculate correlation: df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"))
2. Iterate over all remaining datacol columns, creating a temp data frame in each iteration; join df_all with the temp data frame on the groupid column, then drop the duplicated groupid column.
3. After the iteration, I get the dataframe which contains all the correlation data. I need to verify the data to make sure it is valid.

Liang

- Original Message -
From: Enrico Minack
To: ckgppl_...@sina.cn, Sean Owen
Cc: user
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 19:53

If you have a list of Columns called `columns`, you can pass them to the `agg` method as:

agg(columns.head, columns.tail: _*)

Enrico

Am 16.03.22 um 08:02 schrieb ckgppl_...@sina.cn:
Thanks, Sean. I modified the code and generated a list of columns. I am working on converting the list of columns into a new data frame. It seems that there is no direct API to do this.

- Original Message -
From: Sean Owen
To: ckgppl_...@sina.cn
Cc: user
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, adding a new corr column to a list each time.

On Tue, Mar 15, 2022, 10:30 PM wrote:
Hi all,

I am stuck on a correlation calculation problem. I have a dataframe like below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_col
1        1         2         3         4         5
1        2         3         4         6         5
2        4         2         1         7         5
2        8         9         3         2         5
3        7         1         2         3         5
3        3         5         3         1         5

I want to calculate the correlation between all the datacol columns and the corr_col column for each groupid. So I used the following Spark Scala API code:

df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"), functions.corr("datacol2","corr_col"), functions.corr("datacol3","corr_col"), functions.corr("datacol*","corr_col"))

This is very inefficient: with 30 datacol columns, I would have to write functions.corr 30 times. From what I have found, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter.

So is there any Spark Scala API code that can do this job efficiently?

Thanks
Liang
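[Editor's note] For reference, a sketch of the iterative join approach described in the steps above (the helper name corrByJoin is illustrative; joining with Seq("groupid") keeps a single groupid column, so the explicit duplicate-column drop from step 2 is not needed):

    import org.apache.spark.sql.{DataFrame, functions}

    // One grouped aggregate per data column, folded together with
    // equi-joins on groupid. Works, but runs a separate aggregation and
    // join per column; the single-agg version above avoids that cost.
    def corrByJoin(df: DataFrame, dataCols: Seq[String]): DataFrame =
      dataCols
        .map(c => df.groupBy("groupid")
                    .agg(functions.corr(c, "corr_col").alias(s"corr_$c")))
        .reduce(_.join(_, Seq("groupid")))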
Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Thanks, Sean. I modified the code and generated a list of columns. I am working on converting the list of columns into a new data frame. It seems that there is no direct API to do this.

- Original Message -
From: Sean Owen
To: ckgppl_...@sina.cn
Cc: user
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, adding a new corr column to a list each time.

On Tue, Mar 15, 2022, 10:30 PM wrote:
Hi all,

I am stuck on a correlation calculation problem. I have a dataframe like below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_col
1        1         2         3         4         5
1        2         3         4         6         5
2        4         2         1         7         5
2        8         9         3         2         5
3        7         1         2         3         5
3        3         5         3         1         5

I want to calculate the correlation between all the datacol columns and the corr_col column for each groupid. So I used the following Spark Scala API code:

df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"), functions.corr("datacol2","corr_col"), functions.corr("datacol3","corr_col"), functions.corr("datacol*","corr_col"))

This is very inefficient: with 30 datacol columns, I would have to write functions.corr 30 times. From what I have found, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter.

So is there any Spark Scala API code that can do this job efficiently?

Thanks
Liang
calculate correlation between multiple columns and one specific column after groupby the spark data frame
Hi all,

I am stuck on a correlation calculation problem. I have a dataframe like below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_col
1        1         2         3         4         5
1        2         3         4         6         5
2        4         2         1         7         5
2        8         9         3         2         5
3        7         1         2         3         5
3        3         5         3         1         5

I want to calculate the correlation between all the datacol columns and the corr_col column for each groupid. So I used the following Spark Scala API code:

df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"), functions.corr("datacol2","corr_col"), functions.corr("datacol3","corr_col"), functions.corr("datacol*","corr_col"))

This is very inefficient: with 30 datacol columns, I would have to write functions.corr 30 times. From what I have found, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter.

So is there any Spark Scala API code that can do this job efficiently?

Thanks
Liang