Re: Reply: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-29 Thread Sourav Mazumder
Alternatively, you can also try Apache SystemML (http://systemml.apache.org/)
for covariance computation on Spark.

Regards,
Sourav

On Mon, Dec 28, 2015 at 11:29 PM, Sun, Rui  wrote:

> Spark does not support computing a covariance matrix now, but there is a
> PR for it. Maybe you can try it:
> https://issues.apache.org/jira/browse/SPARK-11057
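>
> In the meantime, MLlib's RowMatrix can already compute a full covariance
> matrix (on the RDD-based API, not SparkR). A minimal Java sketch, assuming
> df holds only numeric columns, so adapt the Row conversion to your schema:
>
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.mllib.linalg.Matrix;
> import org.apache.spark.mllib.linalg.Vector;
> import org.apache.spark.mllib.linalg.Vectors;
> import org.apache.spark.mllib.linalg.distributed.RowMatrix;
> import org.apache.spark.sql.DataFrame;
> import org.apache.spark.sql.Row;
>
> public class CovarianceExample {
>     static Matrix covarianceOf(DataFrame df) {
>         // Convert each Row of numeric columns into an MLlib dense vector.
>         JavaRDD<Vector> vectors = df.javaRDD().map((Row row) -> {
>             double[] values = new double[row.length()];
>             for (int i = 0; i < row.length(); i++) {
>                 values[i] = ((Number) row.get(i)).doubleValue();
>             }
>             return Vectors.dense(values);
>         });
>         // computeCovariance() brings an ncol x ncol matrix back to the driver.
>         RowMatrix mat = new RowMatrix(vectors.rdd());
>         return mat.computeCovariance();
>     }
> }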
>
> *From:* zhangjp [mailto:592426...@qq.com]
> *Sent:* Tuesday, December 29, 2015 3:21 PM
> *To:* Felix Cheung; Andy Davidson; Yanbo Liang
> *Cc:* user
> *Subject:* Reply: how to use sparkR or spark MLlib load csv file on hdfs
> then calculate covariance
>
> Now I have a huge number of columns, about 5k-20k. If I want to calculate
> a covariance matrix, which is the best or most common method?
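>
> (For scale: a dense covariance matrix over p columns is p x p doubles, so
> p = 20,000 means 20,000^2 * 8 bytes, roughly 3.2 GB, materialized on the
> driver. The usual one-pass formulation is Cov = (X'X - n*mu*mu') / (n - 1),
> i.e. a Gramian plus column means, which is how MLlib's
> RowMatrix.computeCovariance computes it.)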
>
> -- Original Message --
>
> *From:* "Felix Cheung";
>
> *Sent:* Tuesday, December 29, 2015, 12:45 PM
>
> *To:* "Andy Davidson"; "zhangjp"<592426...@qq.com>; "Yanbo Liang";
>
> *Cc:* "user";
>
> *Subject:* Re: how to use sparkR or spark MLlib load csv file on hdfs
> then calculate covariance
>
> Make sure you add the spark-csv package, as in this example, so that the
> source parameter in R's read.df works:
>
> https://spark.apache.org/docs/latest/sparkr.html#from-data-sources
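>
> For example (the version below is illustrative; pick the spark-csv build
> that matches your Spark/Scala version):
>
> ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.3.0
>
> The same --packages flag works with spark-shell and spark-submit.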
>
> _
> From: Andy Davidson 
> Sent: Monday, December 28, 2015 10:24 AM
> Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then
> calculate covariance
> To: zhangjp <592426...@qq.com>, Yanbo Liang 
> Cc: user 
>
> Hi Yanbo
>
> I use spark-csv to load my data set. I work with both Java and Python. I
> would recommend you print the first couple of rows and also print the
> schema to make sure your data is loaded as you expect. You might find the
> following code example helpful. You may need to set the schema
> programmatically, depending on what your data looks like.
>
> import org.apache.spark.sql.DataFrame;
> import org.apache.spark.sql.SQLContext;
>
> public class LoadTidyDataFrame {
>
>     // Load a CSV file with a header row, letting spark-csv infer the schema.
>     static DataFrame fromCSV(SQLContext sqlContext, String file) {
>         DataFrame df = sqlContext.read()
>                 .format("com.databricks.spark.csv")
>                 .option("inferSchema", "true")  // infer column types from the data
>                 .option("header", "true")       // treat the first line as column names
>                 .load(file);
>         return df;
>     }
> }
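>
> As a quick sanity check after loading (the HDFS path below is just a
> placeholder):
>
> DataFrame df = LoadTidyDataFrame.fromCSV(sqlContext, "hdfs:///path/to/data.csv");
> df.printSchema();  // verify the inferred column types
> df.show(5);        // eyeball the first few rows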
>
> *From:* Yanbo Liang <yblia...@gmail.com>
> *Date:* Monday, December 28, 2015 at 2:30 AM
> *To:* zhangjp <592426...@qq.com>
> *Cc:* "user @spark" <user@spark.apache.org>
> *Subject:* Re: how to use sparkR or spark MLlib load csv file on hdfs
> then calculate covariance
>
> Load csv file:
>
> df <- read.df(sqlContext, "file-path", source =
> "com.databricks.spark.csv", header = "true")
>
> Calculate covariance:
>
> cov <- cov(df, "col1", "col2")
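>
> (Note that cov(df, "col1", "col2") returns a single number, the sample
> covariance of the two named columns; as noted earlier in the thread,
> SparkR does not yet have a call that returns the full covariance matrix.)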
>
> Cheers
>
> Yanbo
>
> 2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>:
>
> hi all,
>
> I want to use sparkR or spark MLlib to load a csv file on hdfs and then
> calculate covariance. How do I do it?
>
> thanks.