Re: spark sql query optimization, and decision tree building

2014-10-27 Thread Yanbo Liang
If you want to calculate the mean, variance, minimum, maximum, and total count
for each column, especially for machine learning features, you can try
MultivariateOnlineSummarizer. It implements a numerically stable algorithm
that computes the sample mean and variance by column in an online fashion. It
supports both sparse and dense vectors, which can be constructed from the
column features. The time complexity is O(nnz) instead of O(n) per column,
where nnz is the number of nonzeros in that column.
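
A minimal sketch of that approach, assuming the 128-column lines are
space-delimited numeric text (the input path and delimiter below are
placeholders):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

  // Parse each 128-column line into a dense vector (delimiter assumed).
  val vectors = sc.textFile("siftoutput/")
    .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))

  // Fold a summarizer over each partition, then merge the partials.
  val summary = vectors.aggregate(new MultivariateOnlineSummarizer)(
    (s, v) => s.add(v),
    (s1, s2) => s1.merge(s2))

  println(summary.mean)      // per-column mean
  println(summary.variance)  // per-column sample variance
  println(summary.max)       // per-column maximum
  println(summary.count)     // total row count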

2014-10-23 1:09 GMT+08:00 sanath kumar sanath1...@gmail.com:

 Thank you very much,

 Two more small questions:

 1) val output = sqlContext.sql("SELECT * FROM people")
 My output has 128 columns and a single row.
 How do I find which column has the maximum value in that single row using
 Scala?

 2) As each row has 128 columns, how do I print each row to a text file with
 space delimiting, or as JSON, using Scala?

 Please reply.

 Thanks,
 Sanath


 On Wed, Oct 22, 2014 at 8:24 AM, Cheng, Hao hao.ch...@intel.com wrote:

  The “output” variable is actually a SchemaRDD; it provides a rich DSL
 API, see
 http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
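
 For example, a DSL query in the style of the Spark 1.1 programming guide
 (the column names here are hypothetical, for illustration only):

   // Requires import sqlContext._ for the Symbol-based column syntax.
   val teenagers = people.where('age >= 10).where('age <= 19).select('name)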



 1) How do I save the result values of a query into a list?

 [CH:] val list: Array[Row] = output.collect(); however, collecting 1M
 records into an array is probably not a good idea.



 2) How do I calculate the variance of a column? Is there an efficient way?

 [CH:] Not sure what that means, but you can try
 output.select('colname) with a groupBy?
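
 One hedged way to get a single column's variance directly in SQL (Spark SQL
 1.1 has no built-in variance aggregate) is E[X^2] - E[X]^2; the column name
 c0 below is hypothetical, and this formula is less numerically stable than
 an online algorithm such as MultivariateOnlineSummarizer:

   // Population variance of hypothetical numeric column c0.
   val row = sqlContext.sql(
     "SELECT AVG(c0 * c0) - AVG(c0) * AVG(c0) FROM people").first()
   val variance = row.getDouble(0)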


 3) I will be running multiple queries on the same data. Does Spark have any
 way to optimize this?

 [CH:] val cachedRdd = output.cache(), and then do whatever you need based
 on cachedRdd.
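
 A sketch of that; sqlContext.cacheTable is a hedged alternative for
 SQL-heavy workloads, since it uses the in-memory columnar store:

   // Cache the RDD so repeated queries reuse memory instead of re-reading JSON.
   val cachedRdd = output.cache()
   // Alternative: cache the registered table in the columnar in-memory store.
   sqlContext.cacheTable("people")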


 4) How do I save the output as key-value pairs in a text file?

 [CH:] cachedRdd.generate(xx,xx,xx).saveAsTextFile(xx)
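
 If generate is not a fit, a plain map over the rows also works; treating
 the first column as the key is an arbitrary assumption here, and the output
 path is a placeholder:

   // Row extends Seq[Any], so row(0) and mkString are available.
   cachedRdd
     .map(row => row(0) + "\t" + row.mkString(" "))
     .saveAsTextFile("kvoutput/")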



  5) Is there any way I can build a kd-tree using Spark's machine learning
 libraries?

 [CH:] Sorry, I am not sure how a kd-tree would be used in MLlib, but keep
 in mind that a SchemaRDD is just a normal RDD.



 Cheng Hao



 *From:* sanath kumar [mailto:sanath1...@gmail.com]
 *Sent:* Wednesday, October 22, 2014 12:58 PM
 *To:* user@spark.apache.org
 *Subject:* spark sql query optimization, and decision tree building



 Hi all,



 I have a large dataset in text files (1,000,000 lines). Each line has 128
 columns; each line is a feature vector and each column is a dimension.

 I have converted the txt files to JSON format and am able to run SQL
 queries on the JSON files using Spark.

 Now I am trying to build a k-dimensional tree (kd-tree) with this large
 data.

 My steps:
 1) Calculate the variance of each column, pick the column with the maximum
 variance, make it the key of the first node, and use the mean of that
 column as the node's value.
 2) Based on the first node's value, split the data into 2 parts and repeat
 the process until a stopping point is reached (a minimal sketch follows
 below).
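
 A minimal in-memory sketch of those two steps, assuming the rows fit on one
 machine (a distributed version would compute the per-column statistics with
 Spark instead):

   case class KdNode(col: Int, split: Double,
                     left: Option[KdNode], right: Option[KdNode])

   def build(rows: Array[Array[Double]], depth: Int): Option[KdNode] = {
     if (rows.length <= 1 || depth == 0) return None
     val dims = rows.head.length
     // Per-column mean and population variance.
     val means = Array.tabulate(dims)(c => rows.map(_(c)).sum / rows.length)
     val vars = Array.tabulate(dims) { c =>
       rows.map(r => (r(c) - means(c)) * (r(c) - means(c))).sum / rows.length
     }
     val col = vars.indexOf(vars.max)            // max-variance column
     val (lo, hi) = rows.partition(_(col) < means(col))
     if (lo.isEmpty || hi.isEmpty) return None   // degenerate split
     Some(KdNode(col, means(col), build(lo, depth - 1), build(hi, depth - 1)))
   }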

 My sample code:

 import sqlContext._

 val people = sqlContext.jsonFile("siftoutput/")

 people.printSchema()

 people.registerTempTable("people")

 val output = sqlContext.sql("SELECT * FROM people")

 My Questions:

 1) How do I save the result values of a query into a list?

 2) How do I calculate the variance of a column? Is there an efficient way?
 3) I will be running multiple queries on the same data. Does Spark have any
 way to optimize this?
 4) How do I save the output as key-value pairs in a text file?

 5) Is there any way I can build a kd-tree using Spark's machine learning
 libraries?

 Please help.

 Thanks,

 Sanath






