The “output” variable is actually a SchemaRDD, which provides a rich DSL API; see
http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD


1) How do I save the result values of a query into a list?

[CH:] val list: Array[Row] = output.collect(), but pulling 1M records into a
driver-side array is usually not a good idea; if you only need part of the
result, use output.take(n) instead.



2) How do I calculate the variance of a column? Is there an efficient way?

[CH:] Not sure exactly what you mean, but you can start with
output.select('colname) and then aggregate over that column.
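
If it helps: the per-column variance is just E[x^2] - E[x]^2, which is
computable in a single pass. A minimal plain-Scala sketch of that formula
(outside Spark, so it is easy to try; on an RDD you would compute the same
three sums, count/sum/sum-of-squares, with aggregate over the column):

```scala
object VarianceSketch {
  // Population variance via the sums-of-squares identity:
  // var(x) = E[x^2] - E[x]^2, computable in one pass over the column.
  def variance(xs: Seq[Double]): Double = {
    val n = xs.length.toDouble
    val sum = xs.sum
    val sumSq = xs.map(x => x * x).sum
    sumSq / n - math.pow(sum / n, 2)
  }

  def main(args: Array[String]): Unit = {
    val col = Seq(1.0, 2.0, 3.0, 4.0)
    // mean = 2.5, so variance = 7.5 - 6.25 = 1.25
    assert(math.abs(variance(col) - 1.25) < 1e-9)
    println(variance(col))
  }
}
```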

3) I will be running multiple queries on the same data. Does Spark have any
way to optimize this?

[CH:] val cachedRdd = output.cache(), and do whatever you need to do based on
cachedRdd. With Spark SQL you can also call sqlContext.cacheTable("people") to
keep the table in an in-memory columnar format across queries.

4) How do I save the output as key-value pairs in a text file?

[CH:] Map each Row to a (key, value) tuple and save it, e.g.
cachedRdd.map(row => (row(0), row(1))).saveAsTextFile(path).
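
Note that saveAsTextFile just writes each element's toString, so tuples come
out as "(key,value)"; if you want a cleaner layout, format the line yourself
in the map. A small plain-Scala sketch of such a formatter (the key/value
choice here, first column as key and the rest as values, is only an example):

```scala
object KeyValueFormat {
  // Build the tab-separated line that rdd.map(...).saveAsTextFile(path)
  // would write, one line per record: key, tab, comma-joined values.
  def toLine(key: String, values: Seq[Double]): String =
    key + "\t" + values.mkString(",")

  def main(args: Array[String]): Unit = {
    val line = toLine("row1", Seq(1.0, 2.0))
    assert(line == "row1\t1.0,2.0")
    println(line)
  }
}
```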



5) Is there any way I can build a kd decision tree using Spark's machine
learning libraries?
[CH:] Sorry, I am not sure about kd-tree support in MLlib (its DecisionTree is
aimed at classification/regression, not kd-trees), but keep in mind a
SchemaRDD is just a normal RDD, so you can implement the tree construction
yourself.
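
For example, the split step you describe (pick the max-variance column, split
at its mean) is easy to write by hand. A plain-Scala sketch of one split,
assuming each row is an Array[Double]; on an RDD you would swap the Seq
operations for map/aggregate/filter:

```scala
object KdSplit {
  // Population variance of one column.
  def variance(xs: Seq[Double]): Double = {
    val n = xs.length.toDouble
    val mean = xs.sum / n
    xs.map(x => math.pow(x - mean, 2)).sum / n
  }

  // One kd-tree split: choose the dimension with maximum variance,
  // then partition the rows around that dimension's mean.
  def split(rows: Seq[Array[Double]])
      : (Int, Double, Seq[Array[Double]], Seq[Array[Double]]) = {
    val dims = rows.head.length
    val bestDim = (0 until dims).maxBy(d => variance(rows.map(_(d))))
    val mean = rows.map(_(bestDim)).sum / rows.length
    val (left, right) = rows.partition(_(bestDim) < mean)
    (bestDim, mean, left, right)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Array(1.0, 10.0), Array(1.0, 20.0), Array(1.0, 30.0))
    val (dim, mean, left, right) = split(rows)
    // Column 1 carries all the variance; its mean is 20.0.
    assert(dim == 1 && mean == 20.0)
    assert(left.length == 1 && right.length == 2)
    println(s"split on dim $dim at $mean")
  }
}
```

Recursing on left and right until a depth or size threshold gives the tree.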

Cheng Hao

From: sanath kumar [mailto:sanath1...@gmail.com]
Sent: Wednesday, October 22, 2014 12:58 PM
To: user@spark.apache.org
Subject: spark sql query optimization , and decision tree building

Hi all ,


I have a large dataset in text files (1,000,000 lines). Each line has 128
columns; each line is a feature vector and each column is a dimension.

I have converted the text files to JSON format and am able to run SQL queries
on the JSON files using Spark.

Now I am trying to build a k-dimensional decision tree (kd-tree) with this
large data.

My steps:
1) Calculate the variance of each column, pick the column with the maximum
variance, make it the key of the first node, and make the mean of that column
the value of the node.
2) Based on the first node's value, split the data into 2 parts and repeat the
process until a stopping point is reached.

My sample code:

// assumes an existing SparkContext named sc
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val people = sqlContext.jsonFile("siftoutput/")

people.printSchema()

people.registerTempTable("people")

val output = sqlContext.sql("SELECT * FROM people")

My questions:

1) How do I save the result values of a query into a list?

2) How do I calculate the variance of a column? Is there an efficient way?
3) I will be running multiple queries on the same data. Does Spark have any
way to optimize this?
4) How do I save the output as key-value pairs in a text file?

5) Is there any way I can build a kd decision tree using Spark's machine
learning libraries?

Please help.

Thanks,

Sanath
