The “output” variable is actually a SchemaRDD, which provides a rich DSL API; see http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
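As a rough sketch of what that DSL looks like (assuming Spark 1.1.0; the column name 'c0 and the filter value are made up for illustration — only the "siftoutput/" path comes from the mail below):

```scala
// Sketch against Spark 1.1.0; the column name 'c0 is hypothetical.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "schemardd-dsl-demo")
val sqlContext = new SQLContext(sc)
import sqlContext._   // brings the SchemaRDD DSL implicits into scope

val people = sqlContext.jsonFile("siftoutput/")
// DSL equivalent of: SELECT c0 FROM people WHERE c0 > 0.5
val filtered = people.where('c0 > 0.5).select('c0)
filtered.take(10).foreach(println)   // inspect a few rows instead of collecting all
```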
1) How to save the result values of a query into a list?
[CH:] val list: Array[Row] = output.collect(), however pulling 1M records into a single array does not seem like a good idea.

2) How to calculate the variance of a column? Is there an efficient way?
[CH:] Not sure what that means, but you can try output.select('colname).groupBy(...)?

3) I will be running multiple queries on the same data. Does Spark have any way to optimize this?
[CH:] val cachedRdd = output.cache(), and do whatever you need to do based on cachedRdd.

4) How to save the output as key/value pairs in a text file?
[CH:] cachedRdd.generate(xx, xx, xx).saveAsTextFile(xx)

5) Is there any way I can build a kd-tree using the machine learning libraries of Spark?
[CH:] Sorry, I am not sure how a kd-tree is used in MLlib, but keep in mind a SchemaRDD is just a normal RDD.

Cheng Hao

From: sanath kumar [mailto:sanath1...@gmail.com]
Sent: Wednesday, October 22, 2014 12:58 PM
To: user@spark.apache.org
Subject: spark sql query optimization , and decision tree building

Hi all,

I have a large dataset in text files (1,000,000 lines). Each line has 128 columns; each line is a feature and each column is a dimension.

I have converted the text files to JSON format and am able to run SQL queries on the JSON files using Spark. Now I am trying to build a k-dimensional decision tree (kd-tree) with this large data.

My steps:
1) Calculate the variance of each column, pick the column with maximum variance as the key of the first node, and take the mean of that column as the value of the node.
2) Based on the first node's value, split the data into 2 parts and repeat the process until a stopping point is reached.

My sample code:

import sqlContext._
val people = sqlContext.jsonFile("siftoutput/")
people.printSchema()
people.registerTempTable("people")
val output = sqlContext.sql("SELECT * FROM people")

My questions:
1) How to save the result values of a query into a list?
2) How to calculate the variance of a column? Is there an efficient way?
3) I will be running multiple queries on the same data. Does Spark have any way to optimize this?
4) How to save the output as key/value pairs in a text file?
5) Is there any way I can build a kd-tree using the machine learning libraries of Spark?

Please help.

Thanks,
Sanath
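[For completeness, a minimal sketch pulling answers 2–4 above together, assuming Spark 1.1.0 and the 128-column numeric schema described in the mail; the output path "kv-output/" and the use of getDouble are assumptions about the inferred schema:]

```scala
// Sketch (Spark 1.1.0): cache once, compute per-column variance, save key/value text.
// The column count (128) comes from the thread; path and accessors are hypothetical.
val output = sqlContext.sql("SELECT * FROM people")
val cached = output.cache()          // answer 3: reuse the data across queries

// Answer 2: variance of column i via the implicit DoubleRDDFunctions.variance()
def columnVariance(i: Int): Double =
  cached.map(_.getDouble(i)).variance()

// Step 1 of the kd-tree: pick the max-variance column as the split dimension
val splitDim = (0 until 128).maxBy(columnVariance)

// Answer 4: write key/value pairs as tab-separated text
cached.map(row => s"$splitDim\t${row.getDouble(splitDim)}")
      .saveAsTextFile("kv-output/")
```

Note that maxBy here makes one pass over the cached data per column; a single aggregate computing all 128 variances at once would be cheaper, but this keeps the sketch short.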