Hi all, I have a large dataset in text files (1,000,000 lines). Each line has 128 columns; each line is a feature vector and each column is a dimension.
I have converted the text files to JSON format and can run SQL queries on the JSON files using Spark. Now I am trying to build a k-dimensional tree (kd-tree) from this large dataset.

My steps:
1) Calculate the variance of each column, pick the column with the maximum variance, and make it the key of the first node, with the mean of that column as the node's value.
2) Split the data into two parts based on the first node's value, and repeat the process on each part until a stopping condition is reached.

My sample code:

import sqlContext._
val people = sqlContext.jsonFile("siftoutput/")
people.printSchema()
people.registerTempTable("people")
val output = sqlContext.sql("SELECT * FROM people")

My questions:
1) How do I save the result values of a query into a list?
2) How do I calculate the variance of a column? Is there an efficient way?
3) I will be running multiple queries on the same data. Does Spark have any way to optimize this?
4) How do I save the output as key-value pairs in a text file?
5) Is there any way I can build the kd-tree using Spark's machine learning libraries?

Please help.

Thanks,
Sanath
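To make the intended algorithm concrete, here is a minimal sketch of steps 1) and 2) in plain Scala (no Spark). All names here (`KdSplit`, `splitOnce`, `build`, `leafSize`) are my own for illustration; on the full 1,000,000-row dataset the per-column variance and mean would have to be computed with Spark rather than on a driver-side `Seq`.

```scala
// Hypothetical driver-side sketch of the described kd-tree build:
// pick the max-variance column, split at its mean, recurse.
object KdSplit {
  sealed trait KdNode
  case class Leaf(points: Seq[Array[Double]]) extends KdNode
  case class Split(axis: Int, value: Double, left: KdNode, right: KdNode) extends KdNode

  // Population variance of one column.
  def variance(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.length
    xs.map(x => (x - mean) * (x - mean)).sum / xs.length
  }

  // Step 1: pick the column with maximum variance.
  // Step 2: split the rows at that column's mean.
  def splitOnce(rows: Seq[Array[Double]])
      : (Int, Double, Seq[Array[Double]], Seq[Array[Double]]) = {
    val dims = rows.head.length
    val variances = (0 until dims).map(d => variance(rows.map(_(d))))
    val axis = variances.indexOf(variances.max)
    val mean = rows.map(_(axis)).sum / rows.length
    val (left, right) = rows.partition(_(axis) < mean)
    (axis, mean, left, right)
  }

  // Repeat until a node is small enough -- one possible stopping condition.
  def build(rows: Seq[Array[Double]], leafSize: Int = 1): KdNode =
    if (rows.length <= leafSize) Leaf(rows)
    else {
      val (axis, mean, left, right) = splitOnce(rows)
      if (left.isEmpty || right.isEmpty) Leaf(rows) // all values equal on this axis
      else Split(axis, mean, build(left, leafSize), build(right, leafSize))
    }
}
```

The node stores the chosen column index as the key and the column mean as the value, matching step 1); the `leafSize` guard is one assumption about what "until you reach a point" means.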