Hi all,

I have a large dataset in text files (1,000,000 lines). Each line has 128
columns; each line is a feature vector and each column is a dimension.

I have converted the text files to JSON format and am able to run SQL
queries on the JSON files using Spark.

Now I am trying to build a k-dimensional tree (kd-tree) on this large
dataset.

My steps:
1) Calculate the variance of each column, pick the column with the maximum
variance, and make it the key of the first node, with the mean of that
column as the node's value.
2) Based on the first node's value, split the data into two parts and repeat
the process on each part until a stopping condition is reached (a sketch
follows below).
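
For steps 1 and 2, this is roughly what I have in mind (only a sketch; I am
assuming MLlib's Statistics.colStats is available here, that sc is the
SparkContext, and "siftinput/" is a placeholder path for the raw text files):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Assumes each line holds 128 whitespace-separated doubles.
val data = sc.textFile("siftinput/").map(_.trim.split("\\s+").map(_.toDouble))

// One pass over the data gives per-column means and variances.
val stats = Statistics.colStats(data.map(arr => Vectors.dense(arr)))
val variances = stats.variance.toArray
val splitCol = variances.indices.maxBy(i => variances(i)) // max-variance column
val splitValue = stats.mean(splitCol)                     // mean of that column

// Split on the chosen column; recurse on each half to build the tree.
val left  = data.filter(row => row(splitCol) <  splitValue)
val right = data.filter(row => row(splitCol) >= splitValue)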

My sample code:

import sqlContext._
val people = sqlContext.jsonFile("siftoutput/")  // load the JSON files
people.printSchema()                             // inspect the inferred schema
people.registerTempTable("people")               // register for SQL queries
val output = sqlContext.sql("SELECT * FROM people")

My questions:

1) How do I save the result values of a query into a list?
2) How do I calculate the variance of a column? Is there an efficient way?
3) I will be running multiple queries on the same data. Does Spark have any
way to optimize for this?
4) How do I save the output as key-value pairs in a text file?
5) Is there any way to build the kd-tree using Spark's machine learning
libraries?
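
To make questions 1, 3 and 4 concrete, this is the kind of thing I am after
(only a sketch; the pair contents and the "siftoutput/keyvalues" path are
made up):

// Q1: collect the query results back to the driver as a local list.
val resultList = output.collect().toList

// Q3: cache the table so repeated queries reuse the in-memory data.
sqlContext.cacheTable("people")

// Q4: write (key, value) pairs as tab-separated lines in a text file.
val pairs = sc.parallelize(Seq(("splitCol", 42.0), ("splitValue", 0.5)))
pairs.map { case (k, v) => k + "\t" + v }.saveAsTextFile("siftoutput/keyvalues")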

Please help.

Thanks,

Sanath
