Thank you very much. Two more small questions:
1) val output = sqlContext.sql("SELECT * From people") -- my output has 128
columns and a single row. How do I find which column has the maximum value
in that row, using Scala?

2) As each row has 128 columns, how do I print each row to a text file with
space delimiting, or as JSON, using Scala?

Please reply.

Thanks,
Sanath

On Wed, Oct 22, 2014 at 8:24 AM, Cheng, Hao <hao.ch...@intel.com> wrote:

> The "output" variable is actually a SchemaRDD; it provides lots of DSL
> APIs, see
> http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
>
> 1) How to save the result values of a query into a list?
>
> [CH:] val list: Array[Row] = output.collect, however pulling 1M records
> into an array does not seem like a good idea.
>
> 2) How to calculate the variance of a column? Is there any efficient way?
>
> [CH:] Not sure what that means, but you can try
> output.select('colname).groupBy ?
>
> 3) I will be running multiple queries on the same data. Does Spark have
> any way to optimize this?
>
> [CH:] val cachedRdd = output.cache(), and do whatever you need to do based
> on cachedRdd.
>
> 4) How to save the output as key-value pairs in a text file?
>
> [CH:] cachedRdd.generate(xx, xx, xx).saveAsTextFile(xx)
>
> 5) Is there any way I can build a decision kd-tree using the machine
> learning libraries of Spark?
>
> [CH:] Sorry, I am not sure how a kd-tree would be used in MLlib, but keep
> in mind that a SchemaRDD is just a normal RDD.
>
> Cheng Hao
>
> *From:* sanath kumar [mailto:sanath1...@gmail.com]
> *Sent:* Wednesday, October 22, 2014 12:58 PM
> *To:* user@spark.apache.org
> *Subject:* spark sql query optimization , and decision tree building
>
> Hi all,
>
> I have large data in text files (1,000,000 lines). Each line has 128
> columns. Here each line is a feature and each column is a dimension.
>
> I have converted the txt files to JSON format and am able to run SQL
> queries on the JSON files using Spark.
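The two follow-up questions at the top of this message (finding the column with the maximum value in the single result row, and writing a row out space-delimited or as JSON) can be sketched in plain Scala. This is an untested sketch: it assumes the row's 128 values have already been extracted as a Seq of (columnName, value) pairs, e.g. from `output.collect()` on the SchemaRDD; the column names below are hypothetical.

```scala
// Sketch: per-row column operations, given the row as (name, value) pairs.
object RowFormatting {
  // The (name, value) pair of the column holding the maximum value.
  def maxColumn(row: Seq[(String, Double)]): (String, Double) =
    row.maxBy(_._2)

  // Space-delimited rendering of the row's values.
  def toSpaceDelimited(row: Seq[(String, Double)]): String =
    row.map(_._2).mkString(" ")

  // Minimal hand-rolled JSON object (adequate for numeric values only).
  def toJson(row: Seq[(String, Double)]): String =
    row.map { case (k, v) => s""""$k":$v""" }.mkString("{", ",", "}")

  def main(args: Array[String]): Unit = {
    val row = Seq("c1" -> 0.5, "c2" -> 3.2, "c3" -> 1.1)
    println(maxColumn(row))        // (c2,3.2)
    println(toSpaceDelimited(row)) // 0.5 3.2 1.1
    println(toJson(row))           // {"c1":0.5,"c2":3.2,"c3":1.1}
  }
}
```

To write the formatted rows to text files, the resulting strings can be saved with the RDD's `saveAsTextFile`, as Cheng Hao's answer 4 suggests.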
> Now I am trying to build a k-dimensional decision tree (kd-tree) with this
> large data.
>
> My steps:
>
> 1) Calculate the variance of each column, pick the column with the maximum
> variance and make it the key of the first node, with the mean of that
> column as the value of the node.
>
> 2) Based on the first node's value, split the data into two parts, and
> repeat the process until a stopping point is reached.
>
> My sample code:
>
> import sqlContext._
>
> val people = sqlContext.jsonFile("siftoutput/")
>
> people.printSchema()
>
> people.registerTempTable("people")
>
> val output = sqlContext.sql("SELECT * From people")
>
> My questions:
>
> 1) How to save the result values of a query into a list?
>
> 2) How to calculate the variance of a column? Is there any efficient way?
>
> 3) I will be running multiple queries on the same data. Does Spark have
> any way to optimize this?
>
> 4) How to save the output as key-value pairs in a text file?
>
> 5) Is there any way I can build a decision kd-tree using the machine
> learning libraries of Spark?
>
> Please help.
>
> Thanks,
>
> Sanath
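The kd-tree split step described in the quoted email (pick the column with maximum variance, split on its mean) can be sketched in plain Scala. This is a local, non-distributed sketch for clarity; on the full 1M-row dataset the same arithmetic could instead be pushed into SQL (variance recovered as E[x^2] - E[x]^2 from something like `SELECT AVG(c1), AVG(c1 * c1) FROM people`, one aggregate pair per column), though the exact query shape is an assumption, not tested here.

```scala
// Sketch of the split-column selection for the kd-tree build:
// compute each column's variance, pick the column with the largest
// variance, and use that column's mean as the split value.
object SplitColumn {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population variance: E[(x - mean)^2].
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // rows: each inner Seq holds the 128 column values of one line.
  // Returns (index of the max-variance column, mean of that column).
  def pickSplit(rows: Seq[Seq[Double]]): (Int, Double) = {
    val columns = rows.transpose
    val idx = columns.indices.maxBy(i => variance(columns(i)))
    (idx, mean(columns(idx)))
  }
}
```

Splitting the data then becomes two `filter` passes on the (cached) RDD, one for values below the split mean and one for values at or above it, recursing on each half.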