Thank you very much,

Two more small questions:

1) val output = sqlContext.sql("SELECT * FROM people")
My output has 128 columns and a single row.
How do I find which column has the maximum value in that row using
Scala?
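For 1), a minimal sketch in plain Scala, assuming all 128 values are numeric (the Seq below is a stand-in for the fields of the single result row):

```scala
// Stand-in for the 128 numeric fields of the single result row.
val values: Seq[Double] = Seq(3.2, 7.9, 1.4, 7.1)

// Pair each value with its column index, then pick the pair with the
// largest value.
val (maxValue, maxIndex) = values.zipWithIndex.maxBy(_._1)

println(s"column $maxIndex has the maximum value $maxValue")
```

Against the query result this would be roughly `output.first()` to get the single Row, followed by the same zipWithIndex/maxBy over its values; whether the fields need an explicit numeric conversion depends on the schema.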

2) As each row has 128 columns, how do I print each row to a text file with
space delimiting, or as JSON, using Scala?
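For 2), a sketch assuming the plain RDD API is available on the result (the path is hypothetical). The Spark part is commented out since it needs a running SparkContext; the runnable core is just the per-row formatting:

```scala
// Spark sketch (hypothetical path, needs a SparkContext):
// output.map(row => row.mkString(" ")).saveAsTextFile("siftoutput-txt/")

// The per-row formatting on its own: join the column values with spaces.
val row = Seq(1.5, 2.5, 3.5)
val line = row.mkString(" ")

println(line)
```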

Please reply.

Thanks,
Sanath


On Wed, Oct 22, 2014 at 8:24 AM, Cheng, Hao <hao.ch...@intel.com> wrote:

>  The “output” variable is actually a SchemaRDD; it provides a rich DSL
> API, see
> http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
>
>
>
> 1) How do I save the result values of a query into a list?
>
> [CH:] val list: Array[Row] = output.collect, however pulling 1M records
> into an array does not seem like a good idea.
>
>
>
> 2) How do I calculate the variance of a column? Is there an efficient way?
>
> [CH:] Not sure what that means, but you can try
> output.select('colname).groupBy(...)?
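A sketch of what the variance computation itself looks like, assuming the column is numeric; in Spark the equivalent would be roughly `output.map(_.getDouble(colIndex)).variance()` via the implicit DoubleRDDFunctions (method names to be checked against the 1.1.0 API):

```scala
// Population variance: mean of squared deviations from the mean.
def variance(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  xs.map(x => math.pow(x - mean, 2)).sum / xs.size
}

val v = variance(Seq(1.0, 2.0, 3.0, 4.0))
```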
>
>
> 3) I will be running multiple queries on the same data. Does Spark have any
> way to optimize this?
>
> [CH:] val cachedRdd = output.cache(), then do whatever you need to do based
> on cachedRdd.
>
>
> 4) How do I save the output as key-value pairs in a text file?
>
> [CH:] cachedRdd.generate(xx,xx,xx).saveAsTextFile(xx)
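A common alternative sketch for 4), assuming the first column serves as the key (path and key choice are hypothetical). The Spark call is commented out since it needs a SparkContext; the per-row formatting is the runnable part:

```scala
// Spark sketch (needs a SparkContext):
// cachedRdd.map(row => s"${row(0)}\t${row.drop(1).mkString(" ")}")
//          .saveAsTextFile("kv-output/")

// The per-row key/value formatting on its own:
val row = Seq("key1", 1, 2, 3)
val kvLine = s"${row.head}\t${row.tail.mkString(" ")}"
```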
>
>
>
>  5) Is there any way I can build a kd-tree using the machine learning
> libraries of Spark?
>
> [CH:] Sorry, I am not sure how kd-trees are used in MLlib, but keep in
> mind a SchemaRDD is just a normal RDD.
>
>
>
> Cheng Hao
>
>
>
> *From:* sanath kumar [mailto:sanath1...@gmail.com]
> *Sent:* Wednesday, October 22, 2014 12:58 PM
> *To:* user@spark.apache.org
> *Subject:* spark sql query optimization , and decision tree building
>
>
>
> Hi all,
>
>
>
> I have a large dataset in text files (1,000,000 lines). Each line has 128
> columns; each line is a feature and each column is a dimension.
>
> I have converted the text files to JSON format and am able to run SQL
> queries on the JSON files using Spark.
>
> Now I am trying to build a k-dimensional decision tree (kd-tree) with this
> large data.
>
> My steps:
> 1) Calculate the variance of each column, pick the column with maximum
> variance, make it the key of the first node, and make the mean of that
> column the value of the node.
> 2) Based on the first node's value, split the data into 2 parts and repeat
> the process until a stopping point is reached.
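The two steps above can be sketched in plain Scala on an in-memory sample (a Seq of rows stands in for the RDD; 3 columns instead of 128 for brevity):

```scala
// Hypothetical sketch of one level of the kd-tree build.
// Each inner Seq is one row; all values assumed numeric.
val data: Seq[Seq[Double]] = Seq(
  Seq(1.0, 10.0, 5.0),
  Seq(2.0, 30.0, 5.0),
  Seq(3.0, 50.0, 5.0)
)

val numCols = data.head.size
// Transpose rows into columns so per-column statistics are easy.
val columns = (0 until numCols).map(c => data.map(_(c)))

def mean(xs: Seq[Double]): Double = xs.sum / xs.size
def variance(xs: Seq[Double]): Double = {
  val m = mean(xs)
  xs.map(x => math.pow(x - m, 2)).sum / xs.size
}

// Step 1: the column with maximum variance is the split key,
// and its mean is the split value.
val splitCol = columns.indices.maxBy(c => variance(columns(c)))
val splitValue = mean(columns(splitCol))

// Step 2: partition rows on the split value; recurse on each half
// until a stopping point is reached.
val (left, right) = data.partition(_(splitCol) <= splitValue)
```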
>
> My sample code :
>
> import sqlContext._
>
> val people = sqlContext.jsonFile("siftoutput/")
>
> people.printSchema()
>
> people.registerTempTable("people")
>
> val output = sqlContext.sql("SELECT * FROM people")
>
> My Questions :
>
> 1) How do I save the result values of a query into a list?
>
> 2) How do I calculate the variance of a column? Is there an efficient way?
> 3) I will be running multiple queries on the same data. Does Spark have any
> way to optimize this?
> 4) How do I save the output as key-value pairs in a text file?
>
> 5) Is there any way I can build a kd-tree using the machine learning
> libraries of Spark?
>
> Please help.
>
> Thanks,
>
> Sanath
>
>
>
