It is best to use SQL here, even in the spark-shell. Look at this example using SQL:
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

println("\nStarted at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

HiveContext.sql("use oraclehadoop")

println("\ncreating data set at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

val rs = HiveContext.sql(
  """
  SELECT t.calendar_month_desc
       , c.channel_desc
       , SUM(s.amount_sold) AS TotalSales
  FROM sales s
  INNER JOIN times t ON s.time_id = t.time_id
  INNER JOIN channels c ON s.channel_id = c.channel_id
  GROUP BY t.calendar_month_desc, c.channel_desc
  """)
rs.registerTempTable("tmp")

println("\nfirst query at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
HiveContext.sql(
  """
  SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
  FROM tmp
  ORDER BY MONTH, CHANNEL
  LIMIT 5
  """).collect.foreach(println)

println("\nsecond query at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
HiveContext.sql(
  """
  SELECT channel_desc AS CHANNEL, MAX(TotalSales) AS SALES
  FROM tmp
  GROUP BY channel_desc
  ORDER BY SALES DESC
  LIMIT 5
  """).collect.foreach(println)

println("\nFinished at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

Now the same code using the functional (DataFrame) API:

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

println("\nStarted at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

HiveContext.sql("use oraclehadoop")

val s = HiveContext.table("sales").select("AMOUNT_SOLD", "TIME_ID", "CHANNEL_ID")
val c = HiveContext.table("channels").select("CHANNEL_ID", "CHANNEL_DESC")
val t = HiveContext.table("times").select("TIME_ID", "CALENDAR_MONTH_DESC")

println("\ncreating data set at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

val rs = s.join(t, "time_id").join(c, "channel_id")
          .groupBy("calendar_month_desc", "channel_desc")
          .agg(sum("amount_sold").as("TotalSales"))

println("\nfirst query at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
rs.orderBy("calendar_month_desc", "channel_desc").take(5).foreach(println)

println("\nsecond query at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
rs.groupBy("channel_desc").agg(max("TotalSales").as("SALES"))
  .sort(desc("SALES"))
  .take(5).foreach(println)

println("\nFinished at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 3 March 2016 at 22:08, Mohammed Guller <moham...@glassbeam.com> wrote:

> Why not use Spark SQL?
>
> Mohammed
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
> *From:* Vikash Kumar [mailto:vikashsp...@gmail.com]
> *Sent:* Wednesday, March 2, 2016 8:29 PM
> *To:* user@spark.apache.org
> *Subject:* convert SQL multiple Join in Spark
>
> I have to write or convert the SQL query below into Spark/Scala. Can
> anybody suggest how to implement this in Spark?
>
> SELECT a.PERSON_ID as RETAINED_PERSON_ID,
>        a.PERSON_ID,
>        a.PERSONTYPE,
>        'y' as HOLDOUT,
>        d.LOCATION,
>        b.HHID,
>        a.AGE_OUTPUT as AGE,
>        a.FIRST_NAME,
>        d.STARTDATE,
>        d.ENDDATE,
>        'Not In Campaign' as HH_TYPE
> FROM PERSON_MASTER_VIEW a
> INNER JOIN PERSON_ADDRESS_HH_KEYS b
>     on a.PERSON_ID = b.PERSON_ID
> LEFT JOIN #Holdouts c
>     on a.PERSON_ID = c.RETAINED_PERSON_ID
> INNER JOIN #Holdouts d
>     on b.HHID = d.HHID
> WHERE c.RETAINED_PERSON_ID IS NULL and a.PERSONTYPE IS NOT NULL
> GROUP BY a.PERSON_ID, a.PERSONTYPE, b.HHID, a.AGE_OUTPUT, a.FIRST_NAME,
>          d.LOCATION, d.STARTDATE, d.ENDDATE
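One note on the query above: the `LEFT JOIN #Holdouts c ... WHERE c.RETAINED_PERSON_ID IS NULL` pattern is an anti-join, i.e. keep only persons with no match in the holdout set. Below is a minimal sketch of that filtering logic in plain Scala with hypothetical sample data (the `Person` case class and `notInHoldout` helper are mine, not from the thread); in the DataFrame API the same shape would be a `left_outer` join followed by a filter on the right side's key being null.

```scala
// Anti-join sketch: persons NOT present in the holdout set.
// Mirrors: LEFT JOIN #Holdouts c ON a.PERSON_ID = c.RETAINED_PERSON_ID
//          WHERE c.RETAINED_PERSON_ID IS NULL AND a.PERSONTYPE IS NOT NULL
case class Person(id: Int, personType: String)

object AntiJoinSketch {
  // Keep persons whose id does not appear among the holdout ids
  // and whose personType is set.
  def notInHoldout(persons: Seq[Person], holdoutIds: Set[Int]): Seq[Person] =
    persons.filter(p => !holdoutIds.contains(p.id) && p.personType != null)
}

// Usage with made-up data: person 3 is in the holdout set, person 2 has
// no type, so only person 1 survives.
// AntiJoinSketch.notInHoldout(
//   Seq(Person(1, "Retail"), Person(2, null), Person(3, "Online")), Set(3))
```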