SQL is often the best choice here, even in the spark-shell.

Look at this example using SQL:

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

println("\nStarted at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

HiveContext.sql("use oraclehadoop")

println("\ncreating data set at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

val rs = HiveContext.sql(
"""
SELECT    t.calendar_month_desc
        , c.channel_desc
        , SUM(s.amount_sold) AS TotalSales
FROM sales s
INNER JOIN times t
ON s.time_id = t.time_id
INNER JOIN channels c
ON s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc
""")
rs.registerTempTable("tmp")

println("\nfirst query at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

HiveContext.sql("""
SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
FROM tmp
ORDER BY MONTH, CHANNEL LIMIT 5
""").collect.foreach(println)

println("\nsecond query at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

HiveContext.sql("""
SELECT channel_desc AS CHANNEL, MAX(TotalSales) AS SALES
FROM tmp
GROUP BY channel_desc
ORDER BY SALES DESC LIMIT 5
""").collect.foreach(println)

println("\nFinished at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

Now the same logic using the DataFrame (functional programming) API:

import org.apache.spark.sql.functions.{sum, max, desc}

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

println("\nStarted at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

HiveContext.sql("use oraclehadoop")

val s = HiveContext.table("sales").select("AMOUNT_SOLD", "TIME_ID", "CHANNEL_ID")
val c = HiveContext.table("channels").select("CHANNEL_ID", "CHANNEL_DESC")
val t = HiveContext.table("times").select("TIME_ID", "CALENDAR_MONTH_DESC")

println("\ncreating data set at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

val rs = s.join(t, "time_id")
          .join(c, "channel_id")
          .groupBy("calendar_month_desc", "channel_desc")
          .agg(sum("amount_sold").as("TotalSales"))

println("\nfirst query at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

rs.orderBy("calendar_month_desc", "channel_desc").take(5).foreach(println)

println("\nsecond query at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

rs.groupBy("channel_desc")
  .agg(max("TotalSales").as("SALES"))
  .sort(desc("SALES"))
  .take(5).foreach(println)

println("\nFinished at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
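For the query quoted in the original message below, a DataFrame version along the same lines might look something like this. This is only an untested sketch: the table and column names are taken from the quoted SQL, the T-SQL temp table #Holdouts is assumed to be registered in Spark as a temp table named "Holdouts", and the GROUP BY (which lists every non-literal output column with no aggregate) is treated as a DISTINCT.

```scala
// Sketch only, assuming the spark-shell HiveContext created above, and
// that #Holdouts has been registered as a temp table named "Holdouts".
import org.apache.spark.sql.functions.lit
import HiveContext.implicits._   // for the $"col" syntax

val a = HiveContext.table("PERSON_MASTER_VIEW").as("a")
val b = HiveContext.table("PERSON_ADDRESS_HH_KEYS").as("b")
val c = HiveContext.table("Holdouts").as("c")
val d = HiveContext.table("Holdouts").as("d")

val result = a
  .join(b, $"a.PERSON_ID" === $"b.PERSON_ID")                        // INNER JOIN
  .join(c, $"a.PERSON_ID" === $"c.RETAINED_PERSON_ID", "left_outer") // LEFT JOIN
  .join(d, $"b.HHID" === $"d.HHID")                                  // INNER JOIN
  .where($"c.RETAINED_PERSON_ID".isNull && $"a.PERSONTYPE".isNotNull)
  .select($"a.PERSON_ID".as("RETAINED_PERSON_ID"),
          $"a.PERSON_ID", $"a.PERSONTYPE",
          lit("y").as("HOLDOUT"),
          $"d.LOCATION", $"b.HHID",
          $"a.AGE_OUTPUT".as("AGE"), $"a.FIRST_NAME",
          $"d.STARTDATE", $"d.ENDDATE",
          lit("Not In Campaign").as("HH_TYPE"))
  .distinct()   // stands in for the GROUP BY with no aggregates
```

The left_outer join followed by the isNull filter reproduces the anti-join ("rows of a with no match in #Holdouts") from the WHERE clause.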

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 3 March 2016 at 22:08, Mohammed Guller <moham...@glassbeam.com> wrote:

> Why not use Spark SQL?
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Vikash Kumar [mailto:vikashsp...@gmail.com]
> *Sent:* Wednesday, March 2, 2016 8:29 PM
> *To:* user@spark.apache.org
> *Subject:* convert SQL multiple Join in Spark
>
>
>
> I have to convert the SQL query below into Spark/Scala. Can anybody
> suggest how to implement this in Spark?
>
> SELECT a.PERSON_ID as RETAINED_PERSON_ID,
>        a.PERSON_ID,
>        a.PERSONTYPE,
>        'y' as HOLDOUT,
>        d.LOCATION,
>        b.HHID,
>        a.AGE_OUTPUT as AGE,
>        a.FIRST_NAME,
>        d.STARTDATE,
>        d.ENDDATE,
>        'Not In Campaign' as HH_TYPE
> FROM PERSON_MASTER_VIEW a
>     INNER JOIN PERSON_ADDRESS_HH_KEYS b
>         on a.PERSON_ID = b.PERSON_ID
>     LEFT JOIN #Holdouts c
>         on a.PERSON_ID = c.RETAINED_PERSON_ID
>     INNER JOIN #Holdouts d
>         on b.HHID = d.HHID
> WHERE c.RETAINED_PERSON_ID IS NULL and a.PERSONTYPE IS NOT NULL
> GROUP BY a.PERSON_ID, a.PERSONTYPE, b.HHID, a.AGE_OUTPUT, a.FIRST_NAME,
>          d.LOCATION, d.STARTDATE, d.ENDDATE
>
>
