It is also worthwhile to use temporary tables for the join query.
I can join a Hive table with a table from any other database accessed over JDBC, using DataFrames and temporary tables:

//
// Get the FACT table from Hive
//
val s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM oraclehadoop.sales")
//
// Get the Dimension table from Oracle via JDBC
//
val c = HiveContext.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
      "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM sh.channels)",
      "user" -> "sh",
      "password" -> "xxx"))
//
// Register both as temporary tables.
// Note: the join below also uses t_t (with TIME_ID and CALENDAR_MONTH_DESC);
// it has to be loaded and registered in the same way, which is not shown here.
//
s.registerTempTable("t_s")
c.registerTempTable("t_c")

And do the join

val sqltext = """
SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)
FROM
(
SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel, SUM(t_s.AMOUNT_SOLD) AS TotalSales
FROM t_s, t_t, t_c
WHERE t_s.TIME_ID = t_t.TIME_ID
AND   t_s.CHANNEL_ID = t_c.CHANNEL_ID
GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
ORDER BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
) rs
LIMIT 1000
"""
HiveContext.sql(sqltext).collect.foreach(println)

HTH

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: 15 February 2016 08:44
To: SRK <swethakasire...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: How to join an RDD with a hive table?
Have you tried creating a DataFrame from the RDD and joining it with the DataFrame that corresponds to the Hive table?

On Sun, Feb 14, 2016 at 9:53 PM, SRK <swethakasire...@gmail.com> wrote:

Hi,

How can I join an RDD with a Hive table and retrieve only the records that I am interested in? Suppose I have an RDD with 1000 records and a Hive table with 100,000 records. I should be able to join the RDD with the Hive table by an Id, and load only those 1000 records from the Hive table so that there are no memory issues. Also, I was planning on storing the data in Hive in the form of Parquet files. Any help on this is greatly appreciated.

Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-join-an-RDD-with-a-hive-table-tp26225.html
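
For what it is worth, the suggestion in the thread (turn the RDD into a DataFrame and join it with the Hive table) could look roughly like the sketch below. It is only an illustration under assumptions: the case class Wanted, the table name mydb.events and the join column id are hypothetical and do not come from the original question, and the API style assumes a Spark 1.5/1.6 HiveContext. The broadcast hint is optional; it asks Spark to ship the small DataFrame to the executors instead of shuffling the large table. The Hive table is still scanned, but only the matching rows come out of the join.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.hive.HiveContext

// Hypothetical record type for the small RDD; only the join key matters here.
case class Wanted(id: Long)

object JoinRddWithHive {
  def main(args: Array[String]): Unit = {
    // The master is expected to be supplied at submit time.
    val sc = new SparkContext(new SparkConf().setAppName("JoinRddWithHive"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Stand-in for the 1000-record RDD from the question.
    val wantedRdd = sc.parallelize(1L to 1000L).map(Wanted(_))

    // Turn the RDD into a DataFrame so it can be joined with Hive data.
    val wantedDf = wantedRdd.toDF()

    // Assumed Hive table with an `id` column (hypothetical name).
    val hiveDf = hiveContext.sql("SELECT * FROM mydb.events")

    // Inner join on `id`: only Hive rows whose id appears in the RDD are returned.
    // broadcast() is a hint that the right-hand side is small.
    val joined = hiveDf.join(broadcast(wantedDf), "id")

    // Print a sample instead of collecting everything to the driver.
    joined.show(20)

    sc.stop()
  }
}
```

The same join can also be written with temporary tables, as in the Oracle example at the top of this message: call registerTempTable on both DataFrames and run the join through HiveContext.sql.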