OK. Would it query Hive only for the records I want, as per the filter, or would it load the entire table? My user table will have millions of records, and I do not want to cause OOM errors by loading the entire table into memory.
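For what it's worth, when the small key set is joined against the big table as a broadcast join, only the matching rows of the big table survive the join, and filters on a Parquet-backed DataFrame are pushed down to the scan rather than evaluated after a full load. A minimal local sketch (Spark 1.x-era API to match this thread; the data here is generated in-memory as a stand-in for the real Hive table, and all names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.broadcast

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("join-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Stand-in for the large Hive table; in practice this would be
// sqlContext.table("users") over Parquet files.
val users = sc.parallelize(1L to 100000L)
  .map(i => (i, s"user$i")).toDF("id", "name")

// The small keyed side, e.g. built from your RDD of ids.
val ids = sc.parallelize(Seq(42L, 99L)).toDF("id")

// Broadcasting the small side avoids shuffling the large table; only
// rows whose id appears in `ids` come out of the join.
val wanted = users.join(broadcast(ids), "id")
wanted.explain() // inspect the plan for the broadcast join / pushed filters
```

Running `wanted.count()` on this sketch yields 2, i.e. only the matching rows, not the 100,000-row table.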
On Mon, Feb 15, 2016 at 12:51 AM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Also worthwhile using temporary tables for the join query.
>
> I can join a Hive table with any other JDBC-accessed table from any other
> database, using DataFrames and temporary tables:
>
> //
> // Get the FACT table from Hive
> //
> var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM oraclehadoop.sales")
>
> //
> // Get the dimension table from Oracle via JDBC
> //
> val c = HiveContext.load("jdbc",
>   Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>     "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM sh.channels)",
>     "user" -> "sh",
>     "password" -> "xxx"))
>
> s.registerTempTable("t_s")
> c.registerTempTable("t_c")
> // (The time dimension t_t used below must be loaded and registered
> // as a temp table in the same way.)
>
> And do the join:
>
> val sqltext = """
> SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)
> FROM
> (
>   SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel,
>          SUM(t_s.AMOUNT_SOLD) AS TotalSales
>   FROM t_s, t_t, t_c
>   WHERE t_s.TIME_ID = t_t.TIME_ID
>   AND t_s.CHANNEL_ID = t_c.CHANNEL_ID
>   GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>   ORDER BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
> ) rs
> LIMIT 1000
> """
>
> HiveContext.sql(sqltext).collect.foreach(println)
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> From: Ted Yu [mailto:yuzhih...@gmail.com]
> Sent: 15 February 2016 08:44
> To: SRK <swethakasire...@gmail.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: How to join an RDD with a hive table?
>
> Have you tried creating a DataFrame from the RDD and joining it with the
> DataFrame that corresponds to the Hive table?
>
> On Sun, Feb 14, 2016 at 9:53 PM, SRK <swethakasire...@gmail.com> wrote:
>
>> Hi,
>>
>> How do I join an RDD with a Hive table and retrieve only the records I am
>> interested in? Suppose I have an RDD with 1,000 records and a Hive table
>> with 100,000 records. I should be able to join the RDD with the Hive table
>> by an Id, and to load only those 1,000 records from the Hive table so that
>> there are no memory issues. Also, I am planning to store the data in Hive
>> as Parquet files. Any help on this is greatly appreciated.
>>
>> Thanks!
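Ted's suggestion above — turn the RDD into a DataFrame, register both sides as temp tables, and join in SQL — can be sketched as follows. This mirrors the thread's Spark 1.x style (`registerTempTable`, `SQLContext`); the generated `t_users` table is a stand-in for a real Hive table, and all names and data are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("rdd-hive-join"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// The 1,000-record RDD of ids, converted to a DataFrame and registered.
val idRdd = sc.parallelize(1L to 1000L)
idRdd.toDF("id").registerTempTable("t_ids")

// Stand-in for the 100,000-row Hive table; with a HiveContext this would
// be sqlContext.table("users") over Parquet files.
sc.parallelize(1L to 100000L).map(i => (i, s"user$i")).toDF("id", "name")
  .registerTempTable("t_users")

// The join produces only rows whose id appears in t_ids.
val joined = sqlContext.sql(
  "SELECT u.id, u.name FROM t_users u JOIN t_ids i ON u.id = i.id")
println(joined.count()) // 1000
```

Only the 1,000 matching rows come out of the join, which addresses the original question; whether the Parquet scan itself is pruned depends on partitioning and pushed-down filters, so check `joined.explain()` on real data.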