Thank you for your answer Nitin. Does anyone have additional insight into this? will be greatly appreciated.
On Thu, Apr 4, 2013 at 3:39 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote: > you dont really need subqueries to join the tables which have common > columns. Its an additional overhead > best way to filter your data and speed up your data processing is how you > layout your data > When you have larger table I will use partitioning and bucketing to trim > down the data and improve the performances over joins > > distribute by is mainly used when you have your custom map reduce scripts > and you want to use transform functionality in hive. I have not used it a > lot so not sure on that part. also its helpful to write where clauses in > join statements to reduce the dataset you want to join. > > > > On Thu, Apr 4, 2013 at 5:53 PM, Gabi D <gabi...@gmail.com> wrote: > >> Hi all, >> I have two tables I need to join and then summarize. >> They are both huge (about 1B rows each, in the relevant partitions) and >> the query runs for over 2 hours creating 5T intermediate data. >> >> The current query looks like this: >> >> select t1.b,t1.c,t2.d,t2.e, count(*) >> from (select a,b,c from table baseTB1 where ... ) t1 -- filter by >> partition as well >> join >> (select a,d,e from baseTB2 where ...) t2 -- filter by >> partition as well >> on t1.a=t2.a >> group by t1.b,t1.c,t2.d,t2.e >> >> >> two questions: >> 1. would joining baseTB1 and baseTB2 directly (instead of subqueries) be >> better in any way? >> (I know subqueries cause a lot of writes of the intermediate >> data but we also understand it's best to filter down the data that is being >> joined, which is "more" correct?) >> 2. can I use 'distribute by ' and/or 'sort by' in some way that would >> help this? my understanding at the moment is that the problem lies in the >> fact >> that the reduces are on column a while the group by is on column b ... >> >> Any thoughts would be appreciated. >> >> > > > -- > Nitin Pawar >