In order for us to understand the performance and identify the bottlenecks, could you do two things:
1) run the EXPLAIN command and share with us the output 2) share with us the hadoop job histories generated by the query. They can be collected following http://www.cs.duke.edu/starfish/tutorial/job_history.html Jie On Mon, Apr 8, 2013 at 11:39 AM, Igor Tatarinov <i...@decide.com> wrote: > Did you verify that all your available mappers are running (and reducers > too)? If you have a small number of partitions with huge files, you might > me underutilizing mappers (check that the files are being split). Also, it > might be optimal to have a single "wave" of reducers by setting the number > of reduce tasks appropriately. > > You might also consider optimizing a simpler query first: > > select t1.a, count(*) > from (select a from table baseTB1 where ... ) t1 -- filter by partition > as well > join > (select a from baseTB2 where ...) t2 -- filter by partition as > well > on t1.a=t2.a > group by t1.a > > just to give you an idea how much overhead the extra columns are adding. > If the columns are pretty big they could be causing the slowdown. > > igor > decide.com > > > On Sat, Apr 6, 2013 at 2:30 PM, Gabi D <gabi...@gmail.com> wrote: > >> Thank you for your answer Nitin. >> Does anyone have additional insight into this? will be >> greatly appreciated. >> >> >> On Thu, Apr 4, 2013 at 3:39 PM, Nitin Pawar <nitinpawar...@gmail.com>wrote: >> >>> you dont really need subqueries to join the tables which have common >>> columns. Its an additional overhead >>> best way to filter your data and speed up your data processing is how >>> you layout your data >>> When you have larger table I will use partitioning and bucketing to trim >>> down the data and improve the performances over joins >>> >>> distribute by is mainly used when you have your custom map reduce >>> scripts and you want to use transform functionality in hive. I have not >>> used it a lot so not sure on that part. also its helpful to write where >>> clauses in join statements to reduce the dataset you want to join. >>> >>> >>> >>> On Thu, Apr 4, 2013 at 5:53 PM, Gabi D <gabi...@gmail.com> wrote: >>> >>>> Hi all, >>>> I have two tables I need to join and then summarize. >>>> They are both huge (about 1B rows each, in the relevant partitions) and >>>> the query runs for over 2 hours creating 5T intermediate data. >>>> >>>> The current query looks like this: >>>> >>>> select t1.b,t1.c,t2.d,t2.e, count(*) >>>> from (select a,b,c from table baseTB1 where ... ) t1 -- filter by >>>> partition as well >>>> join >>>> (select a,d,e from baseTB2 where ...) t2 -- filter by >>>> partition as well >>>> on t1.a=t2.a >>>> group by t1.b,t1.c,t2.d,t2.e >>>> >>>> >>>> two questions: >>>> 1. would joining baseTB1 and baseTB2 directly (instead of subqueries) >>>> be better in any way? >>>> (I know subqueries cause a lot of writes of the intermediate >>>> data but we also understand it's best to filter down the data that is being >>>> joined, which is "more" correct?) >>>> 2. can I use 'distribute by ' and/or 'sort by' in some way that would >>>> help this? my understanding at the moment is that the problem lies in the >>>> fact >>>> that the reduces are on column a while the group by is on column b ... >>>> >>>> Any thoughts would be appreciated. >>>> >>>> >>> >>> >>> -- >>> Nitin Pawar >>> >> >> >