Re: Huge join performance issue

2013-04-27 Thread Jie Li
In order for us to understand the performance and identify the bottlenecks, could you do two things: 1) run the EXPLAIN command and share the output with us, and 2) share with us the Hadoop job histories generated by the query. They can be collected following http://www.cs.duke.edu/starfish/tutorial/jo

Re: Huge join performance issue

2013-04-08 Thread Igor Tatarinov
Did you verify that all your available mappers are running (and reducers too)? If you have a small number of partitions with huge files, you might be underutilizing mappers (check that the files are being split). Also, it might be optimal to have a single "wave" of reducers by setting the number of

Re: Huge join performance issue

2013-04-06 Thread Gabi D
Thank you for your answer, Nitin. Does anyone have additional insight into this? It would be greatly appreciated. On Thu, Apr 4, 2013 at 3:39 PM, Nitin Pawar wrote: > you don't really need subqueries to join the tables which have common > columns. It's an additional overhead > best way to filter your

Re: Huge join performance issue

2013-04-04 Thread Nitin Pawar
You don't really need subqueries to join tables which have common columns. It's an additional overhead. The best way to filter your data and speed up your data processing is how you lay out your data. When you have larger tables, I would use partitioning and bucketing to trim down the data and improve the

Huge join performance issue

2013-04-04 Thread Gabi D
Hi all, I have two tables I need to join and then summarize. They are both huge (about 1B rows each, in the relevant partitions) and the query runs for over 2 hours, creating 5T of intermediate data. The current query looks like this: select t1.b,t1.c,t2.d,t2.e, count(*) from (select a,b,c from ta