Re: Huge join performance issue

2013-04-27 Thread Jie Li
In order for us to understand the performance and identify the bottlenecks,
could you do two things:

1) run the EXPLAIN command and share with us the output
2) share with us the hadoop job histories generated by the query. They can
be collected following
http://www.cs.duke.edu/starfish/tutorial/job_history.html

Jie




On Mon, Apr 8, 2013 at 11:39 AM, Igor Tatarinov i...@decide.com wrote:

 Did you verify that all your available mappers are running (and reducers
 too)? If you have a small number of partitions with huge files, you might
 me underutilizing mappers (check that the files are being split). Also, it
 might be optimal to have a single wave of reducers by setting the number
 of reduce tasks appropriately.

 You might also consider optimizing a simpler query first:

 select t1.a, count(*)
 from (select a from table baseTB1 where ... ) t1  -- filter by partition
 as well
   join
 (select a  from baseTB2 where ...) t2-- filter by partition as
 well
 on t1.a=t2.a
 group by t1.a

 just to give you an idea how much overhead the extra columns are adding.
 If the columns are pretty big they could be causing the slowdown.

 igor
 decide.com


 On Sat, Apr 6, 2013 at 2:30 PM, Gabi D gabi...@gmail.com wrote:

 Thank you for your answer Nitin.
 Does anyone have additional insight into this? will be
 greatly appreciated.


 On Thu, Apr 4, 2013 at 3:39 PM, Nitin Pawar nitinpawar...@gmail.comwrote:

 you dont really need subqueries to join the tables which have common
 columns. Its an additional overhead
 best way to filter your data and speed up your data processing is how
 you layout your data
 When you have larger table I will use partitioning and bucketing to trim
 down the data and improve the performances over joins

 distribute by is mainly used when you have your custom map reduce
 scripts and you want to use transform functionality in hive. I have not
 used it a lot so not sure on that part. also its helpful to write where
 clauses in join statements to reduce the dataset you want to join.



 On Thu, Apr 4, 2013 at 5:53 PM, Gabi D gabi...@gmail.com wrote:

 Hi all,
 I have two tables I need to join and then summarize.
 They are both huge (about 1B rows each, in the relevant partitions) and
 the query runs for over 2 hours creating 5T intermediate data.

 The current query looks like this:

 select t1.b,t1.c,t2.d,t2.e, count(*)
 from (select a,b,cfrom table baseTB1 where ... ) t1  -- filter by
 partition as well
   join
 (select a,d,e from baseTB2 where ...) t2-- filter by
 partition as well
 on t1.a=t2.a
 group by t1.b,t1.c,t2.d,t2.e


 two questions:
 1. would joining baseTB1 and baseTB2 directly (instead of subqueries)
 be better in any way?
   (I know subqueries cause a lot of writes of the intermediate
 data but we also understand it's best to filter down the data that is being
 joined, which is more correct?)
 2. can I use 'distribute by ' and/or 'sort by' in some way that would
 help this? my understanding at the moment is that the problem lies in the
 fact
 that the reduces are on column a while the group by is on column b ...

 Any thoughts would be appreciated.




 --
 Nitin Pawar






Re: Huge join performance issue

2013-04-08 Thread Igor Tatarinov
Did you verify that all your available mappers are running (and reducers
too)? If you have a small number of partitions with huge files, you might
me underutilizing mappers (check that the files are being split). Also, it
might be optimal to have a single wave of reducers by setting the number
of reduce tasks appropriately.

You might also consider optimizing a simpler query first:

select t1.a, count(*)
from (select a from table baseTB1 where ... ) t1  -- filter by partition as
well
  join
(select a  from baseTB2 where ...) t2-- filter by partition as
well
on t1.a=t2.a
group by t1.a

just to give you an idea how much overhead the extra columns are adding. If
the columns are pretty big they could be causing the slowdown.

igor
decide.com


On Sat, Apr 6, 2013 at 2:30 PM, Gabi D gabi...@gmail.com wrote:

 Thank you for your answer Nitin.
 Does anyone have additional insight into this? will be greatly appreciated.


 On Thu, Apr 4, 2013 at 3:39 PM, Nitin Pawar nitinpawar...@gmail.comwrote:

 you dont really need subqueries to join the tables which have common
 columns. Its an additional overhead
 best way to filter your data and speed up your data processing is how you
 layout your data
 When you have larger table I will use partitioning and bucketing to trim
 down the data and improve the performances over joins

 distribute by is mainly used when you have your custom map reduce scripts
 and you want to use transform functionality in hive. I have not used it a
 lot so not sure on that part. also its helpful to write where clauses in
 join statements to reduce the dataset you want to join.



 On Thu, Apr 4, 2013 at 5:53 PM, Gabi D gabi...@gmail.com wrote:

 Hi all,
 I have two tables I need to join and then summarize.
 They are both huge (about 1B rows each, in the relevant partitions) and
 the query runs for over 2 hours creating 5T intermediate data.

 The current query looks like this:

 select t1.b,t1.c,t2.d,t2.e, count(*)
 from (select a,b,cfrom table baseTB1 where ... ) t1  -- filter by
 partition as well
   join
 (select a,d,e from baseTB2 where ...) t2-- filter by
 partition as well
 on t1.a=t2.a
 group by t1.b,t1.c,t2.d,t2.e


 two questions:
 1. would joining baseTB1 and baseTB2 directly (instead of subqueries)
 be better in any way?
   (I know subqueries cause a lot of writes of the intermediate
 data but we also understand it's best to filter down the data that is being
 joined, which is more correct?)
 2. can I use 'distribute by ' and/or 'sort by' in some way that would
 help this? my understanding at the moment is that the problem lies in the
 fact
 that the reduces are on column a while the group by is on column b ...

 Any thoughts would be appreciated.




 --
 Nitin Pawar





Re: Huge join performance issue

2013-04-06 Thread Gabi D
Thank you for your answer Nitin.
Does anyone have additional insight into this? will be greatly appreciated.


On Thu, Apr 4, 2013 at 3:39 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 you dont really need subqueries to join the tables which have common
 columns. Its an additional overhead
 best way to filter your data and speed up your data processing is how you
 layout your data
 When you have larger table I will use partitioning and bucketing to trim
 down the data and improve the performances over joins

 distribute by is mainly used when you have your custom map reduce scripts
 and you want to use transform functionality in hive. I have not used it a
 lot so not sure on that part. also its helpful to write where clauses in
 join statements to reduce the dataset you want to join.



 On Thu, Apr 4, 2013 at 5:53 PM, Gabi D gabi...@gmail.com wrote:

 Hi all,
 I have two tables I need to join and then summarize.
 They are both huge (about 1B rows each, in the relevant partitions) and
 the query runs for over 2 hours creating 5T intermediate data.

 The current query looks like this:

 select t1.b,t1.c,t2.d,t2.e, count(*)
 from (select a,b,cfrom table baseTB1 where ... ) t1  -- filter by
 partition as well
   join
 (select a,d,e from baseTB2 where ...) t2-- filter by
 partition as well
 on t1.a=t2.a
 group by t1.b,t1.c,t2.d,t2.e


 two questions:
 1. would joining baseTB1 and baseTB2 directly (instead of subqueries) be
 better in any way?
   (I know subqueries cause a lot of writes of the intermediate
 data but we also understand it's best to filter down the data that is being
 joined, which is more correct?)
 2. can I use 'distribute by ' and/or 'sort by' in some way that would
 help this? my understanding at the moment is that the problem lies in the
 fact
 that the reduces are on column a while the group by is on column b ...

 Any thoughts would be appreciated.




 --
 Nitin Pawar



Re: Huge join performance issue

2013-04-04 Thread Nitin Pawar
you dont really need subqueries to join the tables which have common
columns. Its an additional overhead
best way to filter your data and speed up your data processing is how you
layout your data
When you have larger table I will use partitioning and bucketing to trim
down the data and improve the performances over joins

distribute by is mainly used when you have your custom map reduce scripts
and you want to use transform functionality in hive. I have not used it a
lot so not sure on that part. also its helpful to write where clauses in
join statements to reduce the dataset you want to join.



On Thu, Apr 4, 2013 at 5:53 PM, Gabi D gabi...@gmail.com wrote:

 Hi all,
 I have two tables I need to join and then summarize.
 They are both huge (about 1B rows each, in the relevant partitions) and
 the query runs for over 2 hours creating 5T intermediate data.

 The current query looks like this:

 select t1.b,t1.c,t2.d,t2.e, count(*)
 from (select a,b,cfrom table baseTB1 where ... ) t1  -- filter by
 partition as well
   join
 (select a,d,e from baseTB2 where ...) t2-- filter by partition
 as well
 on t1.a=t2.a
 group by t1.b,t1.c,t2.d,t2.e


 two questions:
 1. would joining baseTB1 and baseTB2 directly (instead of subqueries) be
 better in any way?
   (I know subqueries cause a lot of writes of the intermediate
 data but we also understand it's best to filter down the data that is being
 joined, which is more correct?)
 2. can I use 'distribute by ' and/or 'sort by' in some way that would help
 this? my understanding at the moment is that the problem lies in the fact
 that the reduces are on column a while the group by is on column b ...

 Any thoughts would be appreciated.




-- 
Nitin Pawar