That's good, I've got it. Thanks a lot.
-----Original Message-----
From: Edward Capriolo [mailto:[email protected]]
Sent: Wednesday, June 23, 2010 9:21 AM
To: [email protected]
Cc: [email protected]
Subject: Re: hive Multi Table/File Inserts questions

On Tue, Jun 22, 2010 at 8:57 PM, Zhou Shuaifeng <[email protected]> wrote:
>
> Hi Edward,
> You said that there may be a "chain" of many map/reduce jobs. Is this
> realized by the class "Chain" (org.apache.hadoop.mapred.lib)?
>
> I also think it could save jobs if, within the chain, the output of one
> map/reduce job can be the input of many other jobs; that would be more
> efficient. It means the "chain" has many branches and many jobs share the
> same input, so the structure of the chain is like a tree.
>
> If not, and it is just a simple linear chain, I think there is no gain.
>
> So, what's your opinion?
>
> Regards,
> Zhou
>
> ________________________________
> From: Edward Capriolo [mailto:[email protected]]
> Sent: Tuesday, June 22, 2010 11:32 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: hive Multi Table/File Inserts questions
>
> On Tue, Jun 22, 2010 at 2:55 AM, Zhou Shuaifeng <[email protected]> wrote:
>>
>> Hi, when I use the Multi Table/File Inserts commands, some of them seem
>> to be no more efficient than running the insert commands separately.
>>
>> For example:
>>
>>    from pokes
>>    insert overwrite table pokes_count select bar, count(foo) group by bar
>>    insert overwrite table pokes_sum   select bar, sum(foo)   group by bar;
>>
>> Executing this needs 2 map/reduce jobs, which is no fewer than running
>> the two commands separately:
>>
>>    insert overwrite table pokes_count select bar, count(foo) from pokes group by bar;
>>    insert overwrite table pokes_sum   select bar, sum(foo)   from pokes group by bar;
>>
>> And the time taken is the same.
>> But the first form seems to scan the table 'pokes' only once, so why does
>> it still need 2 map/reduce jobs? And why can't the time taken be less?
>> Is there any way to make it more efficient?
>>
>> Thanks a lot,
>> Zhou
>
> Zhou,
>
> In the case of simple selects and a few tables you are not going to see
> the full benefit.
>
> Imagine some complex query like this:
>
> from (
>   from (
>     select (table1 join table2 where x=6) t1
>   ) x join table3 on x.col1 = t3.col1
> ) y
>
> This could theoretically be a chain of thousands of map/reduce jobs. Then
> you would save jobs and time by only evaluating once.
>
> Also, you are only testing with 2 output tables. What happens with 10 or
> 20? Just curious.
>
> Regards,
> Edward

By "chain" I meant that a complex query can be multiple map/reduce jobs
(stages). Hive does not use ChainMapper or ChainReducer (that I know of).
Depending on the query, the output of one stage can be the input to the
next. Let me give an example.

http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins

SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

"there are two map/reduce jobs involved in computing the join."

I hope my syntax is ok, but this gets the point across.
SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
insert overwrite table a_count select a.val, count(b.val) group by a.val
insert overwrite table b_count select a.val, count(c.val) group by a.val
= 4 jobs

SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
insert overwrite table a_count select a.val, count(b.val) group by a.val;

SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
insert overwrite table b_count select a.val, count(c.val) group by a.val;
= 6 map/reduce jobs

With more outputs you would get more savings.
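If you want to verify the job counts on your side, EXPLAIN will print the
stage plan for a query. Here is a rough sketch using your pokes example (the
STAGE DEPENDENCIES section near the top of the output lists each map/reduce
stage the query compiles into):

   explain
   from pokes
   insert overwrite table pokes_count select bar, count(foo) group by bar
   insert overwrite table pokes_sum   select bar, sum(foo)   group by bar;

Counting the stages there, and in the plans for the two separate inserts,
should show exactly where the extra jobs come from.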

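To make the "more outputs" point concrete, a wider multi-insert would look
something like this (pokes_min, pokes_max, and pokes_avg are made-up target
tables, just for illustration):

   from pokes
   insert overwrite table pokes_count select bar, count(foo) group by bar
   insert overwrite table pokes_sum   select bar, sum(foo)   group by bar
   insert overwrite table pokes_min   select bar, min(foo)   group by bar
   insert overwrite table pokes_max   select bar, max(foo)   group by bar
   insert overwrite table pokes_avg   select bar, avg(foo)   group by bar;

Written separately that is five statements to plan and launch; as a single
multi-insert it is compiled and submitted once, and EXPLAIN will show whether
the aggregations also share any stages on your version of Hive.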