Hi Edward,

You said that there may be a "chain" of many map/reduce jobs. Is this realized by the class "Chain" (org.apache.hadoop.mapred.lib)?

I think it would save jobs if, within the chain, the output of one map/reduce job could be the input of many other jobs; that would be more effective. This means the "chain" would have many branches, with many jobs sharing the same input, so the structure of the chain would look like a tree. If not, if it is just a simple linear chain, then I think there would be no saving.

So, what's your opinion?

Regards,
Zhou
_____
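[Editor's note: Zhou's branching-chain idea, where one upstream job's output is shared as the input of several downstream jobs, can be sketched as a tiny DAG of plain functions. This is only an illustration of the reuse he describes; the stage names are hypothetical and this is not the Hadoop Chain API.]

```python
# A "tree" of jobs: stage1's output is computed once and shared by
# two downstream branches. A simple linear chain per branch would
# instead re-run stage1 for each branch.

def stage1(records):
    # shared upstream job, e.g. a join or filter over the raw input
    return [r * 2 for r in records]

def branch_count(shared):
    # first downstream job consuming the shared output
    return len(shared)

def branch_sum(shared):
    # second downstream job consuming the same shared output
    return sum(shared)

raw = [1, 2, 3]
shared = stage1(raw)          # evaluated exactly once
print(branch_count(shared))   # 3
print(branch_sum(shared))     # 12
```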
From: Edward Capriolo [mailto:[email protected]]
Sent: Tuesday, June 22, 2010 11:32 PM
To: [email protected]
Cc: [email protected]
Subject: Re: hive Multi Table/File Inserts questions

On Tue, Jun 22, 2010 at 2:55 AM, Zhou Shuaifeng <[email protected]> wrote:

Hi, when I use Multi Table/File Inserts commands, some may be no more effective than running the single-table insert commands separately. For example:

    from pokes
    insert overwrite table pokes_count
    select bar, count(foo) group by bar
    insert overwrite table pokes_sum
    select bar, sum(foo) group by bar;

Executing this needs 2 map/reduce jobs, which is no fewer than running the two commands separately:

    insert overwrite table pokes_count select bar, count(foo) from pokes group by bar;
    insert overwrite table pokes_sum select bar, sum(foo) from pokes group by bar;

And the time taken is the same. But the first form seems to scan the table 'pokes' only once, so why are 2 map/reduce jobs still needed? And why can't the time taken be less? Is there any way to make it more effective?

Thanks a lot,
Zhou

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

Zhou,

In the case of simple selects and a few tables you are not going to see the full benefit. Imagine some complex query like this:

    from (
      from (
        select (table1 join table2 where x=6) t1
      ) x
      join table3 on x.col1 = t3.col1
    ) y

This could theoretically be a chain of thousands of map/reduce jobs. Then you would save jobs and time by evaluating the shared input only once. Also, you are only testing with 2 output tables. What happens with 10 or 20? Just curious.
Regards, Edward
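[Editor's note: the single-scan behaviour Zhou asks about can be sketched outside Hive. One pass over the rows can feed both aggregations at once, which is what the multi-insert form above expresses. A minimal Python sketch, using toy data shaped like the `pokes` example; this is an illustration of the idea, not Hive's actual execution plan:]

```python
from collections import defaultdict

# Toy stand-in for the 'pokes' table: rows of (foo, bar).
pokes = [(1, "a"), (2, "a"), (3, "b"), (4, "b"), (5, "b")]

def multi_insert_single_scan(rows):
    """One scan over the input feeds both aggregations, mirroring
    'FROM pokes INSERT ... count(foo) ... INSERT ... sum(foo) ...'."""
    pokes_count = defaultdict(int)  # bar -> count(foo)
    pokes_sum = defaultdict(int)    # bar -> sum(foo)
    for foo, bar in rows:           # the table is read exactly once
        pokes_count[bar] += 1
        pokes_sum[bar] += foo
    return dict(pokes_count), dict(pokes_sum)

counts, sums = multi_insert_single_scan(pokes)
print(counts)  # {'a': 2, 'b': 3}
print(sums)    # {'a': 3, 'b': 12}
```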
