That's good, I've got it. Thanks a lot.
-----Original Message-----
From: Edward Capriolo [mailto:[email protected]]
Sent: Wednesday, June 23, 2010 9:21 AM
To: [email protected]
Cc: [email protected]
Subject: Re: hive Multi Table/File Inserts questions

On Tue, Jun 22, 2010 at 8:57 PM, Zhou Shuaifeng <[email protected]> wrote:
>
> Hi Edward,
> You said that there may be a "chain" of many map/reduce jobs. Is this
> realized by the class "Chain" (org.apache.hadoop.mapred.lib)?
>
> I also think it could save jobs if, within the chain, the output of one
> map/reduce job can be the input of many other jobs; that would be more
> efficient. It means the "chain" has many branches and many jobs share the
> same input, so the structure of the chain is like a tree.
>
> If not, and it is just a simple linear chain, I think there is no gain.
>
> So, what's your opinion?
>
> Regards,
> Zhou
>
> ________________________________
> From: Edward Capriolo [mailto:[email protected]]
> Sent: Tuesday, June 22, 2010 11:32 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: hive Multi Table/File Inserts questions
>
> On Tue, Jun 22, 2010 at 2:55 AM, Zhou Shuaifeng <[email protected]> wrote:
>>
>> Hi, when I use the Multi Table/File Inserts commands, some of them seem
>> to be no more efficient than running the insert commands separately.
>>
>> For example:
>>
>>    from pokes
>>    insert overwrite table pokes_count select bar, count(foo) group by bar
>>    insert overwrite table pokes_sum   select bar, sum(foo)   group by bar;
>>
>> Executing this needs 2 map/reduce jobs, which is no fewer than running
>> the two commands separately:
>>
>>    insert overwrite table pokes_count select bar, count(foo) from pokes group by bar;
>>    insert overwrite table pokes_sum   select bar, sum(foo)   from pokes group by bar;
>>
>> And the time taken is the same.
>> But the first form seems to scan the table 'pokes' only once, so why does
>> it still need 2 map/reduce jobs? And why can't the time taken be less?
>> Is there any way to make it more efficient?
>>
>> Thanks a lot,
>> Zhou
>
> Zhou,
>
> In the case of simple selects and a few tables you are not going to see
> the full benefit.
>
> Imagine some complex query like this:
>
> from (
>   from (
>     select (table1 join table2 where x=6) t1
>   ) x join table3 on x.col1 = t3.col1
> ) y
>
> This could theoretically be a chain of thousands of map/reduce jobs. Then
> you would save jobs and time by only evaluating once.
>
> Also, you are only testing with 2 output tables. What happens with 10 or
> 20? Just curious.
>
> Regards,
> Edward

By "chain" I meant that a complex query can be multiple map/reduce jobs
(stages). Hive does not use ChainMapper or ChainReducer (that I know of).
Depending on the query, the output of one stage can be the input to the
next. Let me give an example.

http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins

SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

"there are two map/reduce jobs involved in computing the join."

I hope my syntax is ok, but this gets the point across.
SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
insert overwrite table a_count select a.val, count(b.val) group by a.val
insert overwrite table b_count select a.val, count(c.val) group by a.val
= 4 jobs

SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
insert overwrite table a_count select a.val, count(b.val) group by a.val;

SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
insert overwrite table b_count select a.val, count(c.val) group by a.val;
= 6 map/reduce jobs

With more outputs you would get more savings.
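If you want to verify the job counts on your side, EXPLAIN will print the
stage plan for a query. Here is a rough sketch using your pokes example (the
STAGE DEPENDENCIES section near the top of the output lists each map/reduce
stage the query compiles into):

   explain
   from pokes
   insert overwrite table pokes_count select bar, count(foo) group by bar
   insert overwrite table pokes_sum   select bar, sum(foo)   group by bar;

Counting the stages there, and in the plans for the two separate inserts,
should show exactly where the extra jobs come from.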

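To make the "more outputs" point concrete, a wider multi-insert would look
something like this (pokes_min, pokes_max, and pokes_avg are made-up target
tables, just for illustration):

   from pokes
   insert overwrite table pokes_count select bar, count(foo) group by bar
   insert overwrite table pokes_sum   select bar, sum(foo)   group by bar
   insert overwrite table pokes_min   select bar, min(foo)   group by bar
   insert overwrite table pokes_max   select bar, max(foo)   group by bar
   insert overwrite table pokes_avg   select bar, avg(foo)   group by bar;

Written separately that is five statements to plan and launch; as a single
multi-insert it is compiled and submitted once, and EXPLAIN will show whether
the aggregations also share any stages on your version of Hive.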