Hi Bejoy,
Thanks for your reply.
The formula is from the book Hadoop: The Definitive Guide, 2nd edition. On
page 203 it says:
"The split size is calculated by the formula (see the computeSplitSize()
method in FileInputFormat): max(minimumSize, min(maximumSize, blockSize))
by default:minimumSize < blockSize < maximumSize so the split size is
blockSize."

And I've actually used the HDFS block size to control the number of mappers
launched before.
So, regarding your response, do you mean that any split size between 1 B
and 256 MB is acceptable for a mapper to process?
Then the only way I can think of to increase the number of mappers is to
reduce the max split size.
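For example, something like this (assuming the pre-YARN property name that
CombineHiveInputFormat reads; worth verifying against your Hadoop/Hive
versions):

  -- cap each combined split at 64 MB instead of 256 MB
  set mapred.max.split.size=67108864;

With a 64 MB cap, the 460 MB table should come out to roughly 8 mappers
instead of 3.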

Regards,
Bruce

On Mon, Mar 19, 2012 at 8:48 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:

> Hi Bruce
>       In a map-side join the smaller table is loaded into memory, so the
> number of mappers depends only on the data in the larger table. Say
> CombineHiveInputFormat is used and we have an HDFS block size of 32 MB, a
> min split size of 1 B, and a max split size of 256 MB. That means one
> mapper would process a data chunk of no less than 1 B and no more than
> 256 MB. Mappers are triggered based on that, so one possibility in your
> case is:
> mapper 1 - 200 MB
> mapper 2 - 120 MB
> mapper 3 - 140 MB
> Every mapper processes data whose size is between 1 B and 256 MB, for a
> total of 460 MB, your table size.
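>
> As a rough illustration (the actual packing depends on how the blocks are
> laid out across nodes and racks), CombineHiveInputFormat can pack many
> 32 MB blocks into a single combined split as long as it stays under the
> 256 MB cap:
>
>   460 MB table / 32 MB block size = ~15 blocks
>   15 blocks packed into splits of <= 256 MB = as few as 2 or 3 mappers
>
> That is why a per-block estimate of ~15 mappers does not hold here.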
>
> I'm not sure of the formula you posted here, Can you point me to the
> document from which you got this?
>
> Regards
> Bejoy
>
>   ------------------------------
> From: Bruce Bian <weidong....@gmail.com>
> To: user@hive.apache.org
> Sent: Monday, March 19, 2012 2:42 PM
> Subject: how is number of mappers determined in mapside join?
>
> Hi there,
> when I'm executing the following queries in Hive:
>
> set hive.auto.convert.join = true;
> CREATE TABLE IDAP_ROOT as
> SELECT a.*,b.acnt_no
> FROM idap_pi_root a LEFT OUTER JOIN idap_pi_root_acnt b ON
> a.acnt_id=b.acnt_id;
>
> the number of mappers run in the map-side join is 3; how is that
> determined? When launching a job in Hadoop MapReduce, I know it's
> determined by the formula
> max(min split size, min(max split size, HDFS block size)), which in my
> configuration is max(1B, min(256MB, 32MB)) = 32MB, and the two tables
> are 460MB and 1.5MB respectively.
> Thus I expected the number of mappers launched to be around 15
> (460MB / 32MB), which is not the case.
>
> Thanks
> Bruce
>
>
>
