Re: how is number of mappers determined in mapside join?

2012-03-20 Thread Bruce Bian
Thanks Bejoy! That helps.

On Tue, Mar 20, 2012 at 12:10 AM, Bejoy Ks  wrote:

> Hi Bruce
>   From my understanding, that formula is not for
> CombineFileInputFormat but for other basic Input Formats.
>
> I'd just brief you on CombineFileInputFormat to make things clearer.
>   In the default TextInputFormat every HDFS block is processed by a
> mapper. But if the files are small, say 5 MB, spawning that many mappers
> would be overkill for the job. So here we use CombineFileInputFormat, where
> one mapper processes more than one small file; the minimum data size a
> mapper should process is defined by the min split size, and the maximum
> data a mapper can process is defined by the max split size. That is, the
> data processed by a mapper is guaranteed to be not less than the min split
> size and not more than the max split size specified.
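
[Editor's note: not part of the original mail. A toy Python sketch of the packing behaviour described above, using a simple greedy strategy; the real CombineFileInputFormat also groups splits by node and rack locality, which this ignores.]

```python
# Toy model of CombineFileInputFormat-style split packing:
# greedily pack small files into splits no larger than max_split_size.
# (The real implementation also considers node/rack locality.)

def combine_splits(file_sizes, max_split_size):
    splits, current = [], 0
    for size in file_sizes:
        # Start a new split once adding this file would exceed the max.
        if current + size > max_split_size and current > 0:
            splits.append(current)
            current = 0
        current += size
    if current > 0:
        splits.append(current)
    return splits

MB = 1024 * 1024
small_files = [5 * MB] * 20               # twenty 5 MB files, 100 MB in all
print(combine_splits(small_files, 32 * MB))  # a few ~30 MB splits instead of 20 mappers
```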
>
> As you asked, if you are looking for more mappers with
> CombineFileInputFormat, reduce the value of the max split size. Bump it
> down to 32 MB (your block size) and just try it out. Or, if you are
> looking for num mappers = num blocks, just change the input format in Hive.
>
> By the way, 32 MB is too small for an HDFS block size; you may hit
> NameNode memory issues pretty soon. Consider increasing it to at least 64
> MB, though most larger clusters use either 128 MB or 256 MB blocks.
>
> Hope it helps!
>
> Regards
> Bejoy
>
>   --
> *From:* Bruce Bian 
> *To:* user@hive.apache.org; Bejoy Ks 
> *Sent:* Monday, March 19, 2012 7:48 PM
> *Subject:* Re: how is number of mappers determined in mapside join?
>
> Hi Bejoy,
> Thanks for your reply.
> The function is from the book, Hadoop The Definitive Guide 2nd edition. On
> page 203 there is
> "The split size is calculated by the formula (see the computeSplitSize()
> method in FileInputFormat): max(minimumSize, min(maximumSize, blockSize))
> by default: minimumSize < blockSize < maximumSize, so the split size is
> blockSize."
>
> And I've actually used the HDFS block size to control the number of
> mappers launched before.
> As to your response, do you mean that any data size between 1 B
> and 256 MB is OK for a mapper to process?
> Then the only way I can think of to increase the number of mappers is to
> reduce the max split size.
>
> Regards,
> Bruce
>
> On Mon, Mar 19, 2012 at 8:48 PM, Bejoy Ks  wrote:
>
> Hi Bruce
>   In a map-side join the smaller table is loaded in memory, and hence the
> number of mappers depends only on the data in the larger table. Say
> CombineHiveInputFormat is used and we have our HDFS block size as 32 MB,
> min split size as 1 B and max split size as 256 MB. That means one mapper
> would process data chunks not less than 1 B and not more than 256 MB.
> Based on that, mappers would be triggered;
> one possibility in your case:
> mapper 1 - 200 MB
> mapper 2 - 120 MB
> mapper 3 - 140 MB
> Every mapper processes data whose size is between 1 B and 256 MB, for a
> total of 460 MB, your table size.
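
[Editor's note: not part of the original mail. The arithmetic in that example can be sketched as a quick Python check; the three chunk sizes are the hypothetical ones given above, not measured values.]

```python
MB = 1024 * 1024
min_split, max_split = 1, 256 * MB
mapper_inputs = [200 * MB, 120 * MB, 140 * MB]  # the hypothetical chunks above

# Every chunk falls inside [min_split, max_split] ...
assert all(min_split <= s <= max_split for s in mapper_inputs)
# ... and together they cover the whole 460 MB table.
assert sum(mapper_inputs) == 460 * MB
print(len(mapper_inputs), "mappers")  # 3 mappers
```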
>
> I'm not sure about the formula you posted here. Can you point me to the
> document you got it from?
>
> Regards
> Bejoy
>
>   --
> *From:* Bruce Bian 
> *To:* user@hive.apache.org
> *Sent:* Monday, March 19, 2012 2:42 PM
> *Subject:* how is number of mappers determined in mapside join?
>
> Hi there,
> when I'm executing the following queries in hive
>
> set hive.auto.convert.join = true;
> CREATE TABLE IDAP_ROOT as
> SELECT a.*,b.acnt_no
> FROM idap_pi_root a LEFT OUTER JOIN idap_pi_root_acnt b ON
> a.acnt_id=b.acnt_id
>
> the number of mappers run in the map-side join is 3; how is it
> determined? When launching a job in Hadoop MapReduce, I know it's
> determined by the function
> max(minSplitSize, min(maxSplitSize, hdfsBlockSize)), which in my
> configuration is max(1B, min(256MB, 32MB)) = 32MB, and the two tables are
> 460MB and 1.5MB respectively.
> Thus I expected around 15 mappers, which is not the case.
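
[Editor's note: not part of the original mail. Bruce's expected count follows from dividing the larger table by the 32 MB split size; a quick Python check of that reasoning. As Bejoy explains in his replies, this is the per-block split model, not how CombineHiveInputFormat actually packs splits.]

```python
import math

MB = 1024 * 1024
split_size = max(1, min(256 * MB, 32 * MB))   # = 32 MB, the HDFS block size
big_table = 460 * MB                          # the larger join input
expected_mappers = math.ceil(big_table / split_size)
print(expected_mappers)  # 15 -- the count Bruce expected, not the 3 he observed
```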
>
> Thanks
> Bruce
>
>
>
>
>
>

