Thanks Bejoy! That helps.

On Tue, Mar 20, 2012 at 12:10 AM, Bejoy Ks <bejoy...@yahoo.com> wrote:

> Hi Bruce
>       From my understanding, that formula is not for
> CombineFileInputFormat but for the basic input formats.
>
> I'll just brief you on CombineFileInputFormat to make things clearer.
>       In the default TextInputFormat, every hdfs block is processed by a
> mapper. But if the files are small, say 5 MB, spawning that many mappers
> would be overkill for the job. So here we use CombineFileInputFormat,
> where one mapper processes more than one small file: the minimum data a
> mapper should process is defined by the min split size, and the maximum
> data a mapper can process is defined by the max split size. i.e. the data
> processed by a mapper is guaranteed to be not less than the min split
> size and not more than the max split size specified.
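>
> For reference, these are the knobs involved. A minimal sketch, assuming
> the pre-YARN property names (do verify them against your Hadoop version):
>
>   -- make Hive combine small files into splits
>   set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>   -- lower bound on the data one mapper gets (1 byte)
>   set mapred.min.split.size=1;
>   -- upper bound on the data one mapper gets (256 MB)
>   set mapred.max.split.size=268435456;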
>
> As you asked, if you want more mappers with CombineFileInputFormat, then
> reduce the value of the max split size. Bump it down to 32 MB (your block
> size) and just try it out. Or, if you want num mappers = num blocks, just
> change the input format in Hive.
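>
> Concretely, before running the query, something like this (a sketch with
> your numbers; 33554432 = 32 * 1024 * 1024):
>
>   -- option 1: cap each mapper at one block's worth of data
>   set mapred.max.split.size=33554432;
>
>   -- option 2: one mapper per block, via the non-combining input format
>   set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;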
>
> By the way, 32 MB is too small for an hdfs block size; you may hit NN
> memory issues pretty soon. Consider increasing it to at least 64 MB,
> though most larger clusters use either 128 or 256 MB blocks.
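>
> If you do raise the block size, note it only applies to files written
> after the change. A hedged example (134217728 = 128 MB; dfs.block.size is
> the old-style client property, and it's my assumption that your version
> still honors it when set per session):
>
>   set dfs.block.size=134217728;
>   -- then rewrite the table, e.g. CREATE TABLE ... AS SELECT ...,
>   -- so the new files pick up the larger blocks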
>
> Hope it helps!
>
> Regards
> Bejoy
>
>   ------------------------------
> *From:* Bruce Bian <weidong....@gmail.com>
> *To:* user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
> *Sent:* Monday, March 19, 2012 7:48 PM
> *Subject:* Re: how is number of mappers determined in mapside join?
>
> Hi Bejoy,
> Thanks for your reply.
> The function is from the book, Hadoop The Definitive Guide 2nd edition. On
> page 203 there is
> "The split size is calculated by the formula (see the computeSplitSize()
> method in FileInputFormat): max(minimumSize, min(maximumSize, blockSize))
> by default:minimumSize < blockSize < maximumSize so the split size is
> blockSize."
>
> And I've actually used the HDFS block size to control the number of
> mappers launched before.
> So, as to your response, do you mean that any amount of data between 1B
> and 256MB is OK for a mapper to process?
> Then the only way I can think of to increase the #mappers is to reduce
> the max split size.
>
> Regards,
> Bruce
>
> On Mon, Mar 19, 2012 at 8:48 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:
>
> Hi Bruce
>       In a map side join the smaller table is loaded into memory, and
> hence the number of mappers depends only on the data in the larger table.
> Say CombineHiveInputFormat is used and we have an hdfs block size of 32
> MB, a min split size of 1B and a max split size of 256 MB. That means one
> mapper would process a data chunk not less than 1B and not more than 256
> MB. Mappers are triggered based on that,
> so one possibility in your case is
> mapper 1 - 200 MB
> mapper 2 - 120 MB
> mapper 3 - 140 MB
> Every mapper processes data whose size is between 1B and 256 MB,
> totaling 460 MB, your table size.
>
> I'm not sure about the formula you posted here. Can you point me to the
> document you got it from?
>
> Regards
> Bejoy
>
>   ------------------------------
> *From:* Bruce Bian <weidong....@gmail.com>
> *To:* user@hive.apache.org
> *Sent:* Monday, March 19, 2012 2:42 PM
> *Subject:* how is number of mappers determined in mapside join?
>
> Hi there,
> when I'm executing the following queries in hive
>
> set hive.auto.convert.join = true;
> CREATE TABLE IDAP_ROOT as
> SELECT a.*,b.acnt_no
> FROM idap_pi_root a LEFT OUTER JOIN idap_pi_root_acnt b ON
> a.acnt_id=b.acnt_id
>
> the number of mappers run in the mapside join is 3. How is it
> determined? When launching a job in hadoop mapreduce, I know it's
> determined by the formula
> max(minSplitSize, min(maxSplitSize, blockSize)), which in my
> configuration is max(1B, min(256MB, 32MB)) = 32MB, and the two tables are
> 460MB and 1.5MB respectively.
> Thus I expected around 15 mappers (460MB / 32MB), which is not the
> case.
>
> Thanks
> Bruce
>
