Thanks Bejoy! That helps.

On Tue, Mar 20, 2012 at 12:10 AM, Bejoy Ks <bejoy...@yahoo.com> wrote:
> Hi Bruce
>
> From my understanding, that formula is not for CombineFileInputFormat but for the other basic input formats.
>
> I'd just brief you on CombineFileInputFormat to make things clearer. In the default TextInputFormat, every HDFS block is processed by a mapper. But if the files are small, say 5 MB, spawning that many mappers would be overkill for the job. So here we use CombineFileInputFormat, where one mapper processes more than one small file. The minimum amount of data a mapper should process is defined by the min split size, and the maximum a mapper can process is defined by the max split size. That is, the data processed by a mapper is guaranteed to be not less than the min split size and not more than the max split size specified.
>
> As you asked, if you are looking for more mappers with CombineFileInputFormat, then reduce the value of the max split size. Bump it down to 32 MB (your block size) and just try it out. Or, if you want num mappers = num blocks, just change the input format in Hive.
>
> By the way, 32 MB is too small for an HDFS block size; you may hit NameNode memory issues pretty soon. Consider increasing it to at least 64 MB, though larger clusters all use either 128 or 256 MB blocks.
>
> Hope it helps!
>
> Regards
> Bejoy
>
> ------------------------------
> *From:* Bruce Bian <weidong....@gmail.com>
> *To:* user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
> *Sent:* Monday, March 19, 2012 7:48 PM
> *Subject:* Re: how is number of mappers determined in mapside join?
>
> Hi Bejoy,
> Thanks for your reply.
> The function is from the book Hadoop: The Definitive Guide, 2nd edition. On page 203 there is:
>
> "The split size is calculated by the formula (see the computeSplitSize() method in FileInputFormat): max(minimumSize, min(maximumSize, blockSize)). By default, minimumSize < blockSize < maximumSize, so the split size is blockSize."
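The computeSplitSize() formula quoted from the book can be sketched in a few lines of standalone Java (a minimal illustration of the arithmetic, not the actual Hadoop FileInputFormat source):

```java
// Sketch of the split-size formula from "Hadoop: The Definitive Guide":
// max(minimumSize, min(maximumSize, blockSize)). A standalone illustration,
// not the real org.apache.hadoop FileInputFormat code.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // The configuration from this thread: min = 1 B, max = 256 MB, block = 32 MB
        long split = computeSplitSize(32 * mb, 1L, 256 * mb);
        System.out.println(split / mb);                      // 32: split size equals block size
        System.out.println((460 * mb + split - 1) / split);  // 15: naive mapper count for a 460 MB table
    }
}
```

With minimumSize < blockSize < maximumSize, both clamps are no-ops and the split size collapses to the block size, which is why a 460 MB input would naively yield about 15 splits.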
> And I've actually used the HDFS block size to control the number of mappers launched before.
> So as to your response, do you mean that any amount of data between 1 B and 256 MB is OK for a mapper to process?
> Then the only way I can think of to increase the number of mappers is to reduce the max split size.
>
> Regards,
> Bruce
>
> On Mon, Mar 19, 2012 at 8:48 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:
>
> Hi Bruce
>
> In a map-side join the smaller table is loaded into memory, and hence the number of mappers depends only on the data in the larger table. Say CombineHiveInputFormat is used and we have our HDFS block size as 32 MB, min split size as 1 B, and max split size as 256 MB. That means one mapper would process data chunks of not less than 1 B and not more than 256 MB. Mappers are triggered based on that, so one possibility in your case is:
>
> mapper 1 - 200 MB
> mapper 2 - 120 MB
> mapper 3 - 140 MB
>
> Every mapper processes data whose size is between 1 B and 256 MB, for a total of 460 MB, your table size.
>
> I'm not sure of the formula you posted here. Can you point me to the document from which you got it?
>
> Regards
> Bejoy
>
> ------------------------------
> *From:* Bruce Bian <weidong....@gmail.com>
> *To:* user@hive.apache.org
> *Sent:* Monday, March 19, 2012 2:42 PM
> *Subject:* how is number of mappers determined in mapside join?
>
> Hi there,
> When I'm executing the following queries in Hive:
>
> set hive.auto.convert.join = true;
> CREATE TABLE IDAP_ROOT as
> SELECT a.*, b.acnt_no
> FROM idap_pi_root a LEFT OUTER JOIN idap_pi_root_acnt b ON a.acnt_id = b.acnt_id
>
> the number of mappers to run in the map-side join is 3. How is it determined? When launching a Hadoop MapReduce job, I know it's determined by the function max(minSplitSize, min(maxSplitSize, blockSize)), which in my configuration is max(1 B, min(256 MB, 32 MB)) = 32 MB, and the two tables are 460 MB and 1.5 MB respectively.
> Thus I thought the number of mappers launched would be around 15 (460 MB / 32 MB), which is not the case.
>
> Thanks
> Bruce
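The split-packing behavior Bejoy describes, where CombineFileInputFormat feeds one mapper several small files up to the max split size, can be sketched roughly as follows. This is a hypothetical simplification: the real CombineFileInputFormat also honors the min split size and node/rack locality, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of combine-style split packing: greedily group
// small files into splits of at most maxSplitSize bytes each. Not the
// real CombineFileInputFormat, which also considers data locality.
public class CombineSplitSketch {
    static List<Long> combineSplits(long[] fileSizes, long maxSplitSize) {
        List<Long> splits = new ArrayList<>();
        long current = 0;
        for (long size : fileSizes) {
            // Close the current split before it would exceed the cap
            // (assumes each individual file fits within maxSplitSize).
            if (current + size > maxSplitSize && current > 0) {
                splits.add(current);
                current = 0;
            }
            current += size;
        }
        if (current > 0) splits.add(current); // leftover files form the final split
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Twenty 5 MB files with a 32 MB max split size: a handful of
        // mappers instead of one mapper per file.
        long[] files = new long[20];
        Arrays.fill(files, 5 * mb);
        System.out.println(combineSplits(files, 32 * mb).size()); // 4
    }
}
```

The point of the sketch is the contrast with TextInputFormat: twenty small files would otherwise mean twenty mappers, while the combine approach caps each mapper's input at the max split size and packs files until that cap is reached.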