Re: Proper blocksize and io.sort.mb setting when using compressed LZO files

2010-09-27 Thread Srigurunath Chakravarthi
Ed,
 Your math is right - 1400 MB would be a good setting for io.sort.mb.

 fs.inmemory.size.mb - which Hadoop version are you using? I suspect this one 
may be deprecated. If it is still supported, you can set it to 1400 MB as well; 
I know it is recognized by 0.21 (and older versions).

 Increasing io.sort.factor to 100 or even 1000 is advisable.

 On the reduce side, to see whether you are already hitting this limit, you can 
observe the reduce task logs. The spill messages will tell you whether spills are 
occurring every 10 files instead of only when the collected map output 
reaches fs.inmemory.size.mb.
 On the map side, I don't know exactly how io.sort.factor gets used. It probably 
imposes a limit on the number of spill files (and forces additional merge steps 
if the spills exceed that limit).

 In any case, blindly setting it to a higher value such as 100 is OK, since it 
won't degrade performance.
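
 In configuration terms, something like the following (values as discussed above; 
please verify the property names, and whether they belong in mapred-site.xml, 
core-site.xml or per-job configuration, against your Hadoop version):

  <configuration>
    <!-- Map-side sort buffer: ~70% of a 2048 MB task heap -->
    <property>
      <name>io.sort.mb</name>
      <value>1400</value>
    </property>
    <!-- Merge width, so the large buffer is used without needless extra merge passes -->
    <property>
      <name>io.sort.factor</name>
      <value>100</value>
    </property>
    <!-- Reduce-side in-memory merge buffer, if your release still honours it -->
    <property>
      <name>fs.inmemory.size.mb</name>
      <value>1400</value>
    </property>
  </configuration>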

Hope this helps,
Sriguru


- Original Message -
From: pig 
To: common-user@hadoop.apache.org 
Sent: Mon Sep 27 07:15:29 2010
Subject: Re: Proper blocksize and io.sort.mb setting when using compressed LZO 
files

Hi Sriguru,

Thank you for the tips.  Just to clarify a few things.

Our machines have 32 GB of RAM.

I'm planning on setting each machine to run 12 mappers and 2 reducers with
the heap size set to 2048 MB, so total heap usage is about 28 GB.
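
In other words, something like this in mapred-site.xml, if I have the 0.20
property names right (14 task slots x 2048 MB is about 28 GB, leaving roughly
4 GB for the DataNode, TaskTracker and OS):

  <configuration>
    <!-- 12 map slots and 2 reduce slots per node -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>12</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
    <!-- 2 GB heap for each spawned task JVM -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx2048m</value>
    </property>
  </configuration>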

If this is the case, should io.sort.mb be set to 70% of 2048 MB (so ~1400 MB)?

Also, I did not see an fs.inmemory.size.mb setting in any of the Hadoop
configuration files.  Is that the correct setting I should be looking for?
Should this also be set to 70% of the heap size, or does it need to share
that space with the io.sort.mb setting?

I assume if I'm bumping up io.sort.mb that much I also need to increase
io.sort.factor from the default of 10.  Is there a recommended relation
between these two?

Thank you for your help!

~Ed

On Sun, Sep 26, 2010 at 3:05 AM, Srigurunath Chakravarthi <
srig...@yahoo-inc.com> wrote:

> Ed,
>  Tuning io.sort.mb will certainly be worthwhile if you have enough RAM to
> allow for a higher Java heap per map task without risking swapping.
>
>  Similarly, you can decrease spills on the reduce side using
> fs.inmemory.size.mb.
>
> You can use the following thumb rules for tuning those two:
>
> - Set these to ~70% of the Java heap size. Pick heap sizes to utilize ~80% of
> RAM across all processes (maps, reducers, TT, DN, other).
> - Set them small enough to avoid swap activity, but
> - Set them large enough to minimize disk spills.
> - Ensure that io.sort.factor is set large enough to allow full use of the
> buffer space.
> - Balance space for output records (default 95%) and record metadata (5%)
> using io.sort.spill.percent and io.sort.record.percent.
>
>  Your mileage may vary. We've seen job execution time improvements of 1-3%
> via spill avoidance across a variety of applications.
>
>  Your other option of running a map per 32MB or 64MB of input should give
> you better performance if your map task execution time is significant (i.e.,
> much larger than a few seconds) compared to the overhead of launching map
> tasks and reading input.
>
> Regards,
> Sriguru
>
> >-Original Message-
> >From: pig [mailto:hadoopn...@gmail.com]
> >Sent: Saturday, September 25, 2010 2:36 AM
> >To: common-user@hadoop.apache.org
> >Subject: Proper blocksize and io.sort.mb setting when using compressed
> >LZO files
> >
> >Hello,
> >
> >We just recently switched to using lzo compressed file input for our
> >hadoop
> >cluster using Kevin Weil's lzo library.  The files are pretty uniform
> >in
> >size at around 200MB compressed.  Our block size is 256MB.
> >Decompressed the
> >average LZO input file is around 1.0GB.  I noticed lots of our jobs are
> >now
> >spilling lots of data to disk.  We have almost 3x more spilled records
> >than
> >map input records for example.  I'm guessing this is because each
> >mapper is
> >getting a 200 MB lzo file which decompresses into 1GB of data per
> >mapper.
> >
> >Would you recommend solving this by reducing the block size to 64MB, or
> >even
> >32MB and then using the LZO indexer so that a single 200MB lzo file is
> >actually split among 3 or 4 mappers?  Would it be better to play with
> >the
> >io.sort.mb value?  Or, would it be best to play with both? Right now
> >the
> >io.sort.mb value is the default 200MB. Have other lzo users had to
> >adjust
> >their block size to compensate for the "expansion" of the data after
> >decompression?
> >
> >Thank you for any help!
> >
> >~Ed
>


RE: Proper blocksize and io.sort.mb setting when using compressed LZO files

2010-09-26 Thread Srigurunath Chakravarthi
Ed,
 Tuning io.sort.mb will certainly be worthwhile if you have enough RAM to allow 
for a higher Java heap per map task without risking swapping.

 Similarly, you can decrease spills on the reduce side using fs.inmemory.size.mb.

You can use the following thumb rules for tuning those two:

- Set these to ~70% of the Java heap size. Pick heap sizes to utilize ~80% of RAM 
across all processes (maps, reducers, TT, DN, other).
- Set them small enough to avoid swap activity, but
- Set them large enough to minimize disk spills.
- Ensure that io.sort.factor is set large enough to allow full use of the buffer 
space.
- Balance space for output records (default 95%) and record metadata (5%) using 
io.sort.spill.percent and io.sort.record.percent.
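
 As a rough sketch of those thumb rules in configuration form (the numbers are 
placeholders to tune for your own heap sizes; please verify the property names 
against your Hadoop version):

  <configuration>
    <!-- Map-side sort buffer: ~70% of the task heap (e.g. for a ~512 MB heap) -->
    <property>
      <name>io.sort.mb</name>
      <value>350</value>
    </property>
    <!-- Merge width: large enough that spills are not re-merged needlessly -->
    <property>
      <name>io.sort.factor</name>
      <value>100</value>
    </property>
    <!-- Fraction of the buffer that fills before a background spill starts -->
    <property>
      <name>io.sort.spill.percent</name>
      <value>0.80</value>
    </property>
    <!-- Share of io.sort.mb reserved for record metadata; the rest holds records -->
    <property>
      <name>io.sort.record.percent</name>
      <value>0.05</value>
    </property>
  </configuration>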

 Your mileage may vary. We've seen job execution time improvements of 1-3% via 
spill avoidance across a variety of applications.

 Your other option of running a map per 32MB or 64MB of input should give you 
better performance if your map task execution time is significant (i.e., much 
larger than a few seconds) compared to the overhead of launching map tasks and 
reading input.
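
 If you go that route, the block size is a client-side setting applied to files 
at write time (existing LZO files would also need to be indexed to become 
splittable). A sketch, e.g. in hdfs-site.xml:

  <configuration>
    <!-- 64 MB blocks, so a ~200 MB LZO file spans 3-4 blocks (and splits, once indexed) -->
    <property>
      <name>dfs.block.size</name>
      <value>67108864</value>
    </property>
  </configuration>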

Regards,
Sriguru

>-Original Message-
>From: pig [mailto:hadoopn...@gmail.com]
>Sent: Saturday, September 25, 2010 2:36 AM
>To: common-user@hadoop.apache.org
>Subject: Proper blocksize and io.sort.mb setting when using compressed
>LZO files
>
>Hello,
>
>We just recently switched to using lzo compressed file input for our
>hadoop
>cluster using Kevin Weil's lzo library.  The files are pretty uniform
>in
>size at around 200MB compressed.  Our block size is 256MB.
>Decompressed the
>average LZO input file is around 1.0GB.  I noticed lots of our jobs are
>now
>spilling lots of data to disk.  We have almost 3x more spilled records
>than
>map input records for example.  I'm guessing this is because each
>mapper is
>getting a 200 MB lzo file which decompresses into 1GB of data per
>mapper.
>
>Would you recommend solving this by reducing the block size to 64MB, or
>even
>32MB and then using the LZO indexer so that a single 200MB lzo file is
>actually split among 3 or 4 mappers?  Would it be better to play with
>the
>io.sort.mb value?  Or, would it be best to play with both? Right now
>the
>io.sort.mb value is the default 200MB. Have other lzo users had to
>adjust
>their block size to compensate for the "expansion" of the data after
>decompression?
>
>Thank you for any help!
>
>~Ed


Re: Question regarding Shuffle

2010-07-01 Thread Srigurunath Chakravarthi
Ahmad,
 Yes, HTTP is used. Reduce tasks fetch map output from the TaskTrackers over HTTP.
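
 A couple of related knobs, in case they are useful (names and defaults as in 
0.20.x; please verify against your version):

  <configuration>
    <!-- Threads the TaskTracker's embedded HTTP server uses to serve map output -->
    <property>
      <name>tasktracker.http.threads</name>
      <value>40</value>
    </property>
    <!-- Parallel fetches each reduce task runs during the copy phase -->
    <property>
      <name>mapred.reduce.parallel.copies</name>
      <value>5</value>
    </property>
  </configuration>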

Sriguru 

- Original Message -
From: Ahmad Shahzad 
To: common-user@hadoop.apache.org 
Cc: Ahmad Shahzad 
Sent: Thu Jul 01 07:54:51 2010
Subject: Question regarding Shuffle

Hi ALL,
   Does anyone know HOW Mappers pass their output to Reducers? Is
HTTP used for that, or is there some other communication protocol used for
transferring the output of Mappers to Reducers?

Regards,
Ahmad Shahzad


RE: In which configuration file to configure the "fs.inmemory.size.mb" parameter?

2010-07-01 Thread Srigurunath Chakravarthi
Carp,
 IMHO, 0.20.x has it. fs.inmemory.size.mb is the reduce-side equivalent of 
io.sort.mb. In the reduce tasks, intermediate map output is collected into a 
buffer (whose size is governed by this parameter's value), and the data is 
flushed to files as (partially) sorted key-value pairs (KVs).

 These files will be re-merged if we end up with more than io.sort.factor 
files; otherwise, the KVs are served out of these files to the reduce 
function directly.

 I don't know where in the code it is though, sorry.

cheers,
Sriguru


>-Original Message-
>From: Yu Li [mailto:car...@gmail.com]
>Sent: Thursday, July 01, 2010 1:12 PM
>To: common-user@hadoop.apache.org
>Subject: In which configuration file to configure the
>"fs.inmemory.size.mb" parameter?
>
>Hi all,
>
>I looked through the "Cluster Setup" guide under link
>http://hadoop.apache.org/common/docs/r0.20.1/cluster_setup.html and
>found there's a "fs.inmemory.size.mb" parameter for specifying memory
>allocated for the in-memory file-system used to merge map-outputs at
>the reduces, and this parameter is set in the "core-site.xml". But
>when I checked the "core-default.xml" under path
>"$HADOOP_HOME/src/core/", I didn't find the parameter at all, nor
>could I find the parameter through the JobTracker UI after launching jobs.
>Does anybody know about this parameter? Has it been removed from
>release 0.20.X? If it hasn't been removed, how could I set the
>parameter besides using the -D option? Thanks in advance.
>
>Best Regards,
>Carp


RE: sort at reduce side

2010-02-15 Thread Srigurunath Chakravarthi
>So, after shuffle at reduce side,  are the spills actually stored as map
>files?

Yes.

When a reducer has received map output from multiple maps totaling fs.inmemory.size.mb 
in size, it sorts the data and spills it to disk.

If the number of map output files spilled to disk exceeds io.sort.factor, 
additional merge step(s) are performed to bring the file count below that number.

Once all map output data has been spilled to disk, the reduce task starts invoking 
the reduce function, passing it key-value pairs in sorted order by reading them 
off the above files.

Regards,
Sriguru

>-Original Message-
>From: Gang Luo [mailto:lgpub...@yahoo.com.cn]
>Sent: Thursday, February 04, 2010 1:58 AM
>To: common-user@hadoop.apache.org
>Subject: Re: sort at reduce side
>
>Thanks for reply, Sriguru.
>So, after shuffle at reduce side,  are the spills actually stored as map
>files?
>
>Why I ask these questions is based on the following observations. On a
>16-node cluster, when I do a map join, it takes 3 and a half minutes. When
>I do a reduce-side join on nearly the same amount of data, it takes 8
>minutes before the map phase completes. I am sure the computation (map
>function) does not cause so much difference; the extra 4 minutes can only
>be spent on sorting at the map side for the reduce-side join. I also notice
>that the sort time at the reduce side is only 30 sec (I cannot access the
>online jobtracker; the 30 sec is actually the time reduce takes to go from
>33% completeness to 66% completeness).  The number of reduce tasks is much
>smaller than the number of map tasks, which means each reduce task sorts
>more data than each map task (I use the hash partitioner and the data is
>uniformly distributed).  The only reason I can come up with for the big
>difference between the sort at the map side and the reduce side is the
>different behavior of these two sorts.
>
>Does anybody have ideas about why the map takes so much time for the
>reduce-side join compared to the map-side join, and why there is a big
>difference between the sort at the map side and the reduce side?
>
>P.S. I join a 7.5 GB file with a 100 MB file. The sort buffer at the reduce
>side is slightly larger than that at the map side.
>
>
>-Gang
>
>
>
>- Original Message -
>From: Srigurunath Chakravarthi 
>To: "common-user@hadoop.apache.org" 
>Sent: 2010/2/3 (Wed) 12:50:08 AM
>Subject: RE: sort at reduce side
>
>Hi Gang,
>
>>kept in map file. If so, in order to efficiently sort the data, reducer
>>actually only read the index part of each spill (which is a map file) and
>>sort the keys, instead of reading whole records from disk and sort them.
>
>afaik, no. Reducers always fetch map output data and not indexes (even if
>the data is from the local node, where an index might be sufficient).
>
>Regards,
>Sriguru
>
>>-Original Message-
>>From: Gang Luo [mailto:lgpub...@yahoo.com.cn]
>>Sent: Wednesday, February 03, 2010 10:40 AM
>>To: common-user@hadoop.apache.org
>>Subject: sort at reduce side
>>
>>Hi all,
>>I want to know some more details about the sorting at the reduce side.
>>
>>The intermediate result generated at the map side is stored as map file
>>which actually consists of two sub-files, namely index file and data file.
>>The index file stores the keys and it could point to corresponding record
>>stored in the data file.  What I think is that when intermediate result
>>(even only part of it for each mapper) is shuffled to reducer, it is still
>>kept in map file. If so, in order to efficiently sort the data, reducer
>>actually only read the index part of each spill (which is a map file) and
>>sort the keys, instead of reading whole records from disk and sort them.
>>
>>Does reducer actually do as what I expect?
>>
>>-Gang


RE: heap memory

2010-02-08 Thread Srigurunath Chakravarthi
Sorry, I sent out an incomplete message. The last sentence should be:

You can request a larger heap for Map tasks by including an -Xmx setting
in mapred.child.java.opts.
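
For example (512m is just a placeholder - pick a value that fits your node's RAM
and task-slot count; the default is -Xmx200m, which matches the 200 MB you
mentioned):

  <configuration>
    <!-- Heap for each spawned map/reduce task JVM -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m</value>
    </property>
  </configuration>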

regards,
Sriguru

>-Original Message-
>From: Srigurunath Chakravarthi
>Sent: Monday, February 08, 2010 11:51 PM
>To: 'common-user@hadoop.apache.org'
>Subject: RE: heap memory
>
>Hi Gang,
> Not sure if I understood your question right. Responses inline:
>
>>-Original Message-
>>From: Gang Luo [mailto:lgpub...@yahoo.com.cn]
>>Sent: Friday, February 05, 2010 8:24 AM
>>To: common-user@hadoop.apache.org
>>Subject: heap memory
>>
>>HI all,
>>I suppose there is only map function that will consume the heap memory
>>assigned to each map task. While the default heap memory is 200 mb, I just
>>wonder most of the memory is wasted for a simple map function (e.g.
>>IdentityMapper).
>
>Map output records get buffered before being spilled to disk. You can control
>this buffer size via io.sort.mb.
>
>>
>>So, I try to make use of this memory by buffering the output records, or
>>maintaining large data structure in memory, but it doesn't work as I
>>expect. For example, I want to build a hash table on a 100mb table in
>>memory during the life time of that map task. But it fails due to lack of
>>heap memory. Don't I get 200mb heap memory? What others also eat my heap
>>memory?
>
>You can request a larger head for Map tasks by including -Xmx
>
>>
>>Thanks.
>>-Gang


RE: heap memory

2010-02-08 Thread Srigurunath Chakravarthi
Hi Gang,
 Not sure if I understood your question right. Responses inline:

>-Original Message-
>From: Gang Luo [mailto:lgpub...@yahoo.com.cn]
>Sent: Friday, February 05, 2010 8:24 AM
>To: common-user@hadoop.apache.org
>Subject: heap memory
>
>HI all,
>I suppose there is only map function that will consume the heap memory
>assigned to each map task. While the default heap memory is 200 mb, I just
>wonder most of the memory is wasted for a simple map function (e.g.
>IdentityMapper).

Map output records get buffered before being spilled to disk. You can control 
this buffer size via io.sort.mb.

>
>So, I try to make use of this memory by buffering the output records, or
>maintaining large data structure in memory, but it doesn't work as I
>expect. For example, I want to build a hash table on a 100mb table in
>memory during the life time of that map task. But it fails due to lack of
>heap memory. Don't I get 200mb heap memory? What others also eat my heap
>memory?

You can request a larger head for Map tasks by including -Xmx

>
>Thanks.
>-Gang


RE: sort at reduce side

2010-02-02 Thread Srigurunath Chakravarthi
Hi Gang,

>kept in map file. If so, in order to efficiently sort the data, reducer
>actually only read the index part of each spill (which is a map file) and
>sort the keys, instead of reading whole records from disk and sort them. 

 afaik, no. Reducers always fetch map output data and not indexes (even if the 
data is from the local node, where an index might be sufficient).

Regards,
Sriguru

>-Original Message-
>From: Gang Luo [mailto:lgpub...@yahoo.com.cn]
>Sent: Wednesday, February 03, 2010 10:40 AM
>To: common-user@hadoop.apache.org
>Subject: sort at reduce side
>
>Hi all,
>I want to know some more details about the sorting at the reduce side.
>
>The intermediate result generated at the map side is stored as map file
>which actually consists of two sub-files, namely index file and data file.
>The index file stores the keys and it could point to corresponding record
>stored in the data file.  What I think is that when intermediate result
>(even only part of it for each mapper) is shuffled to reducer, it is still
>kept in map file. If so, in order to efficiently sort the data, reducer
>actually only read the index part of each spill (which is a map file) and
>sort the keys, instead of reading whole records from disk and sort them.
>
>Does reducer actually do as what I expect?
>
>-Gang


RE: Good idea to run NameNode and JobTracker on same machine?

2009-11-26 Thread Srigurunath Chakravarthi
Raymond,
Load-wise, it should be very safe to run both the JT and NN on a single node for 
small clusters (< 40 TaskTrackers and/or DataNodes). They don't use much CPU 
as such.

 This may even work for larger clusters, depending on the type of hardware you 
have and the Hadoop job mix. We usually observe < 5% CPU load with ~80 DNs/TTs 
on an 8-core Intel-based box with 16 GB RAM.

 It is best to observe the CPU and memory load on the JT+NN node to decide 
whether to separate them. iostat, top, or sar should tell you.

Regards,
Sriguru

>-Original Message-
>From: John Martyniak [mailto:j...@beforedawnsolutions.com]
>Sent: Friday, November 27, 2009 3:06 AM
>To: common-user@hadoop.apache.org
>Cc: 
>Subject: Re: Good idea to run NameNode and JobTracker on same machine?
>
>I have a cluster of 4 machines plus one machine to run nn & jt.  I
>have heard that 5 or 6 is the magic #.  I will see when I add the next
>batch of machines.
>
>And it seems to running fine.
>
>-John
>
>On Nov 26, 2009, at 11:38 AM, Yongqiang He 
>wrote:
>
>> I think it is definitely not a good idea to combine these two in
>> production
>> environment.
>>
>> Thanks
>> Yongqiang
>> On 11/26/09 6:26 AM, "Raymond Jennings III" 
>> wrote:
>>
>>> Do people normally combine these two processes onto one machine?
>>> Currently I
>>> have them on separate machines but I am wondering whether they use that
>>> much CPU
>>> processing time and maybe I should combine them and create another
>>> DataNode.