In a map-only job, the map tasks are connected directly to the OutputFormat, so calling output.collect() / context.write() in the mapper emits records straight to files in HDFS without any sorting; no sort buffer is involved. If you want exactly one output file, follow Nick's advice.
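For reference, a minimal map-only job looks something like this (just a sketch against the 0.20 org.apache.hadoop.mapreduce API; the class names are mine, not from the thread):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyExample {

  // Pass-through mapper: every record written here goes straight to the
  // OutputFormat, since there is no reduce phase to sort or merge it.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value); // written directly to HDFS, unsorted
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "map-only example");
    job.setJarByClass(MapOnlyExample.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0); // zero reducers => map output is the job output
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each map task writes its own part file, so an N-mapper job leaves N unsorted files. Switching to job.setNumReduceTasks(1) with the default (identity) reducer forces the merge and sort and yields the single output file Nick describes.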
- Aaron

On Fri, Jan 29, 2010 at 8:32 AM, Jones, Nick <nick.jo...@amd.com> wrote:

> A single identity reducer will force a merge and sort to generate one
> file.
>
> Nick Jones
>
> -----Original Message-----
> From: Jeff Zhang [mailto:zjf...@gmail.com]
> Sent: Friday, January 29, 2010 10:06 AM
> To: common-user@hadoop.apache.org
> Subject: Re: map side only behavior
>
> No, the merge and sort will not happen in the mapper task, and each
> mapper task will generate one output file.
>
>
> 2010/1/29 Gang Luo <lgpub...@yahoo.com.cn>
>
> > Hi all,
> > If I use only the map side to process my data (setting the number of
> > reducers to 0), what is the behavior of Hadoop? Will it merge and sort
> > the spills generated by each mapper?
> >
> > -Gang
> >
> >
> > ----- Original Message ----
> > From: Gang Luo <lgpub...@yahoo.com.cn>
> > To: common-user@hadoop.apache.org
> > Sent: Friday, January 29, 2010 8:54:33 AM
> > Subject: Re: fine granularity operation on HDFS
> >
> > Yeah, I see how it works. Thanks Amogh.
> >
> > -Gang
> >
> >
> > ----- Original Message ----
> > From: Amogh Vasekar <am...@yahoo-inc.com>
> > To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> > Sent: Thursday, January 28, 2010 10:00:22 AM
> > Subject: Re: fine granularity operation on HDFS
> >
> > Hi Gang,
> > Yes, PathFilters work only on file paths. I meant you can include that
> > type of logic at the split level.
> > The input format's getSplits() method is responsible for computing the
> > splits and adding them to a list, from which the JobTracker initializes
> > mapper tasks. You can override getSplits() to add only a few of them,
> > say based on location or offset. Here's the reference code:
> >
> > while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
> >   int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
> >   splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
> >                            blkLocations[blkIndex].getHosts()));
> >   bytesRemaining -= splitSize;
> > }
> >
> > if (bytesRemaining != 0) {
> >   splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
> >                            blkLocations[blkLocations.length - 1].getHosts()));
> > }
> >
> > Before each splits.add() you can apply your logic for discarding splits.
> > However, you need to ensure your record reader takes care of incomplete
> > records at split boundaries.
> >
> > To load block locations separately, the FileSystem class exposes methods
> > such as getFileBlockLocations().
> > Hope this helps.
> >
> > Amogh
> >
> > On 1/28/10 7:26 PM, "Gang Luo" <lgpub...@yahoo.com.cn> wrote:
> >
> > Thanks Amogh.
> >
> > For the second part of my question, I actually meant loading blocks
> > separately from HDFS; I don't know whether that is realistic. Anyway,
> > since my goal is to process different divisions of a file separately,
> > doing it at the split level is fine. But even if I can get the splits
> > from the input format, how do I "add only the few splits I need to the
> > mapper and discard the others"? (PathFilters only work on files, not
> > blocks, I think.)
> >
> > Thanks.
> > -Gang
>
>
> --
> Best Regards
>
> Jeff Zhang
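For completeness, here is a rough sketch of the getSplits() override Amogh describes, using the old org.apache.hadoop.mapred API; the class name and the offset predicate are invented for illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical input format: computes splits as usual, then keeps only
// the ones matching a predicate, so only those splits get mapper tasks.
public class FilteringTextInputFormat extends TextInputFormat {

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    InputSplit[] all = super.getSplits(job, numSplits);
    List<InputSplit> kept = new ArrayList<InputSplit>();
    for (InputSplit split : all) {
      FileSplit fs = (FileSplit) split;
      // Example predicate: keep only splits starting in the first 128 MB
      // of each file. Substitute whatever offset/location test you need.
      if (fs.getStart() < 128L * 1024 * 1024) {
        kept.add(split);
      }
    }
    return kept.toArray(new InputSplit[kept.size()]);
  }
}

Filtering after super.getSplits() avoids copying the SPLIT_SLOP loop into your own code. Amogh's caveat still applies: make sure your record reader copes sensibly with records that span the boundary between a kept and a discarded split.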