In a map-only job, the map tasks are connected directly to the OutputFormat, so calling output.collect() / context.write() in the mapper emits records straight to files in HDFS without any sorting; no sort buffer is involved. If you want exactly one output file, follow Nick's advice.
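For reference, a minimal map-only job looks something like this (just a sketch against the 0.20 org.apache.hadoop.mapreduce API; the class names are mine, not from the thread):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyExample {

  // Pass-through mapper: every record written here goes straight to the
  // OutputFormat, since there is no reduce phase to sort or merge it.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value); // written directly to HDFS, unsorted
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "map-only example");
    job.setJarByClass(MapOnlyExample.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0); // zero reducers => map output is the job output
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each map task writes its own part file, so an N-mapper job leaves N unsorted files. Switching to job.setNumReduceTasks(1) with the default (identity) reducer forces the merge and sort and yields the single output file Nick describes.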
- Aaron

On Fri, Jan 29, 2010 at 8:32 AM, Jones, Nick <nick.jo...@amd.com> wrote:

> A single identity reducer will force a merge and sort to generate one
> file.
>
> Nick Jones
>
> -----Original Message-----
> From: Jeff Zhang [mailto:zjf...@gmail.com]
> Sent: Friday, January 29, 2010 10:06 AM
> To: common-user@hadoop.apache.org
> Subject: Re: map side only behavior
>
> No, the merge and sort will not happen in the mapper task, and each
> mapper task will generate one output file.
>
>
> 2010/1/29 Gang Luo <lgpub...@yahoo.com.cn>
>
> > Hi all,
> > If I use only the map side to process my data (setting the number of
> > reducers to 0), what is the behavior of Hadoop? Will it merge and sort
> > the spills generated by each mapper?
> >
> > -Gang
> >
> >
> > ----- Original Message ----
> > From: Gang Luo <lgpub...@yahoo.com.cn>
> > To: common-user@hadoop.apache.org
> > Sent: Friday, January 29, 2010 8:54:33 AM
> > Subject: Re: fine granularity operation on HDFS
> >
> > Yeah, I see how it works. Thanks Amogh.
> >
> > -Gang
> >
> >
> > ----- Original Message ----
> > From: Amogh Vasekar <am...@yahoo-inc.com>
> > To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> > Sent: Thursday, January 28, 2010 10:00:22 AM
> > Subject: Re: fine granularity operation on HDFS
> >
> > Hi Gang,
> > Yes, PathFilters work only on file paths. I meant you can include that
> > type of logic at the split level.
> > The input format's getSplits() method is responsible for computing the
> > splits and adding them to a list, from which the JobTracker initializes
> > mapper tasks. You can override getSplits() to add only a few of them,
> > say based on location or offset. Here's the reference code:
> >
> > while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
> >   int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
> >   splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
> >                            blkLocations[blkIndex].getHosts()));
> >   bytesRemaining -= splitSize;
> > }
> >
> > if (bytesRemaining != 0) {
> >   splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
> >                            blkLocations[blkLocations.length - 1].getHosts()));
> > }
> >
> > Before each splits.add() you can apply your logic for discarding splits.
> > However, you need to ensure your record reader takes care of incomplete
> > records at split boundaries.
> >
> > To load block locations separately, the FileSystem class exposes methods
> > such as getFileBlockLocations().
> > Hope this helps.
> >
> > Amogh
> >
> > On 1/28/10 7:26 PM, "Gang Luo" <lgpub...@yahoo.com.cn> wrote:
> >
> > Thanks Amogh.
> >
> > For the second part of my question, I actually meant loading blocks
> > separately from HDFS; I don't know whether that is realistic. Anyway,
> > since my goal is to process different divisions of a file separately,
> > doing it at the split level is fine. But even if I can get the splits
> > from the input format, how do I "add only the few splits I need to the
> > mapper and discard the others"? (PathFilters only work on files, not
> > blocks, I think.)
> >
> > Thanks.
> > -Gang
>
>
> --
> Best Regards
>
> Jeff Zhang
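For completeness, here is a rough sketch of the getSplits() override Amogh describes, using the old org.apache.hadoop.mapred API; the class name and the offset predicate are invented for illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical input format: computes splits as usual, then keeps only
// the ones matching a predicate, so only those splits get mapper tasks.
public class FilteringTextInputFormat extends TextInputFormat {

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    InputSplit[] all = super.getSplits(job, numSplits);
    List<InputSplit> kept = new ArrayList<InputSplit>();
    for (InputSplit split : all) {
      FileSplit fs = (FileSplit) split;
      // Example predicate: keep only splits starting in the first 128 MB
      // of each file. Substitute whatever offset/location test you need.
      if (fs.getStart() < 128L * 1024 * 1024) {
        kept.add(split);
      }
    }
    return kept.toArray(new InputSplit[kept.size()]);
  }
}

Filtering after super.getSplits() avoids copying the SPLIT_SLOP loop into your own code. Amogh's caveat still applies: make sure your record reader copes sensibly with records that span the boundary between a kept and a discarded split.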