No, the merge and sort will not happen in a mapper task. And each mapper task will generate one output file.
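For reference, a map-only job is simply one with the reducer count set to zero, either with setNumReduceTasks(0) in the driver or via the equivalent job configuration property (a minimal sketch; property name per the classic MapReduce configuration):

```
mapred.reduce.tasks=0
```

With zero reducers, each mapper's output is written directly to the output format (one file per mapper), so the spill merge/sort phase is skipped.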
2010/1/29 Gang Luo <lgpub...@yahoo.com.cn>

> Hi all,
> If I only use the map side to process my data (set # of reducers to 0), what
> is the behavior of Hadoop? Will it merge and sort each of the spills
> generated by one mapper?
>
> -Gang
>
>
> ----- Original Message ----
> From: Gang Luo <lgpub...@yahoo.com.cn>
> To: common-user@hadoop.apache.org
> Sent: 2010/1/29 (Fri) 8:54:33 AM
> Subject: Re: fine granularity operation on HDFS
>
> Yeah, I see how it works. Thanks Amogh.
>
> -Gang
>
>
> ----- Original Message ----
> From: Amogh Vasekar <am...@yahoo-inc.com>
> To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> Sent: 2010/1/28 (Thu) 10:00:22 AM
> Subject: Re: fine granularity operation on HDFS
>
> Hi Gang,
> Yes, PathFilters work only on file paths. I meant you can include that type
> of logic at the split level.
> The input format's getSplits() method is responsible for computing splits and
> adding them to a list container, for which the JobTracker initializes mapper
> tasks. You can override getSplits() to add only a few splits, say, based on
> location or offset, to the list. Here's the reference:
>
>     while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
>       int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
>       splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
>                                blkLocations[blkIndex].getHosts()));
>       bytesRemaining -= splitSize;
>     }
>
>     if (bytesRemaining != 0) {
>       splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
>                                blkLocations[blkLocations.length-1].getHosts()));
>     }
>
> Before splits.add() you can apply your logic for discarding. However, you
> need to ensure your record reader takes care of incomplete records at split
> boundaries.
>
> To get the block locations to load separately, the FileSystem class exposes
> a few methods like getBlockLocations() etc.
> Hope this helps.
>
> Amogh
>
> On 1/28/10 7:26 PM, "Gang Luo" <lgpub...@yahoo.com.cn> wrote:
>
> Thanks Amogh.
> For the second part of my question, I actually mean loading a block
> separately from HDFS. I don't know whether that is realistic. Anyway, since
> my goal is to process different divisions of a file separately, doing that
> at the split level is OK. But even if I can get the splits from the
> InputFormat, how do I "add only the few splits I need to the mapper and
> discard the others"? (PathFilters only work on files, not blocks, I think.)
>
> Thanks.
> -Gang

--
Best Regards

Jeff Zhang
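The getSplits() loop Amogh quotes can be sketched in plain Java without the Hadoop classes, to show where the discarding hook goes. Here FileSplit is replaced by a simple (offset, length) pair, and keepSplit() is a hypothetical predicate standing in for whatever offset- or location-based logic you add before splits.add():

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of FileInputFormat.getSplits()'s slicing loop, with a
// filtering hook where Amogh suggests discarding unwanted splits.
public class SplitSketch {
    static final double SPLIT_SLOP = 1.1; // 10% slack, as in Hadoop

    // Stand-in for FileSplit: just an offset and a length within the file.
    record Split(long offset, long length) {}

    // Hypothetical discarding logic: keep only splits in the first half of
    // the file. Replace with your own offset/location test.
    static boolean keepSplit(long offset, long fileLength) {
        return offset < fileLength / 2;
    }

    static List<Split> getSplits(long length, long splitSize) {
        List<Split> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            long offset = length - bytesRemaining;
            if (keepSplit(offset, length)) {      // the discarding hook
                splits.add(new Split(offset, splitSize));
            }
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {                // trailing partial split
            long offset = length - bytesRemaining;
            if (keepSplit(offset, length)) {
                splits.add(new Split(offset, bytesRemaining));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // 1000-byte file, 100-byte splits: 10 slices, only offsets < 500 kept.
        for (Split s : getSplits(1000, 100)) {
            System.out.println(s.offset() + "," + s.length());
        }
    }
}
```

As the thread notes, in a real InputFormat the record reader must still handle records that straddle the boundaries of the splits you keep.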