No, the merge and sort will not happen in a mapper task. And each mapper task will generate one output file.
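For reference, a map-only job is simply one with the reducer count set to zero, either with setNumReduceTasks(0) in the driver or via the equivalent job configuration property (a minimal sketch; property name per the classic MapReduce configuration):

```
mapred.reduce.tasks=0
```

With zero reducers, each mapper's output is written directly to the output format (one file per mapper), so the spill merge/sort phase is skipped.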
2010/1/29 Gang Luo <lgpub...@yahoo.com.cn>

> Hi all,
> If I only use the map side to process my data (set # of reducers to 0), what
> is the behavior of Hadoop? Will it merge and sort each of the spills
> generated by one mapper?
>
> -Gang
>
>
> ----- Original Message ----
> From: Gang Luo <lgpub...@yahoo.com.cn>
> To: common-user@hadoop.apache.org
> Sent: 2010/1/29 (Fri) 8:54:33 AM
> Subject: Re: fine granularity operation on HDFS
>
> Yeah, I see how it works. Thanks Amogh.
>
> -Gang
>
>
> ----- Original Message ----
> From: Amogh Vasekar <am...@yahoo-inc.com>
> To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> Sent: 2010/1/28 (Thu) 10:00:22 AM
> Subject: Re: fine granularity operation on HDFS
>
> Hi Gang,
> Yes, PathFilters work only on file paths. I meant you can include that type
> of logic at the split level.
> The input format's getSplits() method is responsible for computing splits and
> adding them to a list container, for which the JobTracker initializes mapper
> tasks. You can override getSplits() to add only a few splits, say, based on
> location or offset, to the list. Here's the reference:
>
>     while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
>       int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
>       splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
>                                blkLocations[blkIndex].getHosts()));
>       bytesRemaining -= splitSize;
>     }
>
>     if (bytesRemaining != 0) {
>       splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
>                                blkLocations[blkLocations.length-1].getHosts()));
>     }
>
> Before splits.add() you can apply your logic for discarding. However, you
> need to ensure your record reader takes care of incomplete records at split
> boundaries.
>
> To get the block locations to load separately, the FileSystem class exposes
> a few methods like getBlockLocations() etc.
> Hope this helps.
>
> Amogh
>
> On 1/28/10 7:26 PM, "Gang Luo" <lgpub...@yahoo.com.cn> wrote:
>
> Thanks Amogh.
> For the second part of my question, I actually mean loading a block
> separately from HDFS. I don't know whether that is realistic. Anyway, since
> my goal is to process different divisions of a file separately, doing that
> at the split level is OK. But even if I can get the splits from the
> InputFormat, how do I "add only the few splits I need to the mapper and
> discard the others"? (PathFilters only work on files, not blocks, I think.)
>
> Thanks.
> -Gang

--
Best Regards

Jeff Zhang
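The getSplits() loop Amogh quotes can be sketched in plain Java without the Hadoop classes, to show where the discarding hook goes. Here FileSplit is replaced by a simple (offset, length) pair, and keepSplit() is a hypothetical predicate standing in for whatever offset- or location-based logic you add before splits.add():

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of FileInputFormat.getSplits()'s slicing loop, with a
// filtering hook where Amogh suggests discarding unwanted splits.
public class SplitSketch {
    static final double SPLIT_SLOP = 1.1; // 10% slack, as in Hadoop

    // Stand-in for FileSplit: just an offset and a length within the file.
    record Split(long offset, long length) {}

    // Hypothetical discarding logic: keep only splits in the first half of
    // the file. Replace with your own offset/location test.
    static boolean keepSplit(long offset, long fileLength) {
        return offset < fileLength / 2;
    }

    static List<Split> getSplits(long length, long splitSize) {
        List<Split> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            long offset = length - bytesRemaining;
            if (keepSplit(offset, length)) {      // the discarding hook
                splits.add(new Split(offset, splitSize));
            }
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {                // trailing partial split
            long offset = length - bytesRemaining;
            if (keepSplit(offset, length)) {
                splits.add(new Split(offset, bytesRemaining));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // 1000-byte file, 100-byte splits: 10 slices, only offsets < 500 kept.
        for (Split s : getSplits(1000, 100)) {
            System.out.println(s.offset() + "," + s.length());
        }
    }
}
```

As the thread notes, in a real InputFormat the record reader must still handle records that straddle the boundaries of the splits you keep.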