Re: How to write one file per key as mapreduce output

Alejandro Abdelnur Tue, 29 Jul 2008 09:28:16 -0700

On Thu, Jul 24, 2008 at 12:32 AM, Lincoln Ritter
<[EMAIL PROTECTED]> wrote:


> Alejandro said:
>> Take a look at the MultipleOutputFormat class or MultipleOutputs (in SVN tip)
>
> I'm muddling through both
> http://issues.apache.org/jira/browse/HADOOP-2906 and
> http://issues.apache.org/jira/browse/HADOOP-3149 trying to make sense
> of these.  I'm a little confused by the way this works but it looks
> like I can define a number of named outputs which looks like it
> enables different output formats and I can also define some of these
> as "multi", meaning that I can write to different "targets" (like
> files).  Is this correct?

Exactly.

....

> A couple of questions:
>
>  - I needed to pass 'null' to the collect method so as to not write
> the key to the file.  These files are meant to be consumable chunks of
> content so I want to control exactly what goes into them.  Does this
> seem normal or am i missing something?  Is there a downside to passing
> null here?

Not sure what happens if you write NULL as key or value.
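
That said, as far as I recall, the record writer behind the plain text
output format skips a null key together with the tab separator and writes
only the value. Please check this against the version you run. The sketch
below only mimics that behavior in plain Java; NullKeyLine and formatLine
are made-up names for illustration, not Hadoop classes.

```java
public class NullKeyLine {
    // Hypothetical stand-in for a text record writer (NOT the Hadoop class).
    // Assumption: a null key or value is skipped, and the tab separator is
    // written only when both key and value are present.
    static String formatLine(Object key, Object value) {
        if (key == null && value == null) {
            return "";                       // nothing to write at all
        }
        if (key == null) {
            return value + "\n";             // value only, no leading tab
        }
        if (value == null) {
            return key + "\n";               // key only
        }
        return key + "\t" + value + "\n";    // usual "key<TAB>value" line
    }

    public static void main(String[] args) {
        System.out.print(formatLine(null, "chunk of content"));
        System.out.print(formatLine("someKey", "chunk of content"));
    }
}
```

If that holds, passing null as the key is a reasonable way to keep the
files down to just the content.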

>  - What is the 'part-00000' file for?  I have seen this in other
> places in the dfs. But it seems extraneous here.  It's not super
> critical but if I can make it go away that would be great.

This is the standard output of the M/R job: whatever is written to the
OutputCollector you get in the reduce() call (or in the map() call
when the number of reduces is 0).

>  - What is the purpose of the '-r-00000' suffix?  Perhaps it is to
> help with collisions?

Yes. Files written from a map task have '-m-'; files written from a
reduce task have '-r-'.

> I guess it seems strange that I can't just say
> "the output file should be called X" and have an output file called X
> appear.

Well, you need the map/reduce marker and the task number in the file
name to avoid collisions, because several tasks can write to the same
named output.
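
To make the collision point concrete, here is a rough sketch of the naming
scheme you are seeing. It is not Hadoop's own code; OutputNames and
partFileName are hypothetical names, and the five-digit padding is taken
from the '-r-00000' suffix in your listing.

```java
import java.util.Locale;

public class OutputNames {
    // Hypothetical helper (NOT part of Hadoop): builds "<base>-m-NNNNN" for a
    // map task or "<base>-r-NNNNN" for a reduce task, zero-padded to 5 digits.
    static String partFileName(String base, boolean fromReduce, int taskNumber) {
        String phase = fromReduce ? "r" : "m";
        return String.format(Locale.ROOT, "%s-%s-%05d", base, phase, taskNumber);
    }

    public static void main(String[] args) {
        // Two reduce tasks writing to the same logical output "X"
        // end up in two different files instead of clobbering each other.
        System.out.println(partFileName("X", true, 0));   // X-r-00000
        System.out.println(partFileName("X", true, 1));   // X-r-00001
        System.out.println(partFileName("X", false, 0));  // X-m-00000
    }
}
```

With a single file literally called X, every task writing to that output
would be targeting the same path.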
