On Thu, Jul 24, 2008 at 12:32 AM, Lincoln Ritter <[EMAIL PROTECTED]> wrote:
> Alejandro said:
>> Take a look at the MultipleOutputFormat class or MultipleOutputs (in SVN tip)
>
> I'm muddling through both
> http://issues.apache.org/jira/browse/HADOOP-2906 and
> http://issues.apache.org/jira/browse/HADOOP-3149 trying to make sense
> of these. I'm a little confused by the way this works but it looks
> like I can define a number of named outputs which looks like it
> enables different output formats and I can also define some of these
> as "multi", meaning that I can write to different "targets" (like
> files). Is this correct?

Exactly.

....

> A couple of questions:
>
> - I needed to pass 'null' to the collect method so as to not write
> the key to the file. These files are meant to be consumable chunks of
> content so I want to control exactly what goes into them. Does this
> seem normal or am i missing something? Is there a downside to passing
> null here?

I'm not sure what happens if you write null as the key or the value.

> - What is the 'part-00000' file for? I have seen this in other
> places in the dfs. But it seems extraneous here. It's not super
> critical but if I can make it go away that would be great.

This is the standard output of the M/R job: whatever is written to the
OutputCollector you get in the reduce() call (or in the map() call when
the job has zero reduces).

> - What is the purpose of the '-r-00000' suffix? Perhaps it is to
> help with collisions?

Yes. Files written from a map have '-m-', files written from a reduce
have '-r-'.

> I guess it seems strange that I can't just say
> "the output file should be called X" and have an output file called X
> appear.

Well, you need the map/reduce mask and the task-number mask to avoid
collisions: several tasks may write the same named output at the same
time, and each one needs its own file.
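
To make the collision point concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It is not Hadoop's code; the class OutputNames and the helper partName are made up for illustration. It only assumes the pattern discussed above: a base name, then '-m-' or '-r-', then a five-digit zero-padded task number as in 'part-00000'.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class OutputNames {
    // Hypothetical helper (not a Hadoop API): builds "<base>-m-NNNNN" for a
    // map task or "<base>-r-NNNNN" for a reduce task.
    static String partName(String base, boolean isMap, int taskNumber) {
        return String.format(Locale.ROOT, "%s-%s-%05d",
                base, isMap ? "m" : "r", taskNumber);
    }

    public static void main(String[] args) {
        // Three map tasks and three reduce tasks all write an output named
        // "chunk"; the task-type mask and task number keep the files apart.
        Set<String> names = new HashSet<>();
        for (int task = 0; task < 3; task++) {
            names.add(partName("chunk", true, task));
            names.add(partName("chunk", false, task));
        }
        System.out.println(names.size());                 // prints 6
        System.out.println(partName("chunk", false, 0));  // prints chunk-r-00000
    }
}
```

If every task wrote to a file called just "chunk", the six writers above would all target the same path, which is why a bare "call the output X" is not offered.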