Re: Best approach for accessing secondary map task outputs from reduce tasks?

2011-02-13 Thread Jacques
Everybody, thanks for all the help. Chris/Jason, assumption 1) is actually incorrect for my situation. Nonetheless, I can see how one would basically use a dynamic-typing approach to send the additional data as the first keys for each partition. It seems less than elegant but doable. The

Re: Best approach for accessing secondary map task outputs from reduce tasks?

2011-02-13 Thread Jason
I think this kind of partitioner is a little hackish. A more straightforward approach is to emit the extra data N times under special keys and write a partitioner that recognizes these keys and dispatches them accordingly among partitions 0..N-1. Also if this data needs to be shipped to reduc
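
A minimal sketch of the "special keys plus partitioner" idea described above, written as a plain Python simulation rather than against the Hadoop API. The names `SUMMARY_TAG`, `emit_summary`, `partition` and `shuffle` are hypothetical and exist only for illustration; a real job would put the same dispatch rule into a custom Partitioner class.

```python
from zlib import crc32

# Hypothetical marker that identifies the special summary keys.
SUMMARY_TAG = "__summary__"


def emit_summary(split_index, summary, num_reducers):
    """Emit one copy of a map task's summary per reducer, under special keys."""
    return [((SUMMARY_TAG, r, split_index), summary) for r in range(num_reducers)]


def partition(key, num_reducers):
    """Send a special key to the partition it names; hash every other key."""
    if isinstance(key, tuple) and key and key[0] == SUMMARY_TAG:
        return key[1]
    return crc32(str(key).encode("utf-8")) % num_reducers


def shuffle(records, num_reducers):
    """Group (key, value) records into per-reducer lists using partition()."""
    parts = [[] for _ in range(num_reducers)]
    for key, value in records:
        parts[partition(key, num_reducers)].append((key, value))
    return parts
```

With N reducers, each map task calls `emit_summary` once, so every reducer receives exactly one copy of every map task's summary, while ordinary keys are still spread by hash. Making the summaries arrive before the regular records in each partition would additionally depend on the key sort order, which this sketch does not model.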

Re: Best approach for accessing secondary map task outputs from reduce tasks?

2011-02-13 Thread Harsh J
From my experience, writing data is possible using MO in both Map and Reduce sides of a single MR job. All data written to the MO name in map-side is committed just like it would if the job were a map-only job (there's no difference, since a map task does not wait for reduce tasks to begin - it is

Re: Best approach for accessing secondary map task outputs from reduce tasks?

2011-02-13 Thread Chris Douglas
If these assumptions are correct: 0) Each map outputs one result, a few hundred bytes 1) The map output is deterministic, given an input split index 2) Every reducer must see the result from every map Then just output the result N times, where N is the number of reducers, using a custom Partition

Re: Best approach for accessing secondary map task outputs from reduce tasks?

2011-02-13 Thread Jacques
It was my understanding, based on the FAQ and my personal experience, that using the MultipleOutputs class, or just relying on the OutputCommitter, only works for the final phase of the job (e.g. the reduce phase in a map+reduce job, and the map phase only in the case of reducer=NONE). In the case I'm t

Re: Best approach for accessing secondary map task outputs from reduce tasks?

2011-02-13 Thread Harsh J
With just HDFS, IMO a good approach would be (2). See this FAQ on task-specific HDFS output directories you can use: http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F. It'd also be much easier to use the MultipleOutputs class (o

Re: get Map tasks info in command line

2011-02-13 Thread Mahadev Konar
You should be able to use hadoop job -events to get the task completion events from the JobTracker. Here is a link: http://hadoop.apache.org/common/docs/current/commands_manual.html#job Thanks, Mahadev On Sun, Feb 13, 2011 at 8:45 AM, Pedro Costa wrote: > Hi, > > 1 - How do I get the name of t

Best approach for accessing secondary map task outputs from reduce tasks?

2011-02-13 Thread Jacques
I'm outputting a small amount of secondary summary information from a map task that I want to use in the reduce phase of the job. This information is keyed on a custom input split index. Each map task outputs this summary information (less than a hundred bytes per input task). Note that the summar

get Map tasks info in command line

2011-02-13 Thread Pedro Costa
Hi, 1 - How do I get the names of the map tasks that ran, from the command line? 2 - How do I get the start time and the end time of a map task from the command line? -- Pedro

get duration of MR tasks by command line

2011-02-13 Thread Pedro Costa
Hi, I would like to get, from the command line, how long each Map and Reduce task took to run. How is this possible? Thanks, -- Pedro

Save Hadoop examples results locally?

2011-02-13 Thread Pedro Costa
Hi, I'm running the GridMix2 examples and I would like to retrieve all the results produced by the tests and save the files locally, to read them later, offline. Is there a command for that? Thanks -- Pedro