Re: get name of file in mapper output directory

2011-05-25 Thread Luca Pireddu
On May 25, 2011 00:28:10 Mark question wrote:
> Thanks to both of you for the comments. I finally managed to get the
> output file of the current mapper, but I couldn't use it because mappers
> apparently write to a "_temporary" directory while they are in progress.
> So in Mapper.close, the file it wrote to (e.g. "part-0") does not exist yet.
> 
> There has to be another way to get the produced file. I need to sort it
> immediately within mappers.
> 
> Again, your thoughts are really helpful !
> 
> Mark

Indeed, output is written to the _temporary directory and then moved by a 
FileOutputCommitter once all tasks are done.

Why do you need to sort within the mappers?  Hadoop sorts as part of the 
regular workflow.  In fact, notice that your reducer receives the keys in 
sorted order.  You should probably look for a way to satisfy your goal by 
adapting bits of the workflow pipeline.

Maybe you should tell us what you're trying to achieve.  If the regular sort 
order isn't what you need, then just write a custom sort comparator class, 
which you insert into the workflow with Job.setSortComparatorClass.  I can 
point you to an example if you need.
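
As a rough illustration of the kind of ordering logic such a comparator would carry, here is a minimal sketch in plain Java. It deliberately does not use the Hadoop classes so that it runs on its own: in a real job the same compare logic would live in a class implementing RawComparator (commonly by extending WritableComparator) and be registered with Job.setSortComparatorClass. The class name DescendingKeyOrder and the helper sortKeys are hypothetical, and the descending order is only an assumed example of a "non-regular" sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical example: orders keys in descending rather than ascending order.
// In Hadoop this logic would sit inside a RawComparator implementation.
final class DescendingKeyOrder implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        // Swap the operands to invert the natural (ascending) order.
        return b.compareTo(a);
    }

    // Returns a sorted copy so the caller's list is left untouched.
    static List<String> sortKeys(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(new DescendingKeyOrder());
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sortKeys(Arrays.asList("apple", "cherry", "banana")));
    }
}
```

With a comparator like this in place, the reducer would see its keys from largest to smallest instead of the default order.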



-- 
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel:  +39 0709250452


Re: get name of file in mapper output directory

2011-05-24 Thread Mark question
Thanks to both of you for the comments. I finally managed to get the
output file of the current mapper, but I couldn't use it because mappers
apparently write to a "_temporary" directory while they are in progress.
So in Mapper.close, the file it wrote to (e.g. "part-0") does not exist yet.

There has to be another way to get the produced file. I need to sort it
immediately within mappers.

Again, your thoughts are really helpful !

Mark

On Mon, May 23, 2011 at 5:51 AM, Luca Pireddu  wrote:

>
>
> The path is defined by the FileOutputFormat in use.  In particular, I think
> this function is responsible:
>
>
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getDefaultWorkFile(org.apache.hadoop.mapreduce.TaskAttemptContext, java.lang.String)
>
> It should give you the file path before all tasks have completed and the
> output is committed to the final output path.
>
> Luca
>
> On May 23, 2011 14:42:04 Joey Echeverria wrote:
> > Hi Mark,
> >
> > FYI, I'm moving the discussion over to
> > mapreduce-u...@hadoop.apache.org since your question is specific to
> > MapReduce.
> >
> > You can derive the output name from the TaskAttemptID, which you can
> > get by calling getTaskAttemptID() on the context passed to your
> > cleanup() function. The task attempt ID will look like this:
> >
> > attempt_200707121733_0003_m_000005_0
> >
> > You're interested in the m_000005 part. This gets translated into the
> > output file name part-m-00005.
> >
> > -Joey
> >
> > On Sat, May 21, 2011 at 8:03 PM, Mark question wrote:
> > > Hi,
> > >
> > > I'm running a map-only job, and at the end of each map (i.e. in the
> > > close() function) I want to open the file that the current map has
> > > written using its output collector.
> > >
> > > I know "job.getWorkingDirectory()" would give me the parent path of
> > > the file written, but how do I get the full path or the name (e.g.
> > > part-0 or part-1)?
> > >
> > > Thanks,
> > > Mark
>
> --
> Luca Pireddu
> CRS4 - Distributed Computing Group
> Loc. Pixina Manna Edificio 1
> Pula 09010 (CA), Italy
> Tel:  +39 0709250452
>


Re: get name of file in mapper output directory

2011-05-23 Thread Luca Pireddu


The path is defined by the FileOutputFormat in use.  In particular, I think 
this function is responsible:

http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getDefaultWorkFile(org.apache.hadoop.mapreduce.TaskAttemptContext, java.lang.String)

It should give you the file path before all tasks have completed and the output 
is committed to the final output path.
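
To make the shape of that pre-commit path concrete, here is a small plain-Java sketch. It is based on my reading of the 0.20.x FileOutputCommitter, where the work directory appears to be the job output directory plus "_temporary" plus the attempt ID prefixed with an underscore; treat that layout as an assumption and check it against your Hadoop version. The class WorkFilePaths and its method are hypothetical and only build a string, they do not call Hadoop.

```java
// Hypothetical helper, not part of Hadoop: builds the path where a task's
// output is assumed to live before it is committed (0.20.x layout).
final class WorkFilePaths {
    private WorkFilePaths() {}

    static String defaultWorkFile(String outputDir, String attemptId, String partFileName) {
        // Drop a trailing slash so the result does not contain "//".
        String base = outputDir.endsWith("/")
                ? outputDir.substring(0, outputDir.length() - 1)
                : outputDir;
        // Assumed layout: <output dir>/_temporary/_<attempt id>/<part file>
        return base + "/_temporary/_" + attemptId + "/" + partFileName;
    }

    public static void main(String[] args) {
        System.out.println(defaultWorkFile(
                "/user/mark/out",                        // example output directory
                "attempt_200707121733_0003_m_000005_0",  // example attempt ID
                "part-m-00005"));                        // example part file name
    }
}
```

Inside a running task, calling getDefaultWorkFile on the output format is the safer route, since it does not depend on this layout staying the same.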

Luca

On May 23, 2011 14:42:04 Joey Echeverria wrote:
> Hi Mark,
> 
> FYI, I'm moving the discussion over to
> mapreduce-u...@hadoop.apache.org since your question is specific to
> MapReduce.
> 
> You can derive the output name from the TaskAttemptID, which you can
> get by calling getTaskAttemptID() on the context passed to your
> cleanup() function. The task attempt ID will look like this:
> 
> attempt_200707121733_0003_m_000005_0
> 
> You're interested in the m_000005 part. This gets translated into the
> output file name part-m-00005.
> 
> -Joey
> 
> On Sat, May 21, 2011 at 8:03 PM, Mark question  wrote:
> > Hi,
> > 
> > I'm running a map-only job, and at the end of each map (i.e. in the
> > close() function) I want to open the file that the current map has
> > written using its output collector.
> > 
> > I know "job.getWorkingDirectory()" would give me the parent path of the
> > file written, but how do I get the full path or the name (e.g. part-0 or
> > part-1)?
> > 
> > Thanks,
> > Mark

-- 
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel:  +39 0709250452


Re: get name of file in mapper output directory

2011-05-23 Thread Joey Echeverria
Hi Mark,

FYI, I'm moving the discussion over to
mapreduce-u...@hadoop.apache.org since your question is specific to
MapReduce.

You can derive the output name from the TaskAttemptID, which you can
get by calling getTaskAttemptID() on the context passed to your
cleanup() function. The task attempt ID will look like this:

attempt_200707121733_0003_m_000005_0

You're interested in the m_000005 part. This gets translated into the
output file name part-m-00005.
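
The translation can be sketched in plain Java as below. This is only an illustration working on the string form of the ID; it assumes the new-API FileOutputFormat naming, where the task number is zero-padded to five digits (so task 5 of a map becomes part-m-00005). The class PartFileNames and its method are hypothetical. In a real task you would more likely read the task number from the TaskAttemptID object than parse the string.

```java
import java.util.Locale;

// Hypothetical helper, not part of Hadoop: derives the default part-file
// name from the string form of a task attempt ID.
final class PartFileNames {
    private PartFileNames() {}

    static String fromAttemptId(String attemptId) {
        // Expected layout:
        // attempt_<jobtracker start>_<job number>_<m|r>_<task number>_<attempt number>
        String[] fields = attemptId.split("_");
        if (fields.length != 6 || !fields[0].equals("attempt")) {
            throw new IllegalArgumentException("Not a task attempt ID: " + attemptId);
        }
        String type = fields[3];
        if (!type.equals("m") && !type.equals("r")) {
            throw new IllegalArgumentException("Unknown task type: " + type);
        }
        int taskNumber = Integer.parseInt(fields[4]);
        // Assumed naming: part-<m|r>-<task number padded to five digits>
        return String.format(Locale.ROOT, "part-%s-%05d", type, taskNumber);
    }

    public static void main(String[] args) {
        System.out.println(fromAttemptId("attempt_200707121733_0003_m_000005_0"));
    }
}
```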

-Joey

On Sat, May 21, 2011 at 8:03 PM, Mark question  wrote:
> Hi,
>
> I'm running a map-only job, and at the end of each map (i.e. in the
> close() function) I want to open the file that the current map has
> written using its output collector.
>
> I know "job.getWorkingDirectory()" would give me the parent path of the
> file written, but how do I get the full path or the name (e.g. part-0 or
> part-1)?
>
> Thanks,
> Mark
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


get name of file in mapper output directory

2011-05-21 Thread Mark question
Hi,

I'm running a map-only job, and at the end of each map (i.e. in the
close() function) I want to open the file that the current map has
written using its output collector.

I know "job.getWorkingDirectory()" would give me the parent path of the
file written, but how do I get the full path or the name (e.g. part-0 or
part-1)?

Thanks,
Mark