You can use NLineInputFormat for this. It puts N lines of the input file into each split (N=1 by default).
So, with the default, each map task processes one line.
See http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
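
One thing to watch for with streaming: when the input format is anything other than the default TextInputFormat, the lines on the mapper's STDIN may arrive as key, tab, value, where the key is the line's byte offset in the input file. Below is a minimal Ruby sketch of a mapper front end that tolerates both layouts. It assumes the job is launched with streaming's -inputformat option naming org.apache.hadoop.mapred.lib.NLineInputFormat. The helper names extract_path and each_path are hypothetical, and the all-digits test for the key is an assumption that fails if a file name is itself all digits followed by a tab.

```ruby
# Hypothetical helper: return the file path carried by one mapper input line.
def extract_path(line)
  record = line.chomp
  key, value = record.split("\t", 2)
  # Assumption: a leading all-digit field is the byte-offset key that
  # streaming passes along for input formats other than TextInputFormat.
  if value && key =~ /\A\d+\z/
    value
  else
    record
  end
end

# Hypothetical helper: yield each non-empty path read from io (e.g. STDIN).
def each_path(io)
  io.each_line do |line|
    path = extract_path(line)
    yield path unless path.empty?
  end
end
```

In the mapper script, the STDIN.each_line loop would then become each_path(STDIN) with the long-running processing inside the block. If more than one line per map task is wanted, N is read from the mapred.line.input.format.linespermap property, to the best of my knowledge.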

-Amareshwari

S D wrote:
Hello,

I have a clarifying question about Hadoop streaming. I'm new to the list and
didn't see anything posted that covers my questions - my apologies if I
overlooked a relevant post.

I have an input file consisting of a list of files (one per line) that need
to be processed independently of each other. The duration for processing
each file is significant - perhaps an hour each. I'm using Hadoop streaming
without a reduce function to process each file and save the results (back to
S3 native in my case). To handle the long processing time of each file I've
set mapred.task.timeout=0 and I have a pretty straightforward Ruby script
reading from STDIN:

STDIN.each_line do |line|
  # Get file from contents of line
  # Process file (long running)
end

Currently I'm using a cluster of 3 workers in which each worker can have up
to 2 tasks running simultaneously. I've noticed that if I have a single
input file with many lines (more than 6 given my cluster), then not all
workers will be allocated tasks; I've noticed two workers being allocated
one task each and the other worker sitting idly. If I split my input file
into multiple files (at least 6) then all workers will be immediately
allocated the maximum number of tasks that they can handle.

My interpretation of this is fuzzy. It seems that Hadoop streaming will take
separate input files and allocate a new task per file (up to the maximum
constraint), but if given a single input file it is unclear whether a
new task is allocated per file or per line. My understanding of Hadoop Java is
that (unlike Hadoop streaming) when given a single input file, the file will
be broken up into separate lines and the maximum number of map tasks will
automagically be allocated to handle the lines of the file (assuming the use
of TextInputFormat).

Can someone clarify this?

Thanks,
SD

