a) This is a small file by Hadoop standards.  You should be able to process
it with conventional methods on a single machine in about the time it
takes to start a Hadoop job that does nothing at all.
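For what it's worth, the single-machine version is only a few lines of
Java.  A minimal sketch (the class name is arbitrary, and it assumes one
integer per line with a total that fits in a signed 64-bit long):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LocalSum {
    public static void main(String[] args) throws IOException {
        long sum = 0L;  // assumes the total fits in a long
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    sum += Long.parseLong(line);  // one number per line
                }
            }
        }
        System.out.println(sum);
    }
}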

b) Reading a single line at a time is not as inefficient as you might think.
If you write a mapper that reads each line, converts it to an integer, and
outputs a constant integer as the key with the number as the value, the
mapper will process the data reasonably quickly.  If you add a combiner and
a reducer that add up the numbers in each list, then the amount of data
spilled will be nearly zero.
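Roughly what that looks like with the ordinary TextInputFormat, which
already hands each mapper a whole split's worth of lines.  This is a sketch
only; the class names are mine, and it assumes the values and their total
fit in a long:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SumJob {

    public static class SumMapper
            extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
        private static final IntWritable ONE = new IntWritable(1);  // constant key
        private final LongWritable value = new LongWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String s = line.toString().trim();
            if (!s.isEmpty()) {
                value.set(Long.parseLong(s));
                context.write(ONE, value);  // constant key, the number as value
            }
        }
    }

    // Used as both combiner and reducer: sums the values for the constant key.
    public static class SumReducer
            extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<LongWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            long sum = 0L;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "sum");
        job.setJarByClass(SumJob.class);
        job.setInputFormatClass(TextInputFormat.class);  // default splits, many lines per mapper
        job.setMapperClass(SumMapper.class);
        job.setCombinerClass(SumReducer.class);          // combining makes spill nearly zero
        job.setReducerClass(SumReducer.class);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the combiner collapses each mapper's output to a single partial sum,
almost nothing is shuffled or spilled to disk.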


On Fri, Dec 17, 2010 at 7:58 AM, madhu phatak <phatak....@gmail.com> wrote:

> Hi,
> I have a very large file of size 1.4 GB. Each line of the file is a number.
> I want to find the sum of all those numbers.
> I wanted to use NLineInputFormat as an InputFormat, but it sends only one
> line to the Mapper, which is very inefficient.
> So can you guide me to write an InputFormat which splits the file into
> multiple splits so that each mapper can read multiple lines from each split?
>
> Regards
> Madhukar
>
