a) This is a small file by Hadoop standards. You should be able to process it by conventional methods on a single machine in about the same time it takes to start a Hadoop job that does nothing at all.
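For comparison, the single-machine version is just a loop. A minimal sketch (illustrative, reading one number per line from standard input; the class name LocalSum is made up):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class LocalSum {
      public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        long sum = 0;
        String line;
        while ((line = in.readLine()) != null) {
          String trimmed = line.trim();
          if (!trimmed.isEmpty()) {
            sum += Long.parseLong(trimmed);  // one number per line
          }
        }
        System.out.println(sum);
      }
    }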
b) Reading a single line at a time is not as inefficient as you might think. If you write a mapper that reads each line, converts it to an integer, and emits a constant integer key paired with the number it read, the mapper will process the data reasonably quickly. If you then add a combiner and a reducer that sum up the numbers in each list, the amount of data spilled will be nearly zero. A sketch of this mapper/combiner/reducer follows the quoted message below.

On Fri, Dec 17, 2010 at 7:58 AM, madhu phatak <phatak....@gmail.com> wrote:
> Hi,
> I have a very large file of size 1.4 GB. Each line of the file is a number.
> I want to find the sum of all those numbers.
> I wanted to use NLineInputFormat as the InputFormat, but it sends only one
> line to the mapper, which is very inefficient.
> So can you guide me in writing an InputFormat which splits the file into
> multiple splits, so that each mapper can read multiple lines from its split?
>
> Regards
> Madhukar
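Here is a minimal sketch of the approach in (b), assuming the org.apache.hadoop.mapreduce API with the default TextInputFormat; the class names are illustrative, and the same reducer class can serve as the combiner since summation is associative:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumNumbers {
      // Emits (constant key, number parsed from the line), so every value
      // lands under a single key.
      public static class SumMapper
          extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final LongWritable number = new LongWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          number.set(Long.parseLong(line.toString().trim()));
          context.write(ONE, number);
        }
      }

      // Doubles as combiner and reducer: sums all values for the key.
      public static class SumReducer
          extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<LongWritable> values,
            Context context) throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable v : values) {
            sum += v.get();
          }
          context.write(key, new LongWritable(sum));
        }
      }
    }

In the driver you would set job.setCombinerClass(SumReducer.class) alongside job.setReducerClass(SumReducer.class); with only one key, the combiner collapses each map task's output to a single partial sum before the shuffle, which is why almost nothing gets spilled.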