As I don't have Hadoop installed (yet), I can't offer a test case, but I'm fairly confident there is a bug in TextInputFormat.
The current implementation ignores the first line of a file split when the previous split ended with a newline. There are two ways to fix this; the easiest (and most efficient) is for the preceding split to always read up to the first newline in the succeeding split.

In http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup, changing:

    public class TextInputFormat extends InputFormatBase {
      ...
      return new RecordReader() {
        ...
        /** Read a line. */
        public synchronized boolean next(Writable key, Writable value)
          throws IOException {
          long pos = in.getPos();
          if (pos >= end)
            return false;

to:

          if (pos > end)
            return false;

will do the trick.

Jim
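P.S. For anyone who wants to check the reasoning without a Hadoop install, here is a rough standalone sketch of the boundary condition. It is not the Hadoop code: the class SplitBoundaryDemo and the methods readSplit and readAll are hypothetical names, and it assumes that a reader whose split does not start at offset 0 discards everything up to and including its first newline, which is my reading of the behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitBoundaryDemo {

    /**
     * Reads the lines belonging to the split [start, end) of data.
     * ASSUMPTION: a split that does not start at offset 0 skips ahead past
     * its first newline, on the theory that the previous split owns that line.
     * inclusiveEnd = false models "if (pos >= end) return false;"
     * inclusiveEnd = true  models "if (pos > end) return false;"
     */
    public static List<String> readSplit(String data, int start, int end,
                                         boolean inclusiveEnd) {
        int pos = start;
        if (start != 0) {
            int nl = data.indexOf('\n', start);
            pos = (nl < 0) ? data.length() : nl + 1;
        }
        List<String> lines = new ArrayList<>();
        while (pos < data.length()) {           // stop at end of file
            boolean pastEnd = inclusiveEnd ? pos > end : pos >= end;
            if (pastEnd) {
                break;
            }
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? data.length() : nl;
            lines.add(data.substring(pos, lineEnd));
            pos = lineEnd + 1;                  // step over the newline
        }
        return lines;
    }

    /** Reads data as two splits, [0, boundary) and [boundary, length). */
    public static List<String> readAll(String data, int boundary,
                                       boolean inclusiveEnd) {
        List<String> all =
            new ArrayList<>(readSplit(data, 0, boundary, inclusiveEnd));
        all.addAll(readSplit(data, boundary, data.length(), inclusiveEnd));
        return all;
    }

    public static void main(String[] args) {
        String data = "aa\nbb\ncc\n";
        // Offset 3 is the start of "bb", so the first split ends on a newline.
        System.out.println(readAll(data, 3, false)); // [aa, cc]
        System.out.println(readAll(data, 3, true));  // [aa, bb, cc]
    }
}
```

With the split boundary at offset 3, the ">=" check stops the first reader before "bb" while the second reader skips it, so the line is read by neither; with ">" the first reader picks it up. A boundary in the middle of a line (offset 4, say) gives all three lines under either check in this sketch.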
