TextInputFormat bug - lines which are not split

Jim White Mon, 21 Aug 2006 10:23:54 -0700

As I don't have Hadoop installed (yet), I can't offer a test case, but
I'm fairly confident there is a bug in TextInputFormat.


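The current implementation will drop the first line of a file split
when the previous split ends with a newline. As I read the code, each
split after the first skips ahead to its first newline, assuming the
preceding split has already read that line, but the preceding split
stops as soon as its position reaches the split end. A line that begins
exactly on the boundary is therefore read by neither.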

There are two ways to fix this. The easiest (and most efficient) is for
the preceding split to always read through the first newline in the
succeeding split.

Changing:

http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup

public class TextInputFormat extends InputFormatBase {
...
    return new RecordReader() {
...
        /** Read a line. */
        public synchronized boolean next(Writable key, Writable value)
          throws IOException {
          long pos = in.getPos();
          if (pos >= end)
            return false;

to:

          if (pos > end)
            return false;

will do the trick.
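
To make the boundary case concrete, here is a toy model of the two
checks. It is only a sketch and not the Hadoop code: the class
SplitBoundaryDemo, the method readSplit, and the in-memory byte array are
all made up for illustration, and the model assumes that a reader for any
split other than the first skips up to and including the first newline at
or after its start offset.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the record reader; not part of Hadoop.
class SplitBoundaryDemo {

    /**
     * Returns the lines a reader for the split [start, end) emits in this model.
     * readPastEnd == false models the original check (pos >= end);
     * readPastEnd == true models the proposed check (pos > end).
     */
    static List<String> readSplit(byte[] data, int start, int end, boolean readPastEnd) {
        List<String> lines = new ArrayList<>();
        int pos = start;

        // Assumption: a non-first split discards everything up to and
        // including the first newline at or after its start offset.
        if (start != 0) {
            while (pos < data.length && data[pos++] != '\n') {
                // keep scanning
            }
        }

        while (pos < data.length) {
            boolean done = readPastEnd ? pos > end : pos >= end;
            if (done) {
                break;
            }
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.US_ASCII));
            pos++; // step over the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        // Three 4-byte lines; the first split [0, 4) ends right after a newline.
        byte[] data = "aaa\nbbb\nccc\n".getBytes(StandardCharsets.US_ASCII);

        // Original check: "bbb" is read by neither split.
        System.out.println(readSplit(data, 0, 4, false));  // [aaa]
        System.out.println(readSplit(data, 4, 12, false)); // [ccc]

        // Proposed check: the first split reads one line past its end.
        System.out.println(readSplit(data, 0, 4, true));   // [aaa, bbb]
        System.out.println(readSplit(data, 4, 12, true));  // [ccc]
    }
}
```

In this model, a boundary that falls in the middle of a line gives the
same result with either check; only a boundary that sits exactly on the
start of a line behaves differently.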

Jim
