I am using Hadoop for one of my research projects. I use NLineInputFormat for the map phase, which takes a few lines as one split. Each line specifies a filename. So if I have 10 input files 1..10 in my HDFS home, I would have an input file list like this:

~/1
~/2
.
.
.
~/10

It used to work fine, but recently I ran into this problem: the Map phase cannot finish because it always leaves out one split. For example, if I have 2 splits:

09/08/23 15:32:02 INFO mapred.FileInputFormat: Total input paths to process : 1
09/08/23 15:32:03 INFO mapred.JobClient: Running job: job_200908101504_0075
09/08/23 15:32:04 INFO mapred.JobClient: map 0% reduce 0%
09/08/23 15:32:10 INFO mapred.JobClient: map 50% reduce 0%
09/08/23 15:32:20 INFO mapred.JobClient: map 50% reduce 8%

And then everything is stuck there. I don't know why reduce gets to 8% even when Map is not finished. I am using Hadoop 0.19.1.

I think this is a Hadoop problem because at the very beginning of each map task I print out the input value, which is the name of the file that will be processed. When I look into the logs of all the mappers, many of these outputs are missing, meaning that some file locations are never sent to a Mapper.

Any comments or suggestions on how to fix this are welcome.

Another related question: is there a better way to split the Map inputs so that each raw binary file is one split, and the key = path of the file? SequenceFileInputFormat seems to require that both <K,V> are stored within the file.

Thanks,

--
----------------------------
Anh Nguyen
http://www.im-nguyen.com
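
To make the setup described above concrete, here is a minimal plain-Java sketch of the "N lines per split" idea. It is not Hadoop code and not NLineInputFormat's actual implementation; the class name NLineSplitSketch, the method splitLines, and the choice of 5 lines per split are all assumptions made for illustration (the post does not say how many lines per split were configured).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the "N lines per split" idea in plain Java.
// This is NOT Hadoop's NLineInputFormat implementation.
class NLineSplitSketch {

    // Groups the lines of a file list into consecutive chunks of at most
    // linesPerSplit lines; each chunk stands in for one map split.
    static List<List<String>> splitLines(List<String> lines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerSplit) {
            int end = Math.min(start + linesPerSplit, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // The file list from the post: ~/1 .. ~/10, one name per line.
        List<String> fileList = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            fileList.add("~/" + i);
        }
        // Assumed value: 5 lines per split, which yields two splits.
        List<List<String>> splits = splitLines(fileList, 5);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("split " + i + ": " + splits.get(i));
        }
    }
}
```

With ten file names and the assumed five lines per split, this produces two splits, so every file name should reach exactly one mapper. Comparing such an expected grouping against the names each mapper actually prints is one way to pin down which split is being dropped.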