On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo <wenrui....@ericsson.com> wrote:
> Hi, all
>
> I have a large CSV file (larger than 10 GB), and I'd like to use a certain
> InputFormat to split it into smaller parts so that each Mapper can deal
> with a piece of the CSV file. However, as far as I know, FileInputFormat
> only cares about the byte size of the file; that is, the class may divide
> the CSV file into many parts, and some parts may not be well-formed CSV.
> For example, a line of the CSV file may not be terminated with CRLF, or
> some text may be trimmed.
>
> How do I ensure each FileSplit is a smaller, valid CSV file using a proper
> InputFormat?
>
> BR/anderson

If all you care about is the splits occurring at line boundaries, then
TextInputFormat will work:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html

If not, I guess you can write your own InputFormat class.

--
Harish Mallipeddi
http://blog.poundbang.in
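For what it's worth, the trick TextInputFormat uses to reconcile byte-based
splits with line boundaries can be sketched in plain Java. This is a
simplified simulation of the idea behind Hadoop's LineRecordReader, not the
actual Hadoop code; the class and method names (SplitDemo, readSplit) are
made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // Simulates how a line-aware record reader handles a byte-range split:
    // skip the partial first line (unless the split starts at offset 0),
    // and read past the split end to finish the last line. Every line then
    // lands in exactly one split, even when split boundaries cut through
    // the middle of a line.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        // Not the first split: the previous reader owns the partial line,
        // so skip forward to the byte after the next '\n'.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;
        }
        // Read whole lines; any line that *begins* before 'end' belongs to
        // this split, even if it terminates after 'end'.
        while (pos < data.length && pos < end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++; // step over the '\n'
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] csv = "a,1\nb,2\nc,3\n".getBytes();
        // Split boundary at byte 5 falls in the middle of "b,2":
        // the first reader finishes that line, the second skips it.
        System.out.println(readSplit(csv, 0, 5));  // [a,1, b,2]
        System.out.println(readSplit(csv, 5, 12)); // [c,3]
    }
}
```

Note this only guarantees record boundaries coincide with newlines; if your
CSV allows embedded newlines inside quoted fields, line-based splitting is
not enough and a custom InputFormat is the way to go.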