On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo <wenrui....@ericsson.com> wrote:

> Hi, all
>
> I have a large CSV file (larger than 10 GB). I'd like to use an
> InputFormat that splits it into smaller parts so that each Mapper can
> deal with a piece of the CSV file. However, as far as I know,
> FileInputFormat only cares about the byte size of the file: it can
> divide the CSV file into many parts, and some parts may not be
> well-formed CSV. For example, a line of the CSV file may not be
> terminated with CRLF, or some text may be trimmed.
>
> How can I ensure each FileSplit is a smaller, valid CSV file, using a
> proper InputFormat?
>
> BR/anderson
>

If all you care about is that the splits occur at line boundaries, then
TextInputFormat will work.
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html

If not, I guess you can write your own InputFormat class.
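To see why byte-based splits still yield whole lines: each reader skips the partial line at the start of its split (the previous split's reader owns it and reads past its own split boundary to finish it). Below is a minimal, Hadoop-free sketch of that convention; the class and method names are illustrative, not Hadoop's actual API (Hadoop implements this inside LineRecordReader).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of line-boundary splitting: a reader whose byte range
// does not begin at a line boundary skips its first (partial) line, and every
// reader finishes the line that straddles its range's end. Together, readers
// over adjacent ranges cover every line exactly once.
public class LineSplitSketch {

    // Return the complete lines "owned" by the byte range [start, end).
    public static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        // Skip the partial first line; the previous range's reader owns it.
        if (start != 0 && data[start - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline
        }
        // Emit every line that *begins* before 'end', reading past 'end'
        // if necessary to finish the last one.
        while (pos < data.length && pos < end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart,
                                 StandardCharsets.UTF_8));
            pos++; // skip the newline
        }
        return lines;
    }
}
```

For example, splitting "a,1\nb,2\nc,3\n" at byte 5 (mid-record) gives the first reader ["a,1", "b,2"] and the second ["c,3"]: no record is cut or duplicated. The same reasoning is why this only works for records that never contain embedded newlines; quoted multi-line CSV fields would need a custom InputFormat.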

-- 
Harish Mallipeddi
http://blog.poundbang.in
