The FileSplit boundaries are "rough" edges -- the mapper responsible for the
previous split will read past the end of its split until it has finished the
record it is in the middle of, and the next mapper will skip ahead and only
start on the first record boundary after its starting byte offset.
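
To make the rule concrete, here is a minimal sketch in plain Java. It is not
Hadoop's actual record reader: the class SplitBoundaryDemo and the method
readSplit are hypothetical names made up for illustration, the "file" is an
in-memory byte array, and records are assumed to end with a single '\n'.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it only imitates the boundary rule described above.
class SplitBoundaryDemo {

    /** Returns the lines owned by the split covering bytes [start, start + length). */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = Math.min(data.length, start + length);
        int pos = start;
        if (pos != 0) {
            // Not the first split: back up one byte, then discard everything
            // through the next newline. If the split begins exactly at a line
            // start, only the previous newline is discarded and the line is kept.
            pos--;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // first byte after the newline
        }
        List<String> lines = new ArrayList<>();
        // A line belongs to this split if it STARTS before the split's end,
        // even when its last bytes lie beyond that end.
        while (pos < end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step over the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "a,1\nbb,22\nccc,333\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 7; // deliberately not a multiple of any line length
        for (int start = 0; start < data.length; start += splitSize) {
            System.out.println("split at " + start + ": " + readSplit(data, start, splitSize));
        }
    }
}
```

With 7-byte splits over these three lines, the first split reads two whole
lines (the second one runs past byte 7), the second split skips the tail of
that line and reads the third, and the last split finds nothing left to read.
Every line is read exactly once and none is cut in half.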
- Aaron

On Wed, Jun 10, 2009 at 7:53 PM, Wenrui Guo <wenrui....@ericsson.com> wrote:

> I think the default TextInputFormat can meet my requirement. The
> JavaDoc of TextInputFormat says that it divides the input file into
> text lines ending with CRLF. But I'd like to know: if the FileSplit
> size is not an exact multiple of the line length, what happens at the
> split boundary?
>
> BR/anderson
>
> -----Original Message-----
> From: jason hadoop [mailto:jason.had...@gmail.com]
> Sent: Wednesday, June 10, 2009 8:39 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Large size Text file split
>
> There is always NLineInputFormat. You specify the number of lines per
> split.
> The key is the byte offset of the line's start in the file; the value is
> the line itself.
> The parameter mapred.line.input.format.linespermap controls the number
> of lines per split.
>
> On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi <
> harish.mallipe...@gmail.com> wrote:
>
> > On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo <wenrui....@ericsson.com>
> > wrote:
> >
> > > Hi, all
> > >
> > > I have a large CSV file (larger than 10 GB), and I'd like to use an
> > > InputFormat to split it into smaller parts so that each Mapper
> > > can deal with one piece of the file. However, as far as I know,
> > > FileInputFormat only cares about the byte size of the file; that is,
> > > it can divide the CSV file into many parts, and some of those parts
> > > may not be well-formed CSV.
> > > For example, a line of the CSV file may not be terminated with CRLF,
> > > or some text may be truncated.
> > >
> > > How can I ensure that each FileSplit is a smaller, valid CSV file
> > > using a proper InputFormat?
> > >
> > > BR/anderson
> > >
> >
> > If all you care about is the splits occurring at line boundaries, then
> > TextInputFormat will work.
> >
> > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html
> >
> > If not, I guess you can write your own InputFormat class.
> >
> > --
> > Harish Mallipeddi
> > http://blog.poundbang.in
> >
>
>
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals
>
