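The FileSplit boundaries are "rough" edges -- the mapper responsible for the previous split will keep reading past the end of its split until it has a complete record, and the next mapper will read ahead and only start on the first record boundary after its starting byte offset. So no record is cut in half, even when the split size is not a multiple of the line length.

- Aaron
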
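To make the "rough edges" concrete, here is a minimal sketch in Python. It is a simplified model, not Hadoop's actual Java code: it assumes newline-terminated records and assigns each line to the split that contains the line's first byte. The names `read_split` and `split_file` are made up for illustration.

```python
def split_file(data, split_size):
    """Cut the file into (start, length) byte ranges, ignoring line boundaries."""
    return [(s, min(split_size, len(data) - s))
            for s in range(0, len(data), split_size)]


def read_split(data, start, length):
    """Return the lines a mapper assigned bytes [start, start + length) would see.

    Simplified model (an assumption, not Hadoop's implementation):
    a line belongs to the split that contains its first byte.
    """
    end = start + length
    pos = start
    if start != 0:
        # Back up one byte and discard everything through the next newline.
        # If the byte before `start` is a newline, the split already begins on
        # a record boundary and nothing is thrown away.
        nl = data.find(b"\n", start - 1)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        # The record may run past `end`; the reader finishes it anyway.
        records.append(data[pos:stop])
        pos = stop + 1
    return records


if __name__ == "__main__":
    csv = b"a,1\nbb,22\nccc,333\ndddd,4444\n"
    for start, length in split_file(csv, 7):
        print((start, length), read_split(csv, start, length))
```

With a 7-byte split size none of the boundaries falls on a line end, yet every line comes out whole and exactly once. A split can also come out empty when the previous reader has already consumed its bytes.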
On Wed, Jun 10, 2009 at 7:53 PM, Wenrui Guo <wenrui....@ericsson.com> wrote:

> I think the default TextInputFormat can meet my requirement. The
> JavaDoc of TextInputFormat says that it divides the input file into
> text lines that end with CRLF. But I'd like to know: if the FileSplit
> size is not a multiple of the line length, what will happen?
>
> BR/anderson
>
> -----Original Message-----
> From: jason hadoop [mailto:jason.had...@gmail.com]
> Sent: Wednesday, June 10, 2009 8:39 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Large size Text file split
>
> There is always NLineInputFormat. You specify the number of lines per
> split.
> The key is the position of the line start in the file, value is the line
> itself.
> The parameter mapred.line.input.format.linespermap controls the number
> of lines per split.
>
> On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi
> <harish.mallipe...@gmail.com> wrote:
>
> > On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo <wenrui....@ericsson.com>
> > wrote:
> >
> > > Hi, all
> > >
> > > I have a large CSV file (larger than 10 GB). I'd like to use a
> > > certain InputFormat to split it into smaller parts so that each
> > > Mapper can deal with one piece of the CSV file. However, as far as
> > > I know, FileInputFormat only cares about the byte size of the file;
> > > that is, the class can divide the CSV file into many parts, and
> > > some parts may not be well-formed CSV files.
> > > For example, one line of the CSV file is not terminated with CRLF,
> > > or maybe some text is trimmed.
> > >
> > > How to ensure each FileSplit is a smaller valid CSV file using a
> > > proper InputFormat?
> > >
> > > BR/anderson
> >
> > If all you care about is the splits occurring at line boundaries, then
> > TextInputFormat will work.
> >
> > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html
> >
> > If not I guess you can write your own InputFormat class.
> >
> > --
> > Harish Mallipeddi
> > http://blog.poundbang.in
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals