Re: Large size Text file split
Thanks, Zhengguo, for your answer. I have read the source of LineRecordReader, and it seems that the start and end points are determined only roughly by the FileSplit. I traced the code to FileSplit and found that the splits are made by FileInputFormat's getSplits() method. So the FileSplit boundaries are rough, and record integrity is ensured by LineRecordReader. -- Zhong Wang
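For reference, a minimal sketch of that byte-based getSplits() computation. This is a simplification, not the actual FileInputFormat source: block locations, min/max split size configuration, and compression handling are omitted, and the class and field names are illustrative.

import java.util.ArrayList;
import java.util.List;

// Simplified sketch of FileInputFormat-style splitting: splits are cut
// purely by byte offset, so nothing here looks at line or record boundaries.
public class ByteSplitSketch {

    static final class Split {
        final long start;   // byte offset where the split begins
        final long length;  // number of bytes in the split
        Split(long start, long length) { this.start = start; this.length = length; }
    }

    static List<Split> getSplits(long fileLength, int numSplits) {
        long goalSize = Math.max(1L, fileLength / numSplits); // desired bytes per split
        List<Split> splits = new ArrayList<Split>();
        for (long start = 0; start < fileLength; start += goalSize) {
            splits.add(new Split(start, Math.min(goalSize, fileLength - start)));
        }
        return splits;
    }
}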
Re: Large size Text file split
Mapper 2 starts reading at byte 1. It finds the first newline at byte 10020, so the first real record it processes starts at byte 10021. There's one problem: how does Mapper 2 know that the real record starts at byte 10021 before Mapper 1 reaches the end of Split 1? The mappers start at the same time. -- Zhong Wang
Re: Large size Text file split
Mapper 2 doesn't wait for Mapper 1; they start at the same time. It recognizes the first real record by looking at the characters it reads: when it sees a newline, the byte after it is the start of a real record, and it discards everything before that newline. Check the source code of LineRecordReader; you will find the details there.
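To make that concrete, here is a minimal sketch of the skip-ahead step, paraphrased from what LineRecordReader does rather than copied from it. The class name and structure are illustrative, and the caller is assumed to have already seeked the stream to the split's start offset.

import java.io.IOException;
import java.io.InputStream;

// Sketch of split alignment: every mapper except the one that owns the
// first split discards bytes up to and including the first newline it
// sees, because that partial line belongs to the previous split.
public class LineAlignedReader {

    private long pos; // current byte offset within the file

    public LineAlignedReader(InputStream in, long splitStart) throws IOException {
        // Assumes 'in' is already positioned at splitStart.
        this.pos = splitStart;
        if (splitStart != 0) {
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    break; // the next byte begins the first full record
                }
            }
        }
    }

    public long position() { return pos; }
}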
Large size Text file split
Hi all, I have a large CSV file (larger than 10 GB), and I'd like to use a suitable InputFormat to split it into smaller parts so that each Mapper can deal with a piece of the file. However, as far as I know, FileInputFormat only cares about the byte size of the file: it can divide the CSV file into any number of parts, and some parts may not be well-formed CSV. For example, a line of the CSV file may not be terminated with CRLF, or some text may be truncated. How can I ensure that each FileSplit is a smaller, valid CSV file, using a proper InputFormat? BR/anderson
Re: Large size Text file split
If all you care about is that the splits occur at line boundaries, then TextInputFormat will work: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html If not, you can write your own InputFormat class. -- Harish Mallipeddi http://blog.poundbang.in
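A minimal job-configuration sketch using the old org.apache.hadoop.mapred API that the linked Javadoc documents. The input/output paths are placeholders, and the mapper is left at the identity default.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class CsvJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CsvJob.class);
        conf.setJobName("csv-lines");

        // TextInputFormat is the default, but set it explicitly for clarity.
        // Keys are byte offsets (LongWritable), values are whole lines (Text);
        // the record reader guarantees no line is split across two mappers.
        conf.setInputFormat(TextInputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path("/input/large.csv"));   // placeholder path
        FileOutputFormat.setOutputPath(conf, new Path("/output/csv-lines")); // placeholder path

        JobClient.runJob(conf);
    }
}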
Re: Large size Text file split
There is always NLineInputFormat. You specify the number of lines per split; the key is the position of the line start in the file, and the value is the line itself. The parameter mapred.line.input.format.linespermap controls the number of lines per split. -- Pro Hadoop, a book to guide you from beginner to Hadoop mastery: http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com, a community for Hadoop professionals
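A sketch of wiring that up with the old API, where NLineInputFormat lives under org.apache.hadoop.mapred.lib; the paths and the 100000-line figure below are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(NLineJob.class);
        conf.setJobName("csv-nline");

        // Each split (and therefore each mapper) gets exactly N input lines.
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 100000); // N lines per split

        FileInputFormat.setInputPaths(conf, new Path("/input/large.csv"));   // placeholder
        FileOutputFormat.setOutputPath(conf, new Path("/output/csv-nline")); // placeholder

        JobClient.runJob(conf);
    }
}

One caveat worth noting: computing N-line splits requires scanning the input to count lines, so split calculation on a 10 GB file is itself a sequential pass, which the byte-based splits of TextInputFormat avoid.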
RE: Large size Text file split
I think the default TextInputFormat can meet my requirement. The JavaDoc of TextInputFormat says it divides the input file into text lines ending with CRLF, but I'd like to know: if the FileSplit size is not an exact multiple of the line length, what will happen? BR/anderson
Re: Large size Text file split
The FileSplit boundaries are rough edges -- the mapper responsible for the previous split will continue until it finds a full record, and the next mapper will read ahead and only start on the first record boundary after the byte offset. - Aaron
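Expressed as code, the reading side of that contract looks roughly like this. It is a simplified sketch, not the LineRecordReader source: the exact boundary conditions in Hadoop differ, and it assumes the reader has already been aligned to the first full record as in the earlier sketch.

import java.io.BufferedReader;
import java.io.IOException;

// Sketch of the per-record loop: the mapper keeps emitting records as long
// as the next record *starts* inside its split. A record that starts inside
// the split but runs past the end offset is still read in full here; the
// next mapper skips that same tail as a partial first line.
public class SplitRecordLoop {

    public static void readSplit(BufferedReader reader, long start, long end)
            throws IOException {
        long pos = start; // byte offset of the next record (simplified: 1 byte per char)
        String line;
        while (pos < end && (line = reader.readLine()) != null) {
            process(line);            // hand the record to the mapper
            pos += line.length() + 1; // +1 for the newline terminator
        }
    }

    private static void process(String line) {
        System.out.println(line);
    }
}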
RE: Large size Text file split
I don't fully understand the internal logic of the FileSplit and the Mapper. By my understanding, FileInputFormat is the class that actually takes care of the file splitting, so it's reasonable that one large file is split into 5 smaller parts, each part less than 2 GB (since we specify numberOfSplits = 5). However, the FileSplit edges are rough, so does Mapper 1, which takes Split 1 as input, omit the incomplete record at the end of Split 1, while Mapper 2 continues reading that incomplete part and joins it with the beginning of Split 2?

Take this as an example. The original file is:

1::122::5::838985046 (CRLF)
1::185::5::838983525 (CRLF)
1::231::5::838983392 (CRLF)

Assume the number of splits is 2; then the above content is divided into two parts:

Split 1:
1::122::5::838985046 (CRLF)
1::185::5::8

Split 2:
38983525 (CRLF)
1::231::5::838983392 (CRLF)

Afterwards, Mapper 1 takes Split 1 as input, but after consuming the line 1::122::5::838985046 it finds that the remainder is not a complete record, so Mapper 1 bypasses it, while Mapper 2 reads it and puts it ahead of the first bytes of Split 2 to compose a valid record. Is that correct? If it is, which class implements this logic?

BR/anderson