Re: Large size Text file split

2009-06-12 Thread Zhong Wang
Thanks Zhengguo for your answer.

I have read the source of LineRecordReader, and it seems that the start
and end points are determined only roughly by the FileSplit. I traced the
code to FileSplit and found that the splits are made by FileInputFormat's
getSplits() method. The FileSplit boundaries are rough, and record
integrity is ensured in LineRecordReader.
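
Roughly, the arithmetic is something like this (a simplified sketch of
what getSplits() does, not the actual Hadoop source; the split size and
host list are taken as given, and the real method also weighs block
locations, the configured minimum split size, and whether the file is
splittable at all):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;

// Simplified sketch: chop the file into fixed-size byte ranges.
static List<FileSplit> roughSplits(Path file, long fileLength,
                                   long splitSize, String[] hosts) {
  List<FileSplit> splits = new ArrayList<FileSplit>();
  long bytesRemaining = fileLength;
  while (bytesRemaining > 0) {
    long start = fileLength - bytesRemaining;
    long length = Math.min(splitSize, bytesRemaining);
    // This cut can land in the middle of a line; record integrity is
    // restored at read time by LineRecordReader, as noted above.
    splits.add(new FileSplit(file, start, length, hosts));
    bytesRemaining -= length;
  }
  return splits;
}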


On Thu, Jun 11, 2009 at 11:12 PM, Zhengguo 'Mike' SUN
zhengguo...@yahoo.com wrote:
 Mapper2 doesn't wait for Mapper1. They start at the same time. Mapper2 finds
 the real record by looking at the characters it reads: if it sees a newline,
 the byte after it is the start of the first real record, and it discards all
 the stuff before that newline. Check the source code of LineRecordReader for
 more detailed information.

 From: Zhong Wang wangzhong@gmail.com
 To: core-user@hadoop.apache.org
 Sent: Thursday, June 11, 2009 10:47:48 AM
 Subject: Re: Large size Text file split

 Mapper 2 starts reading at byte 1. It finds the first newline at byte
 10020, so the first real record it processes starts at byte 10021.


 There's one problem: how does Mapper2 know the real record starts at
 10021 before Mapper1 reaches the end of Split1? Mappers start at the
 same time.


 --
 Zhong Wang

-- 
Zhong Wang


Re: Large size Text file split

2009-06-11 Thread Zhong Wang
 Mapper 2 starts reading at byte 1. It finds the first newline at byte
 10020, so the first real record it processes starts at byte 10021.


There's one problem: how does Mapper2 know the real record starts at
10021 before Mapper1 reaches the end of Split1? Mappers start at the
same time.

-- 
Zhong Wang


Re: Large size Text file split

2009-06-11 Thread Zhengguo 'Mike' SUN
Mapper2 doesn't wait for Mapper1. They start at the same time. Mapper2 finds
the real record by looking at the characters it reads: if it sees a newline,
the byte after it is the start of the first real record, and it discards all
the stuff before that newline. Check the source code of LineRecordReader for
more detailed information.
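
Roughly, that part of the logic could be sketched as follows (simplified
and hypothetical, not the actual LineRecordReader source; the real reader
works on an FSDataInputStream and also handles CR/LF pairs and buffering):

import java.io.IOException;
import java.io.InputStream;

// Simplified sketch: discard everything up to the first newline, because
// those bytes belong to a record the previous split's mapper completes.
static long skipPartialFirstLine(InputStream in, long splitStart)
    throws IOException {
  if (splitStart == 0) {
    return 0; // The first split starts at byte 0, a record start by definition.
  }
  long skipped = 0;
  int b;
  while ((b = in.read()) != -1) {
    skipped++;
    if (b == '\n') {
      break; // The first real record starts at splitStart + skipped.
    }
  }
  return skipped;
}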


From: Zhong Wang wangzhong@gmail.com
To: core-user@hadoop.apache.org
Sent: Thursday, June 11, 2009 10:47:48 AM
Subject: Re: Large size Text file split

 Mapper 2 starts reading at byte 1. It finds the first newline at byte
 10020, so the first real record it processes starts at byte 10021.


There's one problem: how does Mapper2 know the real record starts at
10021 before Mapper1 reaches the end of Split1? Mappers start at the
same time.


-- 
Zhong Wang




Large size Text file split

2009-06-10 Thread Wenrui Guo
Hi, all

I have a large CSV file (larger than 10 GB), and I'd like to use a certain
InputFormat to split it into smaller parts so that each Mapper can deal
with a piece of the CSV file. However, as far as I know, FileInputFormat
only cares about the byte size of the file; that is, the class can divide
the CSV file into many parts, and some parts may not be well-formed CSV.
For example, one line of the CSV file may not be terminated with CRLF, or
some text may be truncated.

How can I ensure that each FileSplit is a smaller, valid CSV file by using
a proper InputFormat?

BR/anderson 


Re: Large size Text file split

2009-06-10 Thread Harish Mallipeddi
On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo wenrui@ericsson.com wrote:

 Hi, all

 I have a large CSV file (larger than 10 GB), and I'd like to use a certain
 InputFormat to split it into smaller parts so that each Mapper can deal
 with a piece of the CSV file. However, as far as I know, FileInputFormat
 only cares about the byte size of the file; that is, the class can divide
 the CSV file into many parts, and some parts may not be well-formed CSV.
 For example, one line of the CSV file may not be terminated with CRLF, or
 some text may be truncated.

 How can I ensure that each FileSplit is a smaller, valid CSV file by using
 a proper InputFormat?

 BR/anderson


If all you care about is the splits occurring at line boundaries, then
TextInputFormat will work.
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html

If not, I guess you can write your own InputFormat class.
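
For example, a minimal driver might look like this (a sketch against the
old mapred API; the class name, job name, and argument paths are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class CsvLineJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CsvLineJob.class);
    conf.setJobName("csv-lines");
    // TextInputFormat splits on byte boundaries, but each mapper still
    // sees only whole lines: key = byte offset, value = the line's text.
    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // No mapper/reducer set, so the identity classes are used.
    JobClient.runJob(conf);
  }
}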

-- 
Harish Mallipeddi
http://blog.poundbang.in


Re: Large size Text file split

2009-06-10 Thread jason hadoop
There is always NLineInputFormat: you specify the number of lines per split.
The key is the position of the line start in the file; the value is the line
itself. The parameter mapred.line.input.format.linespermap controls the
number of lines per split.
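
For example (a sketch reusing the JobConf driver from the previous reply;
the lines-per-split value here is arbitrary):

// Swap in NLineInputFormat and set the lines-per-split parameter.
conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 100000);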

On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi 
harish.mallipe...@gmail.com wrote:

 On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo wenrui@ericsson.com
 wrote:

  Hi, all

  I have a large CSV file (larger than 10 GB), and I'd like to use a
  certain InputFormat to split it into smaller parts so that each Mapper
  can deal with a piece of the CSV file. However, as far as I know,
  FileInputFormat only cares about the byte size of the file; that is, the
  class can divide the CSV file into many parts, and some parts may not be
  well-formed CSV. For example, one line of the CSV file may not be
  terminated with CRLF, or some text may be truncated.

  How can I ensure that each FileSplit is a smaller, valid CSV file by
  using a proper InputFormat?

  BR/anderson
 

 If all you care about is the splits occurring at line boundaries, then
 TextInputFormat will work.

 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html

 If not, I guess you can write your own InputFormat class.

 --
 Harish Mallipeddi
 http://blog.poundbang.in




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


RE: Large size Text file split

2009-06-10 Thread Wenrui Guo
I think the default TextInputFormat can meet my requirement. The JavaDoc
of TextInputFormat says it divides the input file into text lines ending
with CRLF. But I'd like to know: if the FileSplit size is not an exact
multiple of the line length, what will happen?

BR/anderson 

-Original Message-
From: jason hadoop [mailto:jason.had...@gmail.com] 
Sent: Wednesday, June 10, 2009 8:39 PM
To: core-user@hadoop.apache.org
Subject: Re: Large size Text file split

There is always NLineInputFormat: you specify the number of lines per
split.
The key is the position of the line start in the file; the value is the
line itself. The parameter mapred.line.input.format.linespermap controls
the number of lines per split.

On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi 
harish.mallipe...@gmail.com wrote:

 On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo wenrui@ericsson.com
 wrote:

  Hi, all

  I have a large CSV file (larger than 10 GB), and I'd like to use a
  certain InputFormat to split it into smaller parts so that each Mapper
  can deal with a piece of the CSV file. However, as far as I know,
  FileInputFormat only cares about the byte size of the file; that is, the
  class can divide the CSV file into many parts, and some parts may not be
  well-formed CSV. For example, one line of the CSV file may not be
  terminated with CRLF, or some text may be truncated.

  How can I ensure that each FileSplit is a smaller, valid CSV file by
  using a proper InputFormat?

  BR/anderson
 

 If all you care about is the splits occurring at line boundaries, then
 TextInputFormat will work.
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html

 If not, I guess you can write your own InputFormat class.

 --
 Harish Mallipeddi
 http://blog.poundbang.in




--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Large size Text file split

2009-06-10 Thread Aaron Kimball
The FileSplit boundaries are rough edges -- the mapper responsible for the
previous split will continue until it finds a full record, and the next
mapper will read ahead and only start on the first record boundary after the
byte offset.
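
In rough terms (a simplified, hypothetical sketch of that contract, not
the actual Hadoop source; start and end are the split's byte range, and
readLine() and emit() are made-up helpers):

long pos = start;              // already advanced past any partial line
while (pos <= end) {           // a record may begin at or before 'end'
  String line = readLine(in);  // reads up to and including the newline,
  if (line == null) break;     // possibly past 'end' (the read-ahead)
  emit(pos, line);             // key = byte offset, value = the line
  pos += line.length() + 1;    // +1 for the newline byte
}
// The line that overruns 'end' is exactly the one the next split's
// reader skips, so every record is processed once and only once.
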
- Aaron

On Wed, Jun 10, 2009 at 7:53 PM, Wenrui Guo wenrui@ericsson.com wrote:

 I think the default TextInputFormat can meet my requirement. The JavaDoc
 of TextInputFormat says it divides the input file into text lines ending
 with CRLF. But I'd like to know: if the FileSplit size is not an exact
 multiple of the line length, what will happen?

 BR/anderson

 -Original Message-
 From: jason hadoop [mailto:jason.had...@gmail.com]
 Sent: Wednesday, June 10, 2009 8:39 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Large size Text file split

 There is always NLineInputFormat: you specify the number of lines per
 split.
 The key is the position of the line start in the file; the value is the
 line itself. The parameter mapred.line.input.format.linespermap controls
 the number of lines per split.

 On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi 
 harish.mallipe...@gmail.com wrote:

  On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo wenrui@ericsson.com
  wrote:
 
   Hi, all

   I have a large CSV file (larger than 10 GB), and I'd like to use a
   certain InputFormat to split it into smaller parts so that each Mapper
   can deal with a piece of the CSV file. However, as far as I know,
   FileInputFormat only cares about the byte size of the file; that is,
   the class can divide the CSV file into many parts, and some parts may
   not be well-formed CSV. For example, one line of the CSV file may not
   be terminated with CRLF, or some text may be truncated.

   How can I ensure that each FileSplit is a smaller, valid CSV file by
   using a proper InputFormat?

   BR/anderson
  
 
  If all you care about is the splits occurring at line boundaries, then
  TextInputFormat will work.
  http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html

  If not, I guess you can write your own InputFormat class.
 
  --
  Harish Mallipeddi
  http://blog.poundbang.in
 



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.apress.com/book/view/9781430219422
 www.prohadoopbook.com a community for Hadoop Professionals



RE: Large size Text file split

2009-06-10 Thread Wenrui Guo
I don't fully understand the internal logic of FileSplit and the Mapper.

By my understanding, FileInputFormat is the class that actually takes care
of the file splitting. So it's reasonable that one large file is split
into 5 smaller parts, each part less than 2 GB (since we specify
numberOfSplit as 5).

However, since the FileSplit has rough edges, does mapper 1, which takes
split 1 as input, omit the incomplete record at the end of split 1, while
mapper 2 continues reading that incomplete part and joins it with the
beginning of split 2?

Take this as example:

The original file is:

1::122::5::838985046 (CRLF)
1::185::5::838983525 (CRLF)
1::231::5::838983392 (CRLF)

Assume the number of splits is 2; then the above content is divided into
two parts:

Split 1:
1::122::5::838985046 (CRLF)
1::185::5::8
 

Split 2:
38983525 (CRLF)
1::231::5::838983392 (CRLF)

Afterwards, Mapper 1 takes Split 1 as input, but after eating the line
1::122::5::838985046, it finds that the remaining part is not a complete
record, so Mapper 1 bypasses it, while Mapper 2 reads this part and adds
it ahead of the first line of Split 2 to compose a valid record.

Is that correct? If it is, which class implements the above logic?

BR/anderson

-Original Message-
From: Aaron Kimball [mailto:aa...@cloudera.com] 
Sent: Thursday, June 11, 2009 11:49 AM
To: core-user@hadoop.apache.org
Subject: Re: Large size Text file split

The FileSplit boundaries are rough edges -- the mapper responsible for
the previous split will continue until it finds a full record, and the
next mapper will read ahead and only start on the first record boundary
after the byte offset.
- Aaron

On Wed, Jun 10, 2009 at 7:53 PM, Wenrui Guo wenrui@ericsson.com
wrote:

 I think the default TextInputFormat can meet my requirement. The JavaDoc
 of TextInputFormat says it divides the input file into text lines ending
 with CRLF. But I'd like to know: if the FileSplit size is not an exact
 multiple of the line length, what will happen?

 BR/anderson

 -Original Message-
 From: jason hadoop [mailto:jason.had...@gmail.com]
 Sent: Wednesday, June 10, 2009 8:39 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Large size Text file split

 There is always NLineInputFormat: you specify the number of lines per
 split.
 The key is the position of the line start in the file; the value is the
 line itself. The parameter mapred.line.input.format.linespermap controls
 the number of lines per split.

 On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi  
 harish.mallipe...@gmail.com wrote:

  On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo wenrui@ericsson.com wrote:

   Hi, all

   I have a large CSV file (larger than 10 GB), and I'd like to use a
   certain InputFormat to split it into smaller parts so that each Mapper
   can deal with a piece of the CSV file. However, as far as I know,
   FileInputFormat only cares about the byte size of the file; that is,
   the class can divide the CSV file into many parts, and some parts may
   not be well-formed CSV. For example, one line of the CSV file may not
   be terminated with CRLF, or some text may be truncated.

   How can I ensure that each FileSplit is a smaller, valid CSV file by
   using a proper InputFormat?

   BR/anderson
  
 
  If all you care about is the splits occurring at line boundaries, then
  TextInputFormat will work.
  http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html

  If not, I guess you can write your own InputFormat class.
 
  --
  Harish Mallipeddi
  http://blog.poundbang.in
 



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.apress.com/book/view/9781430219422
 www.prohadoopbook.com a community for Hadoop Professionals