Suitable for Hadoop?

2009-01-21 Thread Darren Govoni
Hi, I have a task to process large quantities of files by converting them into other formats. Each file is processed as a whole and converted to a target format. Since there are hundreds of GB of data I thought it suitable for Hadoop, but the problem is, I don't think the files can be broken apart.
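Hadoop aside for a moment, the workload described above is parallel at whole-file granularity: each file is an indivisible unit of work, so nothing needs to be split. A minimal plain-Java sketch of that pattern (no Hadoop APIs; `convert` here is a hypothetical stand-in for the real format converter):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Sketch: each input file is converted as an indivisible unit, so the
// parallelism is per file, not per block within a file.
public class WholeFileConvert {

    // Placeholder for the real conversion (e.g. a PDF or image tool).
    static String convert(String file) {
        return file.toUpperCase();
    }

    // Convert every file on a fixed-size thread pool, preserving order.
    static List<String> convertAll(List<String> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = files.stream()
                .map(f -> pool.submit(() -> convert(f)))
                .collect(Collectors.toList());
            List<String> out = new ArrayList<>();
            for (Future<String> fu : futures) out.add(fu.get());
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(convertAll(List.of("a.pdf", "b.pdf"), 2));
    }
}
```

The replies below discuss how to get the same per-file parallelism out of Hadoop, where the hard part is the input format, not the map logic.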

RE: Suitable for Hadoop?

2009-01-21 Thread Zak, Richard [USA]
> Hi, I have a task to process large quantities of files by converting them into other formats. Each file is processed as a whole and converted to a target format. Since there are hundreds of GB of data I thought …

RE: Suitable for Hadoop?

2009-01-21 Thread Ricky Ho
> You can do that. I did a Map/Reduce job for about 6 GB of PDFs to concatenate them, and the New York Times used Hadoop to process …

RE: Suitable for Hadoop?

2009-01-21 Thread Darren Govoni
> … in a beneficial manner, and the distributed part is very helpful! Richard J. Zak

Re: Suitable for Hadoop?

2009-01-21 Thread Jim Twensky
> You can do that. I did a Map/Reduce job for about 6 GB of PDFs to concatenate them, and the New York Times used Hadoop to process a few TB of PDFs. What I would do is this: - Use the iText library, a Java library for PDF manipulation (don't …

RE: Suitable for Hadoop?

2009-01-21 Thread Ricky Ho
> Ricky, Hadoop was formerly optimized for large files, usually files of size larger than one input split. However, there is an input format called …
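The input format Jim's snippet is cut off naming is presumably the era's MultiFileInputFormat, which hands several small files to one map task instead of launching a task per file. The grouping it performs can be sketched without Hadoop at all (the file sizes and 64 MB split budget below are illustrative, not from the thread):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the grouping a combining input format performs: pack
// consecutive small files into "splits" until a size budget (the
// split size) is reached, so one map task handles many files.
public class CombineSmallFiles {

    static List<List<Long>> group(List<Long> fileSizes, long splitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long used = 0;
        for (long size : fileSizes) {
            // Close the current split once adding this file would overflow it.
            if (!current.isEmpty() && used + size > splitSize) {
                splits.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(size);
            used += size;
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        // Ten 10 MB files against a 64 MB split budget -> 2 splits, not 10.
        List<Long> sizes = Collections.nCopies(10, 10L * 1024 * 1024);
        System.out.println(group(sizes, 64L * 1024 * 1024).size());
    }
}
```

Fewer splits means fewer map-task startups, which is exactly the overhead that makes many-small-files jobs slow on Hadoop.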

RE: Suitable for Hadoop?

2009-01-21 Thread Zak, Richard [USA]
> Jim, thanks for your explanation. But isn't isSplittable an option in writing output rather than reading input? There are two phases. 1) Upload the data from local file to HDFS. Is there an option in the hadoop fs copy to pack multiple small files …
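On the question cut off above: to my knowledge `hadoop fs -put` copies files one-to-one and has no flag to pack many small files; the usual workaround of that era was to roll them into a single container first (a SequenceFile keyed by filename, or a Hadoop Archive). A plain-Java stand-in using a zip container to show the pack/unpack shape (file names and contents are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Stand-in for packing many small files into one container before
// uploading: a zip stream here; on a real cluster one would write a
// SequenceFile keyed by filename instead.
public class PackSmallFiles {

    static void pack(Map<String, byte[]> files, OutputStream out) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Map.Entry<String, byte[]> e : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(e.getKey()));
                zip.write(e.getValue());
                zip.closeEntry();
            }
        }
    }

    static Map<String, byte[]> unpack(InputStream in) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                // read() returns -1 at the end of the current entry,
                // so readAllBytes() yields exactly this entry's bytes.
                files.put(e.getName(), zip.readAllBytes());
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> small = new LinkedHashMap<>();
        small.put("a.txt", "alpha".getBytes());
        small.put("b.txt", "beta".getBytes());
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        pack(small, buf);
        System.out.println(unpack(new ByteArrayInputStream(buf.toByteArray())).keySet());
    }
}
```

One container file uploads as one HDFS file with large blocks, sidestepping both the per-file NameNode overhead and the one-map-per-small-file problem in a single step.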