Hi,
I have a task to process large quantities of files by converting them
into other formats. Each file is processed as a whole and converted to a
target format. Since there are hundreds of GB of data, I thought it suitable
for Hadoop, but the problem is, I don't think the files can be broken
apart
-
From: Zak, Richard [USA] [mailto:zak_rich...@bah.com]
Sent: Wednesday, January 21, 2009 6:42 AM
To: core-user@hadoop.apache.org
Subject: RE: Suitable for Hadoop?
You can do that. I did a Map/Reduce job for about 6 GB of PDFs to
concatenate them, and the New York Times used Hadoop to process a few TB of
PDFs.
[...] in a beneficial manner, and the distributed part is very
helpful!
Richard J. Zak
-----Original Message-----
From: Darren Govoni [mailto:dar...@ontrenet.com]
Sent: Wednesday, January 21, 2009 08:08
To: core-user@hadoop.apache.org
Subject: Suitable for Hadoop?
Hi,
I have a task to process
To: core-user@hadoop.apache.org
Subject: RE: Suitable for Hadoop?
You can do that. I did a Map/Reduce job for about 6 GB of PDFs to
concatenate them, and the New York Times used Hadoop to process a few TB of
PDFs.
What I would do is this:
- Use the iText library, a Java library for PDF manipulation (don't
From: Jim Twensky [mailto:jim.twen...@gmail.com]
Sent: Wednesday, January 21, 2009 11:47 AM
To: core-user@hadoop.apache.org
Subject: Re: Suitable for Hadoop?
Ricky,
Hadoop was originally optimized for large files, usually files larger
than one input split. However, there is an input format called
To: core-user@hadoop.apache.org
Subject: RE: Suitable for Hadoop?
Jim, thanks for your explanation. But isn't isSplitable an option in
writing output rather than reading input?
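
For reference, in Hadoop's Java API isSplitable is a method on FileInputFormat, so it applies when reading input: an input format that returns false from it hands each file to a single map task as a whole. The Python sketch below is only a toy model of that idea, not Hadoop code. The names compute_splits and Split, the file name, and the 64 MB split size are assumptions made for illustration, and Hadoop's real split calculation has details this sketch ignores.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for an input split: one byte range of one file."""
    path: str
    offset: int
    length: int


def compute_splits(path: str, file_size: int, split_size: int,
                   splitable: bool) -> List[Split]:
    """Toy model of how a file-based input format carves a file into splits.

    If the format says the file is not splitable, the whole file becomes a
    single split, so one map task sees the file from start to end.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    if not splitable or file_size <= split_size:
        return [Split(path, 0, file_size)]

    splits = []
    offset = 0
    while offset < file_size:
        # The last split may be shorter than split_size.
        length = min(split_size, file_size - offset)
        splits.append(Split(path, offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    mb = 1024 * 1024
    # Splitable input: the file is cut into several byte ranges.
    for s in compute_splits("report.pdf", 150 * mb, 64 * mb, splitable=True):
        print(s)
    # Non-splitable input: one split covering the whole file.
    for s in compute_splits("report.pdf", 150 * mb, 64 * mb, splitable=False):
        print(s)
```

With splitable set to False, every file maps to exactly one split regardless of its size, which is the behavior wanted when a file (a PDF, for example) only makes sense when processed as a whole.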
There are two phases.
1) Upload the data from the local file system to HDFS. Is there an option in the
hadoop fs copy to pack multiple small files