Hmmm ...

From a space-efficiency perspective, given that HDFS (with its large block
size) expects large files, is Hadoop optimized for processing a large number
of small files? Does each file take up at least one block, or can multiple
files share the same block?

Rgds,
Ricky
-----Original Message-----
From: Zak, Richard [USA] [mailto:zak_rich...@bah.com] 
Sent: Wednesday, January 21, 2009 6:42 AM
To: core-user@hadoop.apache.org
Subject: RE: Suitable for Hadoop?

You can do that.  I did a Map/Reduce job that concatenated about 6 GB of PDFs,
and The New York Times used Hadoop to process a few TB of PDFs.

What I would do is this (a rough sketch of the job skeleton follows the list):
- Use the iText library, a Java library for PDF manipulation (I don't know
what you would use for reading Word docs)
- Don't use any Reducers
- Have the input be a text file listing the directory (or directories) to
process, since the mapper takes its input line by line (and you don't want it
reading one line of binary at a time)
- Have each map task process all of the files in the one directory it gets
from the input text file
- Break the documents down into more directories to go easier on memory
- Use Amazon's EC2 and the scripts in <hadoop_dir>/src/contrib/ec2/bin/
(there is a script that passes environment variables to launched instances;
modify it so Hadoop can use more memory by setting the HADOOP_HEAPSIZE
environment variable and making sure the variable is actually passed through)
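
For what it's worth, here is a minimal sketch of that layout using the old
org.apache.hadoop.mapred API (current as of 0.19).  Everything iText- or
Word-specific is hidden behind convertOneFile(), which is a hypothetical
placeholder, not a real library call; paths, job name and heap settings are
just examples.

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;

public class DirConvertJob {

  /** Each input record is one line of the listing file: a directory to process. */
  public static class ConvertMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {

    private JobConf conf;

    public void configure(JobConf job) {
      this.conf = job;
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, NullWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString().trim();
      if (line.length() == 0) {
        return;                                  // skip blank lines
      }
      Path dir = new Path(line);
      FileSystem fs = dir.getFileSystem(conf);
      FileStatus[] files = fs.listStatus(dir);
      for (FileStatus stat : files) {
        if (stat.isDir()) {
          continue;                              // one level only in this sketch
        }
        convertOneFile(fs, stat.getPath());      // placeholder, see below
        // Emit the processed path so the job output doubles as a manifest.
        output.collect(new Text(stat.getPath().toString()), NullWritable.get());
        reporter.progress();                     // keep long conversions from timing out
      }
    }

    // Hypothetical placeholder: open the source document, convert it with
    // iText (or whatever reads Word files), and write the result back to HDFS.
    private void convertOneFile(FileSystem fs, Path src) throws IOException {
      // e.g. fs.open(src) ... fs.create(new Path(src + ".pdf")) ...
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(DirConvertJob.class);
    job.setJobName("convert-docs");

    job.setMapperClass(ConvertMapper.class);
    job.setNumReduceTasks(0);                    // map-only, as suggested above

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setOutputFormat(TextOutputFormat.class);

    // args[0] = text file listing one directory per line, args[1] = output dir
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Extra heap for the map JVMs that load whole documents; HADOOP_HEAPSIZE
    // only covers the daemons, the task heap is set here.
    job.set("mapred.child.java.opts", "-Xmx1024m");

    JobClient.runJob(job);
  }
}

One caveat: with the default TextInputFormat a small listing file usually ends
up as a single split, so every directory would go to one mapper.  Either split
the listing across several input files or, if your Hadoop build has it, use
org.apache.hadoop.mapred.lib.NLineInputFormat so each map task gets one line.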

I realize this isn't the strong point of Map/Reduce or Hadoop, but it still
uses HDFS in a beneficial manner, and the distributed execution is very
helpful!


Richard J. Zak

-----Original Message-----
From: Darren Govoni [mailto:dar...@ontrenet.com] 
Sent: Wednesday, January 21, 2009 08:08
To: core-user@hadoop.apache.org
Subject: Suitable for Hadoop?

Hi,
  I have a task to process large quantities of files by converting them into
other formats. Each file is processed as a whole and converted to a target
format. Since there are hundreds of GB of data, I thought Hadoop would be
suitable, but the problem is that I don't think the files can be broken apart
for processing. For example, how would MapReduce convert a Word document to
PDF if the file is split into blocks? I'm not sure that's possible, or is
it?

thanks for any advice
Darren
