Hmmm ... From a space-efficiency perspective, given that HDFS (with its large block size) expects large files, is Hadoop optimized for processing a large number of small files? Does each file take up at least one block, or can multiple files sit in the same block?
Rgds,
Ricky

-----Original Message-----
From: Zak, Richard [USA] [mailto:zak_rich...@bah.com]
Sent: Wednesday, January 21, 2009 6:42 AM
To: core-user@hadoop.apache.org
Subject: RE: Suitable for Hadoop?

You can do that. I did a Map/Reduce job over about 6 GB of PDFs to concatenate them, and the New York Times used Hadoop to process a few TB of PDFs. What I would do is this:

- Use the iText library, a Java library for PDF manipulation (I don't know what you would use for reading Word docs).
- Don't use any Reducers.
- Have the input be a text file listing the directory (or directories) to process, since the mapper takes in file contents (and you don't want to read in one line of binary).
- Have each map call process all the contents of the one directory named on its input line.
- Break the documents down into more directories to go easier on the memory.
- Use Amazon's EC2 and the scripts in <hadoop_dir>/src/contrib/ec2/bin/ (there is a script which passes environment variables to launched instances; modify it to let Hadoop use more memory by setting the HADOOP_HEAPSIZE environment variable and having the variable properly passed).

I realize this isn't the strong point of Map/Reduce or Hadoop, but it still uses the HDFS in a beneficial manner, and the distributed part is very helpful! (A rough sketch of such a mapper appears at the end of this thread.)

Richard J. Zak

-----Original Message-----
From: Darren Govoni [mailto:dar...@ontrenet.com]
Sent: Wednesday, January 21, 2009 08:08
To: core-user@hadoop.apache.org
Subject: Suitable for Hadoop?

Hi,
I have a task to process large quantities of files by converting them into other formats. Each file is processed as a whole and converted to a target format. Since there are hundreds of GB of data, I thought it suitable for Hadoop, but the problem is that I don't think the files can be broken apart and processed. For example, how would MapReduce work to convert a Word document to PDF if the file is reduced to blocks? I'm not sure that's possible, or is it?

Thanks for any advice,
Darren
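
Putting Richard's recipe together, here is a minimal, untested sketch in the old org.apache.hadoop.mapred API (current as of early 2009). Each input line names an HDFS directory, and each map() call converts every file in that directory as a whole, so no file is ever split across tasks. The class name DirConvertMapper is made up for illustration, and convertToPdf() is a placeholder that just copies bytes; the real iText (or Word-reading) call would go where the comment says.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: each input line names an HDFS directory; every
// plain file in that directory is converted whole. Run the job with
// zero reducers, per Richard's advice.
public class DirConvertMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FileSystem fs;

  @Override
  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException("Cannot reach HDFS", e);
    }
  }

  public void map(LongWritable offset, Text dirLine,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    Path dir = new Path(dirLine.toString().trim());
    for (FileStatus stat : fs.listStatus(dir)) {
      if (stat.isDir()) {
        continue;  // only convert plain files
      }
      Path converted = convertToPdf(stat.getPath());
      output.collect(new Text(stat.getPath().toString()),
                     new Text(converted.toString()));
      reporter.progress();  // keep long conversions from timing out
    }
  }

  // Placeholder "conversion": copies the bytes to <name>.pdf unchanged.
  // A real job would read the source here and write it back out through
  // iText (or whatever library handles the input format).
  private Path convertToPdf(Path src) throws IOException {
    Path dst = new Path(src.getParent(), src.getName() + ".pdf");
    FSDataInputStream in = fs.open(src);
    FSDataOutputStream out = fs.create(dst, true);
    IOUtils.copyBytes(in, out, 4096, true);  // closes both streams
    return dst;
  }
}

Driven with the default TextInputFormat and setNumReduceTasks(0), each directory line becomes one map record and the map output is written straight to HDFS, which matches the "no Reducers" setup above.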