How are you splitting the zip right now? Do you have multiple mappers,
where each mapper starts at the beginning of the zip and reads forward
to the entries it cares about, or do you have just one mapper? If you
are doing it the first way, you may want to increase the replication
factor on the zip, so all of those concurrent readers are not
contending for the same few block replicas. Alternatively, you could
use multiple zip files, one per mapper that you want to launch.
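
If you go the one-zip-per-mapper route, the main thing is to keep
Hadoop from splitting the individual zips. A minimal sketch with the
new mapreduce API (class names are mine, for illustration; the
ZipEntryRecordReader it returns is sketched in the quoted message
below):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // One whole zip per map task: a zip's central directory lives at
    // the end of the file, so a partial byte range is useless anyway.
    public class WholeZipInputFormat extends FileInputFormat<Text, BytesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;  // never split; each zip file becomes one map task
        }

        @Override
        public RecordReader<Text, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new ZipEntryRecordReader();
        }
    }

Point the job at a directory of zips with FileInputFormat.addInputPath()
and you get exactly one mapper per zip file.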

--Bobby Evans

On 3/19/12 7:26 PM, "Andrew McNair" <andrew.mcn...@gmail.com> wrote:

Hi,

I have a large (~300 gig) zip of images that I need to process. My
current workflow is to copy the zip to HDFS, use a custom input format
to read the zip entries, do the processing in the map phase, and then
generate a processing report in the reduce. I'm struggling to tune the
job parameters for my cluster so that everything runs smoothly, but I'm
also worried that I'm missing a better way of processing.
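
For context, the record reader in my input format does roughly the
following (a simplified sketch, not the exact code; the class name and
buffer size are placeholders). It streams the zip off HDFS through a
ZipInputStream and emits one (entry name, entry bytes) pair per image:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class ZipEntryRecordReader extends RecordReader<Text, BytesWritable> {
        private ZipInputStream zip;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext ctx)
                throws IOException {
            Path path = ((FileSplit) split).getPath();
            FileSystem fs = path.getFileSystem(ctx.getConfiguration());
            zip = new ZipInputStream(fs.open(path));  // stream the zip off HDFS
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            ZipEntry entry;
            // skip directory entries; stop at end of archive
            while ((entry = zip.getNextEntry()) != null && entry.isDirectory()) { }
            if (entry == null) {
                return false;
            }
            key.set(entry.getName());
            // buffer the whole entry in memory -- fine for image-sized entries
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[64 * 1024];
            int n;
            while ((n = zip.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            byte[] bytes = buf.toByteArray();
            value.set(bytes, 0, bytes.length);
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return 0.0f; }  // entry count unknown up front
        @Override public void close() throws IOException {
            if (zip != null) {
                zip.close();
            }
        }
    }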

Does anybody have suggestions for how to make the processing of a zip
more parallel? The only other idea I had was converting the zip to a
sequence file on upload, but that proved incredibly slow (~30 hours to
upload on my 3-node cluster).
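
For the record, the conversion I tried was roughly the following (a
sketch, not the exact code; it assumes the zip sits on the local disk
of the machine doing the upload). Everything funnels through a single
writer on one machine, which is presumably a big part of why it was so
slow:

    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Usage: ZipToSequenceFile <local zip> <hdfs output path>
    public class ZipToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
            ZipInputStream zip = new ZipInputStream(new FileInputStream(args[0]));
            byte[] chunk = new byte[64 * 1024];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // one record per image: (entry name, entry bytes)
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                writer.append(new Text(entry.getName()),
                              new BytesWritable(buf.toByteArray()));
            }
            zip.close();
            writer.close();
        }
    }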

Thanks in advance.

-Andrew
