As I understand it, zip isn't a splittable format. You might consider using
bzip2 or another splittable compression format.
Alternatively, you could have one job that does the decompression chained
to another that does the processing to get the parallelization.
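
To make the two-stage idea concrete, here is a rough sketch in plain Python rather than actual Hadoop code. It assumes the archive fits the standard zipfile module and that entries can be written out as individual files; the function names (unpack_stage, process_one, process_stage) are made up for illustration, and process_one is only a placeholder for the real image processing.

```python
import os
import shutil
import zipfile
from concurrent.futures import ThreadPoolExecutor


def unpack_stage(zip_path, out_dir):
    """Stage 1 (serial): walk the zip once and write each entry as its own file."""
    paths = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Flatten the entry name so every entry becomes one independent file.
            target = os.path.join(out_dir, info.filename.replace("/", "_"))
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            paths.append(target)
    return paths


def process_one(path):
    """Hypothetical stand-in for the per-image work; it only reports the file size."""
    return os.path.basename(path), os.path.getsize(path)


def process_stage(paths, workers=4):
    """Stage 2 (parallel): unpacked files no longer depend on each other."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_one, paths))
```

The point is only the shape: the decompression pass is inherently sequential, but once the entries exist as separate inputs the processing pass can fan out across as many workers (or mappers) as you have.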
On Mar 19, 2012 8:26 PM, Andrew
How are you splitting the zip right now? Do you have multiple mappers, where
each mapper starts at the beginning of the zip and reads to the point it cares
about, or do you just have one mapper? If you are doing it the first way, you
may want to increase your replication factor. Alternatively you
Hi,
I have a large (~300 gig) zip of images that I need to process. My
current workflow is to copy the zip to HDFS, use a custom input format
to read the zip entries, do the processing in a map, and then generate
a processing report in the reduce. I'm struggling to tune params right
now with my