Watching a second job with more reduce tasks running, it looks like the in-memory merges are working correctly with compression.

The task I was watching had failed and was running again. It shuffled all the map output files and only started merging after everything was copied, so none was merged in memory; it was closed before the merging started. If it helps, the output files are named intermediate.x and are stored in mapred/local/job-taskname/intermediate.x, while the in-memory merges are stored in mapred/local/taskTracker/jobcache/job-name/taskname/

The uncompressed ones are the intermediate.x files above.

Billy


"Chris Douglas" <chri...@yahoo-inc.com> wrote in message news:9bb78c3a-efab-45c3-8cc3-25aab60df...@yahoo-inc.com...
My problem is that the output from merging the intermediate map output files is not compressed, so I lose all the benefit of compressing the map output to save disk space, because the merged map output files are no longer compressed.

It should still be compressed, unless there's some bizarre regression. More segments will be around simultaneously (since the segments not yet merged are still on disk), which clearly puts pressure on intermediate storage, but if the map outputs are compressed, then the merged map outputs at the reduce must also be compressed. There's no place in the intermediate format to store compression metadata, so either all are or none are. Intermediate merges should also follow the compression spec of the initiating merger (o.a.h.mapred.Merger: 447).

How are you concluding that the intermediate output is compressed from the map, but not in the reduce? -C


----- Original Message -----
From: "Chris Douglas" <chrisdo-ZXvpkYn067l8UrSeD/g...@public.gmane.org>
Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: <core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/jbr...@public.gmane.org>
Sent: Tuesday, March 17, 2009 12:33 AM
Subject: Re: intermediate results not getting compressed


I am running 0.19.1-dev, r744282. I have searched the issues but found nothing about the compression.

AFAIK, there are no open issues that prevent intermediate compression from working. The following might be useful:

http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Data+Compression

Should the intermediate results not be compressed also if the map output files are set to be compressed?

These are controlled by separate options.

FileOutputFormat::setCompressOutput enables/disables compression on the final output
JobConf::setCompressMapOutput enables/disables compression of the intermediate output
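
To make the distinction concrete, here is a minimal sketch of a job setup that turns on both options through the old org.apache.hadoop.mapred API. The class name CompressionSettings and the choice of DefaultCodec are assumptions for illustration only, not something taken from this thread; it needs the Hadoop jars on the classpath, and any available codec could be substituted.

```java
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical class name, used only for this illustration.
public class CompressionSettings {

    static JobConf configure(JobConf conf) {
        // Intermediate (map) output: what is spilled, shuffled and merged.
        conf.setCompressMapOutput(true);
        // Codec choice is an assumption; any CompressionCodec class can be used.
        conf.setMapOutputCompressorClass(DefaultCodec.class);

        // Final job output: controlled separately from the intermediate data.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, DefaultCodec.class);
        return conf;
    }

    public static void main(String[] args) {
        JobConf conf = configure(new JobConf(CompressionSettings.class));
        // Read the settings back to confirm what the job would be submitted with.
        System.out.println("map output compressed:   " + conf.getCompressMapOutput());
        System.out.println("final output compressed: " + FileOutputFormat.getCompressOutput(conf));
    }
}
```

Setting only one of the two leaves the other at its default, which is why compressed final output says nothing about the intermediate files, and vice versa.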

If not then why do we have the map compression option just to save network traffic?

That's part of it. Also to save on disk bandwidth and intermediate space. -C





