Watching a second job with more reduce tasks running, it looks like the
in-memory merges are working correctly with compression.
The task I was watching had failed and was running again. It shuffled all
the map output files and only started merging after everything was copied,
so none was merged in memory; it was closed before the merging started.
If it helps, the output files are named intermediate.x and are stored in
the folder mapred/local/job-taskname/intermediate.x
while the in-memory merges are stored in
mapred/local/taskTracker/jobcache/job-name/taskname/
The non-compressed ones are the intermediate.x files above.
Billy
"Chris Douglas" <chri...@yahoo-inc.com> wrote in
message news:9bb78c3a-efab-45c3-8cc3-25aab60df...@yahoo-inc.com...
My problem is that the output from merging the intermediate map output
files is not compressed, so I lose all the benefit of compressing the map
output to save disk space, because the merged map output files are no
longer compressed.
It should still be compressed, unless there's some bizarre regression.
More segments will be around simultaneously (since the segments not yet
merged are still on disk), which clearly puts pressure on intermediate
storage, but if the map outputs are compressed, then the merged map
outputs at the reduce must also be compressed. There's no place in the
intermediate format to store compression metadata, so either all are or
none are. Intermediate merges should also follow the compression spec of
the initiating merger, too (o.a.h.mapred.Merger: 447).
How are you concluding that the intermediate output is compressed from
the map, but not in the reduce? -C
----- Original Message -----
From: "Chris Douglas" <chrisdo-ZXvpkYn067l8UrSeD/g...@public.gmane.org>
Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: <core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/jbr...@public.gmane.org>
Sent: Tuesday, March 17, 2009 12:33 AM
Subject: Re: intermediate results not getting compressed
I am running 0.19.1-dev, r744282. I have searched the issues but
found nothing about the compression.
AFAIK, there are no open issues that prevent intermediate compression
from working. The following might be useful:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Data+Compression
Should the intermediate results not be compressed also if the map
output files are set to be compressed?
These are controlled by separate options:
FileOutputFormat::setCompressOutput enables/disables compression of the
final output.
JobConf::setCompressMapOutput enables/disables compression of the
intermediate output.
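
For anyone following along, here is a minimal sketch of how the two switches sit side by side as independent job properties. It uses a plain java.util.Properties object as a hypothetical stand-in for JobConf so that it runs without Hadoop on the classpath. The class name CompressionSettings and the method build are made up for illustration, and the two property keys are my assumption about what 0.19-era releases use behind the two setters named above; please check them against the job.xml of your own job.

```java
import java.util.Properties;

// Hypothetical stand-in for JobConf: a plain Properties object holding the
// two independent compression switches discussed in this thread.
final class CompressionSettings {
    // Assumed property keys for 0.19-era job configurations; verify them
    // against your release before relying on them.
    static final String MAP_OUTPUT_KEY = "mapred.compress.map.output";
    static final String FINAL_OUTPUT_KEY = "mapred.output.compress";

    static Properties build(boolean compressMapOutput, boolean compressFinalOutput) {
        Properties props = new Properties();
        // Corresponds to JobConf::setCompressMapOutput (intermediate output).
        props.setProperty(MAP_OUTPUT_KEY, Boolean.toString(compressMapOutput));
        // Corresponds to FileOutputFormat::setCompressOutput (final output).
        props.setProperty(FINAL_OUTPUT_KEY, Boolean.toString(compressFinalOutput));
        return props;
    }

    public static void main(String[] args) {
        // Compress the intermediate map output but leave the final output plain.
        Properties props = build(true, false);
        System.out.println(MAP_OUTPUT_KEY + "=" + props.getProperty(MAP_OUTPUT_KEY));
        System.out.println(FINAL_OUTPUT_KEY + "=" + props.getProperty(FINAL_OUTPUT_KEY));
    }
}
```

Setting one flag does not touch the other, so a job can have compressed intermediate data and an uncompressed final result, or the reverse.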
If not, then why do we have the map compression option? Just to save
network traffic?
That's part of it. Also to save on disk bandwidth and intermediate
space. -C