Merge of compressed RCFile leads to uneven file sizes

Matthias Scherer Tue, 02 Dec 2014 05:17:35 -0800

Hi All,

I am trying to merge gzip-compressed RCFile output into a single file per 
partition. The Hive version is 0.10, and these are my settings:


SET hive.exec.compress.intermediate=true;
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=256000000;

After adding another partition with "INSERT OVERWRITE TABLE ... PARTITION (...) 
SELECT ...", the output of the Hive job (one MapReduce job plus one map-only 
merge job) looks like this:

000000_0             file         8.15 MB
000001_0             file         7.88 MB
000002_0             file         5.2 MB
...
000013_0             file         700.56 KB
000014_0             file         574.59 KB

Why is the largest file more than 10 times bigger than the smallest? Why are 
they sorted by file size in descending order? And why is there not just one 
single file?
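
To put numbers on the imbalance, here is a minimal Python sketch (not part of 
the Hive job) that compares the listed file sizes with each other and with the 
merge thresholds set above. It only uses the five rows shown in the listing, 
since the rows elided by "..." are unknown. It assumes the file browser reports 
binary units (1 KB = 1024 bytes), and all variable names are illustrative.

```python
# File sizes copied from the listing above; rows elided by "..." are not included.
listed_sizes = {
    "000000_0": (8.15, "MB"),
    "000001_0": (7.88, "MB"),
    "000002_0": (5.2, "MB"),
    "000013_0": (700.56, "KB"),
    "000014_0": (574.59, "KB"),
}

# Assumption: the listing uses binary units (1 KB = 1024 bytes).
UNIT_BYTES = {"KB": 1024, "MB": 1024 ** 2}

# Value given to hive.merge.size.per.task and hive.merge.smallfiles.avgsize.
MERGE_THRESHOLD_BYTES = 256_000_000


def to_bytes(value, unit):
    """Convert a (value, unit) pair from the listing into bytes."""
    return value * UNIT_BYTES[unit]


sizes_in_bytes = {name: to_bytes(v, u) for name, (v, u) in listed_sizes.items()}
largest = max(sizes_in_bytes.values())
smallest = min(sizes_in_bytes.values())

# How many times bigger the largest listed file is than the smallest.
ratio = largest / smallest

# How far even the largest file is from the configured merge size.
largest_share_of_threshold = largest / MERGE_THRESHOLD_BYTES

print(f"largest/smallest ratio: {ratio:.1f}")
print(f"largest file as share of merge threshold: {largest_share_of_threshold:.1%}")
```

Under these assumptions the largest listed file is roughly 14.5 times the size 
of the smallest, and even the largest one is only a few percent of the 
256000000-byte merge size.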

I also tested the same table and statement with STORED AS SEQUENCEFILE, and the 
result was a single output file.

Regards
Matthias
