Hi All,

I am trying to merge gzip-compressed RCFile output into a single file per partition. The Hive version is 0.10, and these are my settings:

SET hive.exec.compress.intermediate=true;
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=256000000;

After adding another partition with "INSERT OVERWRITE TABLE ... PARTITION (...) SELECT ...", the output of the Hive job (1 MapReduce job + 1 map-only merge job) looks like this:

000000_0    file    8.15 MB
000001_0    file    7.88 MB
000002_0    file    5.2 MB
...
000013_0    file    700.56 KB
000014_0    file    574.59 KB

Why is the largest file more than 10 times bigger than the smallest? Why are the files sorted by file size in descending order? And why is the output not one single file?

I tested the same table and statement with STORED AS SEQUENCEFILE as well, and the result was one single output file.

Regards
Matthias