I am trying to use the Hive merge options to merge small files into larger files with the query below. It works well except that I cannot control the output file size: I cannot explain why the output files are always 256MB with the hive.merge.size.per.task and hive.merge.smallfiles.avgsize settings shown. I also tried 56MB for hive.merge.size.per.task, and the size is still 256MB.
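For reference, here are the thresholds I am aiming for converted to exact byte counts (plain arithmetic, nothing Hive-specific assumed):

```python
# Byte values for the merge thresholds discussed above.
MB = 1024 * 1024

merge_size_per_task = 56 * MB        # the 56MB I tried for hive.merge.size.per.task
smallfiles_avgsize = 100_000_000     # the avgsize threshold, roughly 95 MiB
observed_output = 256 * 1000 * 1000  # ~256MB, the file size I keep seeing

print(merge_size_per_task)  # 58720256
```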
"omniture_hit" is an uncompressed CSV file format hive table. I want to convert it into RCFile format. The problem is that there will a lot of small RCFiles created which are much smaller than our default block size 128M if I just simple select * and insert into the new table. Another problem is that I want to change hive.io.rcfile.record.size to 8MB to see if there is more compression ratio for my data. But the result seems similar compared with 4MB. The data pattern could be like that as RCFile paper said. But how can I verify if my setting to 8MB works? Thanks. Ben SET hive.exec.compress.output=true; SET hive.exec.compress.intermediate=true; set hive.merge.size.per.task=28*1024*1024; set hive.merge.smallfiles.avgsize=100000000; set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict; SET hive.exec.max.dynamic.partitions.pernode=10000; SET hive.exec.max.dynamic.partitions=10000; SET hive.exec.max.created.files=150000; create table omniture_hit_rc like omniture_hit; insert overwrite table omniture_hit_rc partition (local_dt) select * from omniture_hit where local_dt>='2012-06-01';