I am trying to use the Hive merge options to merge small files into larger
files using the query below. It works well, except that I cannot control
the output file size. I cannot explain why the output files are always
256MB regardless of the hive.merge.size.per.task and
hive.merge.smallfiles.avgsize settings. I also tried 56MB for
hive.merge.size.per.task, and the output files are still 256MB.

"omniture_hit" is a Hive table stored as uncompressed CSV. I want to
convert it to RCFile format. The problem is that a plain select * inserted
into the new table creates a lot of small RCFiles, each much smaller than
our default block size of 128MB.

Another problem is that I want to change hive.io.rcfile.record.size to 8MB
to see whether it gives a better compression ratio for my data. But the
result looks about the same as with 4MB. That may just be the pattern of my
data, as the RCFile paper suggests. But how can I verify that my 8MB setting
actually takes effect?
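
One way to check what the session actually holds, as a minimal sketch: in the Hive CLI, SET followed by a property name and no value prints that property's current value. The name hive.io.rcfile.record.buffer.size is an assumption here (it may be the property the RCFile writer actually reads, and should be checked against the Hive version in use), so both names are queried:

```sql
-- Assumed property name; 8MB written as a plain integer (8 * 1024 * 1024 bytes).
SET hive.io.rcfile.record.buffer.size=8388608;

-- SET with a name and no value prints what the session currently holds.
SET hive.io.rcfile.record.buffer.size;
SET hive.io.rcfile.record.size;
```

This only confirms the value stored in the session. Whether the writer honours it would still need to be judged from the output files.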

Thanks.

Ben

SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate=true;

SET hive.merge.size.per.task=28*1024*1024;
SET hive.merge.smallfiles.avgsize=100000000;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.created.files=150000;

create table omniture_hit_rc like omniture_hit;

insert overwrite table omniture_hit_rc partition (local_dt) select *
from omniture_hit where local_dt>='2012-06-01';
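
A variant of the merge settings, as a sketch under one assumption that should be checked against the Hive version in use: that SET stores its right-hand side as a literal string and does not evaluate arithmetic, so an expression such as 28*1024*1024 would not be read as a number. The size is therefore precomputed (28 * 1024 * 1024 = 29360128 bytes):

```sql
-- Sizes written as plain integers in bytes, with no arithmetic expressions.
SET hive.merge.size.per.task=29360128;        -- 28MB, precomputed
SET hive.merge.smallfiles.avgsize=100000000;  -- unchanged from the script above

-- Print the stored values to confirm what the session holds.
SET hive.merge.size.per.task;
SET hive.merge.smallfiles.avgsize;
```

If the assumption holds, the printed value for hive.merge.size.per.task is the integer itself and not the expression text.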
