I don't think compression is the cause. The more likely problem is that your files can't be merged at all: each file is already bigger than the split size, so the merge stage has nothing to combine.
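If that's the case, one way to check is to raise the split and merge thresholds above your largest output file while leaving compression on. A rough sketch using the same properties yongqiang lists below (the 1GB values are purely illustrative; pick anything larger than your biggest file):

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=1000000000;
set hive.merge.size.per.task=1000000000;
set hive.merge.smallfiles.avgsize=1000000000;

If merging starts working once those are raised, then the merge stage was simply passing through files that already exceeded the old thresholds.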
On Friday, November 19, 2010, Leo Alekseyev <dnqu...@gmail.com> wrote:
> Folks, thanks for your help. I've narrowed the problem down to
> compression. When I set hive.exec.compress.output=false, merges
> proceed as expected. When compression is on, the merge job doesn't
> seem to actually merge, it just spits out the input.
>
> On Fri, Nov 19, 2010 at 10:51 AM, yongqiang he <heyongqiang...@gmail.com>
> wrote:
>> These are the parameters that control the behavior. (Try to set them
>> to different values if it does not work in your environment.)
>>
>> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>> set mapred.min.split.size.per.node=1000000000;
>> set mapred.min.split.size.per.rack=1000000000;
>> set mapred.max.split.size=1000000000;
>>
>> set hive.merge.size.per.task=1000000000;
>> set hive.merge.smallfiles.avgsize=1000000000;
>> set hive.merge.size.smallfiles.avgsize=1000000000;
>> set hive.exec.dynamic.partition.mode=nonstrict;
>>
>> The output size of the second job is also controlled by the split
>> size, as shown in the first 4 lines.
>>
>> On Fri, Nov 19, 2010 at 10:22 AM, Leo Alekseyev <dnqu...@gmail.com> wrote:
>>> I'm using Hadoop 0.20.2. Merge jobs (with static partitions) have
>>> worked for me in the past. Again, what's strange here is with the
>>> latest Hive build the merge stage appears to run, but it doesn't
>>> actually merge -- it's a quick map-only job that, near as I can tell,
>>> doesn't do anything.
>>>
>>> On Fri, Nov 19, 2010 at 6:14 AM, Dave Brondsema <dbronds...@geek.net> wrote:
>>>> What version of Hadoop are you on?
>>>>
>>>> On Thu, Nov 18, 2010 at 10:48 PM, Leo Alekseyev <dnqu...@gmail.com> wrote:
>>>>>
>>>>> I thought I was running Hive with those changes merged in, but to make
>>>>> sure, I built the latest trunk version. The behavior changed somewhat
>>>>> (as in, it runs 2 stages instead of 1), but it still generates the
>>>>> same number of files (# of files generated is equal to the number of
>>>>> the original mappers, so I have no idea what the second stage is
>>>>> actually doing).
>>>>>
>>>>> See below for query / explain query. Stage 1 runs always; Stage 3
>>>>> runs if hive.merge.mapfiles=true is set, but it still generates lots
>>>>> of small files.
>>>>>
>>>>> The query is kind of large, but in essence it's simply
>>>>> insert overwrite table foo partition(bar) select [columns] from
>>>>> [table] tablesample(bucket 1 out of 10000 on rand()) where
>>>>> [conditions].
>>>>>
>>>>> explain insert overwrite table hbase_prefilter3_us_sample partition
>>>>> (ds) select
>>>>> server_host,client_ip,time_stamp,concat(server_host,':',regexp_extract(request_url,'/[^/]+/[^/]+/([^/]+)$',1)),referrer,parse_url(referrer,'HOST'),user_agent,cookie,geoip_int(client_ip,
>>>>> 'COUNTRY_CODE', './GeoIP.dat'),'',ds from alogs_master
>>>>> TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand()) am_s where
>>>>> am_s.ds='2010-11-05' and am_s.request_url rlike
>>>>> '^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$' and
>>>>> geoip_int(am_s.client_ip, 'COUNTRY_CODE', './GeoIP.dat')='US';
>>>>> OK
>>>>> ABSTRACT SYNTAX TREE:
>>>>> (TOK_QUERY (TOK_FROM (TOK_TABREF alogs_master (TOK_TABLESAMPLE 1
>>>>> 10000 (TOK_FUNCTION rand)) am_s)) (TOK_INSERT (TOK_DESTINATION
>>>>> (TOK_TAB hbase_prefilter3_us_sample (TOK_PARTSPEC (TOK_PARTVAL ds))))
>>>>> (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL server_host)) (TOK_SELEXPR
>>>>> (TOK_TABLE_OR_COL client_ip)) (TOK_SELEXPR (TOK_TABLE_OR_COL
>>>>> time_stamp)) (TOK_SELEXPR (TOK_FUNCTION concat (TOK_TABLE_OR_COL
>>>>> server_host) ':' (TOK_FUNCTION regexp_extract (TOK_TABLE_OR_COL
>>>>> request_url) '/[^/]+/[^/]+/([^/]+)$' 1))) (TOK_SELEXPR
>>>>> (TOK_TABLE_OR_COL referrer)) (TOK_SELEXPR (TOK_FUNCTION parse_url
>>>>> (TOK_TABLE_OR_COL referrer) 'HOST')) (TOK_SELEXPR (TOK_TABLE_OR_COL
>>>>> user_agent)) (TOK_SELEXPR (TOK_TABLE_OR_COL cookie)) (TOK_SELEXPR
>>>>> (TOK_FUNCTION geoip_int (TOK_TABLE_OR_COL client_ip) 'COUNTRY_CODE'
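P.S. To see the actual file sizes the merge stage is dealing with, you can list the partition directory directly (the warehouse path below is a guess based on the default layout; adjust for your install):

hadoop fs -ls /user/hive/warehouse/hbase_prefilter3_us_sample/ds=2010-11-05

or equivalently, from the Hive CLI:

dfs -ls /user/hive/warehouse/hbase_prefilter3_us_sample/ds=2010-11-05;

Comparing those sizes against hive.merge.smallfiles.avgsize should show whether the merge job actually has anything left to do.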