I don't think compression is the cause. The more likely problem is that your files can't be merged at all: each file is already bigger than the split size, so the merge stage has nothing to combine.
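If that's the case, one way to check is to raise the split and merge thresholds above your largest output file while leaving compression on. A rough sketch using the same properties yongqiang lists below (the 1GB values are purely illustrative; pick anything larger than your biggest file):

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=1000000000;
set hive.merge.size.per.task=1000000000;
set hive.merge.smallfiles.avgsize=1000000000;

If merging starts working once those are raised, then the merge stage was simply passing through files that already exceeded the old thresholds.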
On Friday, November 19, 2010, Leo Alekseyev <dnqu...@gmail.com> wrote:
> Folks, thanks for your help. I've narrowed the problem down to
> compression. When I set hive.exec.compress.output=false, merges
> proceed as expected. When compression is on, the merge job doesn't
> seem to actually merge, it just spits out the input.
>
> On Fri, Nov 19, 2010 at 10:51 AM, yongqiang he <heyongqiang...@gmail.com>
> wrote:
>> These are the parameters that control the behavior. (Try to set them
>> to different values if it does not work in your environment.)
>>
>> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>> set mapred.min.split.size.per.node=1000000000;
>> set mapred.min.split.size.per.rack=1000000000;
>> set mapred.max.split.size=1000000000;
>>
>> set hive.merge.size.per.task=1000000000;
>> set hive.merge.smallfiles.avgsize=1000000000;
>> set hive.merge.size.smallfiles.avgsize=1000000000;
>> set hive.exec.dynamic.partition.mode=nonstrict;
>>
>> The output size of the second job is also controlled by the split
>> size, as shown in the first 4 lines.
>>
>> On Fri, Nov 19, 2010 at 10:22 AM, Leo Alekseyev <dnqu...@gmail.com> wrote:
>>> I'm using Hadoop 0.20.2. Merge jobs (with static partitions) have
>>> worked for me in the past. Again, what's strange here is with the
>>> latest Hive build the merge stage appears to run, but it doesn't
>>> actually merge -- it's a quick map-only job that, near as I can tell,
>>> doesn't do anything.
>>>
>>> On Fri, Nov 19, 2010 at 6:14 AM, Dave Brondsema <dbronds...@geek.net> wrote:
>>>> What version of Hadoop are you on?
>>>>
>>>> On Thu, Nov 18, 2010 at 10:48 PM, Leo Alekseyev <dnqu...@gmail.com> wrote:
>>>>>
>>>>> I thought I was running Hive with those changes merged in, but to make
>>>>> sure, I built the latest trunk version. The behavior changed somewhat
>>>>> (as in, it runs 2 stages instead of 1), but it still generates the
>>>>> same number of files (# of files generated is equal to the number of
>>>>> the original mappers, so I have no idea what the second stage is
>>>>> actually doing).
>>>>>
>>>>> See below for query / explain query. Stage 1 runs always; Stage 3
>>>>> runs if hive.merge.mapfiles=true is set, but it still generates lots
>>>>> of small files.
>>>>>
>>>>> The query is kind of large, but in essence it's simply
>>>>> insert overwrite table foo partition(bar) select [columns] from
>>>>> [table] tablesample(bucket 1 out of 10000 on rand()) where
>>>>> [conditions].
>>>>>
>>>>> explain insert overwrite table hbase_prefilter3_us_sample partition
>>>>> (ds) select
>>>>> server_host,client_ip,time_stamp,concat(server_host,':',regexp_extract(request_url,'/[^/]+/[^/]+/([^/]+)$',1)),referrer,parse_url(referrer,'HOST'),user_agent,cookie,geoip_int(client_ip,
>>>>> 'COUNTRY_CODE', './GeoIP.dat'),'',ds from alogs_master
>>>>> TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand()) am_s where
>>>>> am_s.ds='2010-11-05' and am_s.request_url rlike
>>>>> '^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$' and
>>>>> geoip_int(am_s.client_ip, 'COUNTRY_CODE', './GeoIP.dat')='US';
>>>>> OK
>>>>> ABSTRACT SYNTAX TREE:
>>>>> (TOK_QUERY (TOK_FROM (TOK_TABREF alogs_master (TOK_TABLESAMPLE 1
>>>>> 10000 (TOK_FUNCTION rand)) am_s)) (TOK_INSERT (TOK_DESTINATION
>>>>> (TOK_TAB hbase_prefilter3_us_sample (TOK_PARTSPEC (TOK_PARTVAL ds))))
>>>>> (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL server_host)) (TOK_SELEXPR
>>>>> (TOK_TABLE_OR_COL client_ip)) (TOK_SELEXPR (TOK_TABLE_OR_COL
>>>>> time_stamp)) (TOK_SELEXPR (TOK_FUNCTION concat (TOK_TABLE_OR_COL
>>>>> server_host) ':' (TOK_FUNCTION regexp_extract (TOK_TABLE_OR_COL
>>>>> request_url) '/[^/]+/[^/]+/([^/]+)$' 1))) (TOK_SELEXPR
>>>>> (TOK_TABLE_OR_COL referrer)) (TOK_SELEXPR (TOK_FUNCTION parse_url
>>>>> (TOK_TABLE_OR_COL referrer) 'HOST')) (TOK_SELEXPR (TOK_TABLE_OR_COL
>>>>> user_agent)) (TOK_SELEXPR (TOK_TABLE_OR_COL cookie)) (TOK_SELEXPR
>>>>> (TOK_FUNCTION geoip_int (TOK_TABLE_OR_COL client_ip) 'COUNTRY_CODE'
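P.S. To see the actual file sizes the merge stage is dealing with, you can list the partition directory directly (the warehouse path below is a guess based on the default layout; adjust for your install):

hadoop fs -ls /user/hive/warehouse/hbase_prefilter3_us_sample/ds=2010-11-05

or equivalently, from the Hive CLI:

dfs -ls /user/hive/warehouse/hbase_prefilter3_us_sample/ds=2010-11-05;

Comparing those sizes against hive.merge.smallfiles.avgsize should show whether the merge job actually has anything left to do.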