You can reduce mapreduce.input.fileinputformat.split.maxsize to increase
the number of mappers: a smaller maximum split size means more splits.
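
For example, a minimal sketch from the Hive session (the 64 MB value is
only an illustration, not a recommendation; choose one relative to your
file and block sizes):

```sql
-- Cap each input split at 64 MB (value in bytes) so that more splits,
-- and therefore more mappers, are created for splittable input.
set mapreduce.input.fileinputformat.split.maxsize=67108864;
-- On older Hadoop versions the same setting is named mapred.max.split.size.
```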

However, as David alluded to, your issue is likely due to the compression.
Depending on how your files are organized and compressed, Hadoop may not be
able to split them to feed several mappers:
https://cwiki.apache.org/confluence/display/Hive/CompressedStorage
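
To check which case applies, one option is to look at the file extensions
under the table's storage location. A sketch, where the path is a
placeholder for whatever location is reported for your table. Plain gzip
(.gz) text files cannot be split, so each file goes to a single mapper,
whereas bzip2 (.bz2) files can be split:

```sql
-- Show the table's storage location (look for the "Location:" field).
describe formatted src_click;
-- List the data files from the Hive CLI; replace the placeholder path
-- with the location reported above.
dfs -ls /path/to/src_click;
```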

Stephane

On Sun, Dec 7, 2014 at 10:50 PM, david1990...@163.com <david1990...@163.com>
wrote:

> You mentioned that 'The dataformat in src_click is
> org.apache.hadoop.mapred.TextInputFormat
> and has been compressed.'
> May I ask which compression codec you use?
>
>
> *From:* 老赵 <laozh...@sina.cn>
> *Sent:* 2014-12-08 13:12
> *To:* user <user@hive.apache.org>
> *Subject:* hive sql tune
>
> Hello,
>
> I work for a telecommunications service provider, so I can
> access the view logs of different users from a specific area.
>
> Now I want to query the top 1000 sites by page views (PV).
>
> I wrote a UDF named parse_top_domain to get the top domain of a
> host, e.g. www1.google.com.hk -> google.com.hk,
>
> and I use the HQL below:
>
> add jar hive_func.jar;
>
> create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';
>
> select parse_top_domain(parse_url(url,'HOST')),count(*) c from
>
> src_click
>
> where date = 20141204
>
> and parse_top_domain(parse_url(url,'HOST')) !=''
>
> group by parse_top_domain(parse_url(url,'HOST'))
>
> order by c desc;
>
> The dataformat in src_click is org.apache.hadoop.mapred.TextInputFormat
> and has been compressed.
>
> This HQL generates 8 mappers and 1 reducer; because the data is very
> big, it is very slow.
>
> I hoped it could generate many more mappers, so I set this: set
> mapred.map.tasks=100;
>
> But this has no effect.
>
> Can anyone help me or give some suggestions?
>
> Any reply is appreciated.
>
> --------------------------------
>
> ZHAO
>
> laozh...@sina.cn
>
>
