Hello,

I work for a telecommunications service provider, so I have access to the page-view logs of users in a specific area. I want to query the top 1000 sites by PV (page views).

I wrote a UDF named parse_top_domain that extracts the top domain of a host, e.g. www1.google.com.hk -> google.com.hk, and I use the following HQL:

    add jar hive_func.jar;
    create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';

    select parse_top_domain(parse_url(url, 'HOST')), count(*) c
    from src_click
    where date = 20141204
      and parse_top_domain(parse_url(url, 'HOST')) != ''
    group by parse_top_domain(parse_url(url, 'HOST'))
    order by c desc;

The input format of src_click is org.apache.hadoop.mapred.TextInputFormat, and the data is compressed.

This query generates 8 mappers and 1 reducer. Because the data is very large, it runs very slowly. I hoped to get many more mappers, so I set:

    set mapred.map.tasks=100;

But this has no effect. Can anyone help me or give some suggestions? Any reply is appreciated.

--------------------------------
zhaolaozh...@sina.cn