Hello,

I work for a telecommunication service provider, so I can access the view logs of users from a specific area. I want to query the top 1000 PV sites. I wrote a UDF named parse_top_domain that extracts the top-level domain of a host (e.g. www1.google.com.hk -> google.com.hk), and I use the HQL below:

    add jar hive_func.jar;
    create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';

    select parse_top_domain(parse_url(url, 'HOST')), count(*) c
    from src_click
    where date = 20141204
      and parse_top_domain(parse_url(url, 'HOST')) != ''
    group by parse_top_domain(parse_url(url, 'HOST'))
    order by c desc;
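For anyone unfamiliar with Hive UDFs, a minimal sketch of what such a function might look like is below. This is illustrative only, not the actual com.xxx.GetTopLevelDomain; the simplified logic just keeps the last three labels of the host, and a real implementation would need a public-suffix list to handle all domains correctly.

    package com.xxx;

    import org.apache.hadoop.hive.ql.exec.UDF;

    // Illustrative sketch only: the real GetTopLevelDomain is not shown in this mail.
    // This naive version keeps the last three labels of the host
    // (www1.google.com.hk -> google.com.hk); a proper implementation would
    // consult a public-suffix list so that e.g. www.example.com -> example.com.
    public class GetTopLevelDomain extends UDF {
        public String evaluate(String host) {
            if (host == null || host.isEmpty()) {
                return "";
            }
            String[] labels = host.split("\\.");
            int keep = Math.min(3, labels.length);
            StringBuilder sb = new StringBuilder();
            for (int i = labels.length - keep; i < labels.length; i++) {
                if (sb.length() > 0) {
                    sb.append('.');
                }
                sb.append(labels[i]);
            }
            return sb.toString();
        }
    }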
The input format of src_click is org.apache.hadoop.mapred.TextInputFormat and the data is compressed. This HQL generates only 8 mappers and 1 reducer; because the data is very big, it runs very slowly. I hoped it would generate many more mappers, so I set:

    set mapred.map.tasks=100;

But this has no effect. Can anyone help me or give some suggestions? Any reply is appreciated.

--------------------------------
zhaolaozh...@sina.cn
