You mentioned that 'The data format in src_click is 
org.apache.hadoop.mapred.TextInputFormat and has been compressed.'
May I ask which compression codec you use?
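If the files turn out to be gzip-compressed (an assumption; the original message does not say), that would explain the mapper count: gzip text files cannot be split, so each file gets exactly one mapper, and mapred.map.tasks is only a hint that cannot override the number of input splits. In that case, storing src_click with a splittable codec such as bzip2, or as uncompressed text, would allow more mappers. Independently of the codec, here is a rough sketch of what could be tried on the query side. The reducer count of 32 is an arbitrary illustrative value, and the names t and top_domain are aliases introduced here, not part of the original query.

```sql
add jar hive_func.jar;
create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';

-- Ask for more reducers in the GROUP BY stage (32 is only an example value).
set mapred.reduce.tasks=32;

-- Compute the top domain once in a subquery, then aggregate on the alias.
select top_domain, count(*) as c
from (
  select parse_top_domain(parse_url(url, 'HOST')) as top_domain
  from src_click
  where date = 20141204
) t
where top_domain != ''
group by top_domain
order by c desc
limit 1000;  -- the stated goal is the top 1000 sites
```

The final ORDER BY stage still runs on a single reducer, but with the LIMIT it only has to sort the already aggregated per-domain counts, which should be far smaller than the raw logs.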
 
From: 老赵
Sent: 2014-12-08 13:12
To: user
Subject: hive sql tune
Hello,
I work for a telecommunications service provider, so I can access 
the view logs of different users from a specific area.
Now I want to query the top 1000 sites by PV (page views).
I wrote a UDF named parse_top_domain to get the top domain of a host, e.g. 
www1.google.com.hk -> google.com.hk
and I use the HQL below:
add jar hive_func.jar;
create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';
select parse_top_domain(parse_url(url,'HOST')), count(*) c
from src_click
where date = 20141204
and parse_top_domain(parse_url(url,'HOST')) !=''
group by parse_top_domain(parse_url(url,'HOST'))
order by c desc;
The data format in src_click is org.apache.hadoop.mapred.TextInputFormat and 
the data has been compressed.
This HQL generates 8 mappers and 1 reducer; because the data is very big, it 
is very slow.
I hoped it would generate many more mappers, so I set this:
set mapred.map.tasks=100;
But this has no effect.
Can anyone help me or give some suggestions?
Any reply is appreciated.
--------------------------------
ZHAO
laozh...@sina.cn
