Re: hive sql tune

2014-12-29 Thread Stéphane Verlet
You can reduce mapreduce.input.fileinputformat.split.maxsize to increase
the number of mappers (more splits).
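
For example, something along these lines before running the query (the 256 MB
figure is only an illustration, and mapred.max.split.size is the older name of
the same limit that some Hadoop/Hive versions still read):

-- shrink the maximum split size so more mappers are created (value is in bytes; 256 MB is just an example)
set mapreduce.input.fileinputformat.split.maxsize=268435456;
-- older property name for the same limit, in case your Hadoop/Hive version still uses it
set mapred.max.split.size=268435456;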

However, your issue is likely due, as David alluded, to the compression.
Depending on how your files are organized and compressed, Hadoop may not be
able to split them to feed several mappers:
https://cwiki.apache.org/confluence/display/Hive/CompressedStorage
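
If the files turn out to be plain gzip-compressed text (which is not
splittable), one workaround described on that page is to rewrite the data into
a block-compressed SequenceFile copy and query that instead. A rough sketch,
where the table name src_click_seq and the Snappy codec are only placeholders:

-- compress the output, but in BLOCK mode so the SequenceFile stays splittable
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- materialize a splittable copy of the partition, then run the aggregation on it
CREATE TABLE src_click_seq STORED AS SEQUENCEFILE
AS SELECT * FROM src_click WHERE date = 20141204;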

Stephane

On Sun, Dec 7, 2014 at 10:50 PM, david1990...@163.com
wrote:

 You mentioned that 'The data format in src_click is
 org.apache.hadoop.mapred.TextInputFormat and has been compressed.'
 Can I ask which compression codec you use?


 *From:* 老赵 laozh...@sina.cn
 *Sent:* 2014-12-08 13:12
 *To:* user user@hive.apache.org
 *Subject:* hive sql tune

 Hello,

 I am working for a telecommunications service provider company, so I can
 access the view logs of different users from a specific area.

 Now I want to query the top 1000 PV sites.

 I wrote a UDF named parse_top_domain to get the top-level domain of a
 host, e.g. www1.google.com.hk -> google.com.hk,

 and I use the HQL below:

 add jar hive_func.jar;
 create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';

 select parse_top_domain(parse_url(url,'HOST')), count(*) c
 from src_click
 where date = 20141204
 and parse_top_domain(parse_url(url,'HOST')) != ''
 group by parse_top_domain(parse_url(url,'HOST'))
 order by c desc;

 The data format in src_click is org.apache.hadoop.mapred.TextInputFormat
 and has been compressed.

 This HQL generates 8 mappers and 1 reducer; since the data is very big,
 it is very slow.

 I hope it can generate many more mappers, so I set this: set
 mapred.map.tasks=100;

 But this has no effect.

 Can anyone help me or give some suggestions?

 Any reply is appreciated.

 

 ZHAO

 laozh...@sina.cn




hive sql tune

2014-12-07 Thread 老赵
Hello,

I am working for a telecommunications service provider company, so I can
access the view logs of different users from a specific area.

Now I want to query the top 1000 PV sites.

I wrote a UDF named parse_top_domain to get the top-level domain of a host,
e.g. www1.google.com.hk -> google.com.hk, and I use the HQL below:

add jar hive_func.jar;
create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';

select parse_top_domain(parse_url(url,'HOST')), count(*) c
from src_click
where date = 20141204
and parse_top_domain(parse_url(url,'HOST')) != ''
group by parse_top_domain(parse_url(url,'HOST'))
order by c desc;

The data format in src_click is org.apache.hadoop.mapred.TextInputFormat and
has been compressed.

This HQL generates 8 mappers and 1 reducer; since the data is very big, it is
very slow.

I hope it can generate many more mappers, so I set this: set
mapred.map.tasks=100;
But this has no effect.

Can anyone help me or give some suggestions? Any reply is appreciated.

ZHAO
laozh...@sina.cn


Re: hive sql tune

2014-12-07 Thread david1990...@163.com
You mentioned that 'The data format in src_click is
org.apache.hadoop.mapred.TextInputFormat and has been compressed.'
Can I ask which compression codec you use?
 
From: 老赵
Sent: 2014-12-08 13:12
To: user
Subject: hive sql tune
Hello,
I am working for a telecommunications service provider company, so I can access
the view logs of different users from a specific area.
Now I want to query the top 1000 PV sites.
I wrote a UDF named parse_top_domain to get the top-level domain of a host,
e.g. www1.google.com.hk -> google.com.hk,
and I use the HQL below:
add jar hive_func.jar;
create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';
select parse_top_domain(parse_url(url,'HOST')), count(*) c
from src_click
where date = 20141204
and parse_top_domain(parse_url(url,'HOST')) != ''
group by parse_top_domain(parse_url(url,'HOST'))
order by c desc;
The data format in src_click is org.apache.hadoop.mapred.TextInputFormat and has
been compressed.
This HQL generates 8 mappers and 1 reducer; since the data is very big, it is
very slow.
I hope it can generate many more mappers, so I set this: set
mapred.map.tasks=100;
But this has no effect.
Can anyone help me or give some suggestions?
Any reply is appreciated.

ZHAO
laozh...@sina.cn