Re: hive sql tune
You can reduce mapreduce.input.fileinputformat.split.maxsize to increase the number of mappers (more splits). However, as David alluded, your issue is likely due to the compression. Depending on how your files are organized and compressed, Hadoop may not be able to split them to feed several mappers:
https://cwiki.apache.org/confluence/display/Hive/CompressedStorage

Stephane

On Sun, Dec 7, 2014 at 10:50 PM, david1990...@163.com wrote:

> You mentioned that 'The data format in src_click is
> org.apache.hadoop.mapred.TextInputFormat and has been compressed.'
> May I ask which compression codec you use?
>
> From: 老赵
> Sent: 2014-12-08 13:12
> To: user
> Subject: hive sql tune
>
> Hello,
>
> I work for a telecommunications service provider, so I can access the
> view logs of different users from a specific area.
>
> Now I want to query the top 1000 sites by PV.
>
> I wrote a UDF named parse_top_domain to get the top domain of a host,
> e.g. www1.google.com.hk -> google.com.hk
>
> and I use the HQL below:
>
> add jar hive_func.jar;
> create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';
> select parse_top_domain(parse_url(url,'HOST')),count(*) c from src_click
> where date = 20141204
> and parse_top_domain(parse_url(url,'HOST')) !=''
> group by parse_top_domain(parse_url(url,'HOST'))
> order by c desc;
>
> The data format in src_click is org.apache.hadoop.mapred.TextInputFormat
> and has been compressed.
>
> This HQL generates 8 mappers and 1 reducer; because the data is very
> big, it is very slow.
>
> I hoped it would generate many more mappers, so I set this:
>
> set mapred.map.tasks=100;
>
> But this has no effect.
>
> Can anyone help me or give some suggestions?
>
> Any reply is appreciated.
>
> ZHAO
> laozh...@sina.cn
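A minimal sketch of the split-size suggestion in Stephane's reply above, as Hive session settings. The 64 MB value is only an illustrative choice, not a recommendation from the thread, and the settings only help if the input can be split at all: gzip files cannot be split, so each one is read by a single mapper whatever the split size is. mapred.map.tasks is only a hint to the input format, which may be why setting it to 100 changed nothing.

```sql
-- Find the table's storage location and format first, then inspect the files
-- there to see which codec was used (for example, a .gz extension means gzip).
describe formatted src_click;

-- Assumption: the files are in a splittable form (uncompressed text, or bzip2).
-- Cap each input split at 64 MB; the value is in bytes (64 * 1024 * 1024).
set mapreduce.input.fileinputformat.split.maxsize=67108864;

-- Hadoop 1.x releases use the older name for the same setting.
set mapred.max.split.size=67108864;
```

If the files turn out to be gzip, the mapper count is bounded by the number of files, and the data would need to be rewritten in a splittable form before a smaller split size has any effect.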
Re: hive sql tune
You mentioned that 'The data format in src_click is org.apache.hadoop.mapred.TextInputFormat and has been compressed.'
May I ask which compression codec you use?

From: 老赵
Sent: 2014-12-08 13:12
To: user
Subject: hive sql tune

Hello,

I work for a telecommunications service provider, so I can access the view logs of different users from a specific area.

Now I want to query the top 1000 sites by PV.

I wrote a UDF named parse_top_domain to get the top domain of a host, e.g. www1.google.com.hk -> google.com.hk

and I use the HQL below:

add jar hive_func.jar;
create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';
select parse_top_domain(parse_url(url,'HOST')),count(*) c from src_click
where date = 20141204
and parse_top_domain(parse_url(url,'HOST')) !=''
group by parse_top_domain(parse_url(url,'HOST'))
order by c desc;

The data format in src_click is org.apache.hadoop.mapred.TextInputFormat and has been compressed.

This HQL generates 8 mappers and 1 reducer; because the data is very big, it is very slow.

I hoped it would generate many more mappers, so I set this:

set mapred.map.tasks=100;

But this has no effect.

Can anyone help me or give some suggestions?

Any reply is appreciated.

ZHAO
laozh...@sina.cn
hive sql tune
Hello,

I work for a telecommunications service provider, so I can access the view logs of different users from a specific area.

Now I want to query the top 1000 sites by PV.

I wrote a UDF named parse_top_domain to get the top domain of a host, e.g. www1.google.com.hk -> google.com.hk

and I use the HQL below:

add jar hive_func.jar;
create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';
select parse_top_domain(parse_url(url,'HOST')),count(*) c from src_click
where date = 20141204
and parse_top_domain(parse_url(url,'HOST')) !=''
group by parse_top_domain(parse_url(url,'HOST'))
order by c desc;

The data format in src_click is org.apache.hadoop.mapred.TextInputFormat and has been compressed.

This HQL generates 8 mappers and 1 reducer; because the data is very big, it is very slow.

I hoped it would generate many more mappers, so I set this:

set mapred.map.tasks=100;

But this has no effect.

Can anyone help me or give some suggestions?

Any reply is appreciated.

zhao
laozh...@sina.cn