Very good input not to sent the "original xml" over to the reducers. For
the JobConf.setNumReduceTasks(n) isn't that just a hint but the real
number will be determined based on the Partitioner I use, which will be
the default HashPartioner? One other thought I had, what will happen if
the values list sent to a single Reducer is to big to fit into memory?

/Reik



Gang Luo wrote:
> HI,
> you can control the number of reducers by JobConf.setNumReduceTasks(n). The 
> number of mappers is defined by (file size) / (split size). By default the 
> split size is 64MB. Since you dataset is not very large, there should be no 
> big difference if you change these. 
>
> if you are only interested in the number of blocks per email address, you 
> don't need to send the "original xml" as the value in the intermediate 
> result. This can reduce the amount of data sent from mappers to reducers. Use 
> combiner to pre-aggregate the data may also help.
>
> -Gang
>
>
>
>
> ----- 原始邮件 ----
> 发件人: Reik Schatz <reik.sch...@bwin.org>
> 收件人: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> 发送日期: 2010/3/17 (周三) 5:04:33 上午
> 主   题: optimization help needed
>
> Preparing a Hadoop presentation here. For demonstration I start up a 5 
> machine m1.large cluster in EC2 via cloudera scripts ($hadoop-ec2 
> launch-cluster my-hadoop-cluster 5). Then I sent a 500 MB xml file over into 
> HDFS. The Mapper will receive a XML block as the key, select a email address 
> from the xml and use this as the key for the reducer and the orginal xml as 
> the value. The Reducer just aggregates the number of XML blocks per email 
> address.
>
> Running this on the cluster takes about 2:30 min. The frameworks uses 8 
> Mappers (Spills) and 2 Reducers. About 600.000 xml elements are contained in 
> the file. How can I speed up processing time? One thing I can think of, is to 
> have more than just 2 email addresses in the sample document to be able to 
> use more than 2 reducers in parallel. Why did the framework choose to use 8 
> mappers and not more? Maybe my sample data is too small to benefit from 
> parallel processing. 
> Thanks in advance
>
>
>
>       
>   

-- 

*Reik Schatz*
Technical Lead, Platform
P: +46 8 562 470 00
M: +46 76 25 29 872
F: +46 8 562 470 01
E: reik.sch...@bwin.org <mailto:reik.sch...@bwin.org>
*/bwin/* Games AB
Klarabergsviadukten 82,
111 64 Stockholm, Sweden

[This e-mail may contain confidential and/or privileged information. If
you are not the intended recipient (or have received this e-mail in
error) please notify the sender immediately and destroy this e-mail. Any
unauthorised copying, disclosure or distribution of the material in this
e-mail is strictly forbidden.]

Reply via email to