Very good point about not sending the "original xml" over to the reducers. As for JobConf.setNumReduceTasks(n), isn't that just a hint, with the real number determined by the Partitioner I use, which will be the default HashPartitioner? One other thought I had: what will happen if the values list sent to a single Reducer is too big to fit into memory?
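On the partitioning question, a rough sketch may help. As far as the JobConf documentation goes, setNumReduceTasks(n) fixes the number of reduce tasks (it is setNumMapTasks that is only a hint); the Partitioner then decides which of those n reducers each key goes to. The plain-Java sketch below mimics the arithmetic of HashPartitioner without using any Hadoop classes. The class name, method names, and sample addresses are made up for illustration, and String.hashCode() stands in for the hash of the real key type (Text hashes differently), so the concrete partition numbers are not meaningful.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java sketch; no Hadoop classes are used here.
class PartitionSketch {

    // Same arithmetic as HashPartitioner.getPartition:
    // clear the sign bit, then take the remainder modulo the reducer count.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Returns the indices of the reducers that receive at least one key.
    static Set<Integer> busyReducers(Iterable<String> keys, int numReduceTasks) {
        Set<Integer> busy = new TreeSet<>();
        for (String key : keys) {
            busy.add(partitionFor(key, numReduceTasks));
        }
        return busy;
    }

    public static void main(String[] args) {
        // Hypothetical sample addresses, not taken from the real data set.
        List<String> emails = Arrays.asList("alice@example.com", "bob@example.com");
        int numReduceTasks = 5;
        Set<Integer> busy = busyReducers(emails, numReduceTasks);
        System.out.println(busy.size() + " of " + numReduceTasks + " reducers receive data");
    }
}
```

With only two distinct keys, at most two of the n reducers get any input, however large n is; the remaining reducers still run but produce empty output.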
/Reik

Gang Luo wrote:
> Hi,
> you can control the number of reducers by JobConf.setNumReduceTasks(n). The
> number of mappers is defined by (file size) / (split size). By default the
> split size is 64MB. Since your dataset is not very large, there should be no
> big difference if you change these.
>
> If you are only interested in the number of blocks per email address, you
> don't need to send the "original xml" as the value in the intermediate
> result. This can reduce the amount of data sent from mappers to reducers.
> Using a combiner to pre-aggregate the data may also help.
>
> -Gang
>
> ----- Original Message ----
> From: Reik Schatz <reik.sch...@bwin.org>
> To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> Sent: 2010/3/17 (Wed) 5:04:33 AM
> Subject: optimization help needed
>
> Preparing a Hadoop presentation here. For demonstration I start up a 5
> machine m1.large cluster in EC2 via cloudera scripts ($hadoop-ec2
> launch-cluster my-hadoop-cluster 5). Then I send a 500 MB xml file over into
> HDFS. The Mapper will receive an XML block as the key, select an email address
> from the xml and use this as the key for the reducer and the original xml as
> the value. The Reducer just aggregates the number of XML blocks per email
> address.
>
> Running this on the cluster takes about 2:30 min. The framework uses 8
> Mappers (Spills) and 2 Reducers. About 600,000 xml elements are contained in
> the file. How can I speed up processing time? One thing I can think of is to
> have more than just 2 email addresses in the sample document to be able to
> use more than 2 reducers in parallel. Why did the framework choose to use 8
> mappers and not more? Maybe my sample data is too small to benefit from
> parallel processing.
>
> Thanks in advance

-- 
Reik Schatz
Technical Lead, Platform
P: +46 8 562 470 00
M: +46 76 25 29 872
F: +46 8 562 470 01
E: reik.sch...@bwin.org
bwin Games AB
Klarabergsviadukten 82, 111 64 Stockholm, Sweden
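
To make the numbers in this thread concrete, here is a minimal plain-Java sketch of two points from Gang's reply: the mapper count as (file size) / (split size), and combiner-style pre-aggregation that ships a small count per email address instead of the original xml. It is only an illustration under stated assumptions: no Hadoop classes are used, the class and method names are hypothetical, the split count is approximated by rounding up, and the sample addresses are invented.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch; no Hadoop classes are used here.
class SplitAndCombineSketch {

    // Approximate number of input splits (and so of map tasks):
    // file size divided by split size, rounded up.
    static long numSplits(long fileBytes, long splitBytes) {
        return (fileBytes + splitBytes - 1) / splitBytes;
    }

    // Combiner-style pre-aggregation on the map side: collapse the email
    // addresses seen by one mapper into one (address, count) pair each,
    // so only small counts travel to the reducers.
    static Map<String, Long> preAggregate(List<String> emailsSeenByOneMapper) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String email : emailsSeenByOneMapper) {
            counts.merge(email, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 500 MB input with the default 64 MB split size.
        System.out.println(numSplits(500 * mb, 64 * mb)); // prints 8

        // Hypothetical addresses extracted by a single mapper.
        List<String> seen = Arrays.asList(
                "a@example.com", "b@example.com", "a@example.com", "a@example.com");
        System.out.println(preAggregate(seen)); // prints {a@example.com=3, b@example.com=1}
    }
}
```

The first result matches the 8 mappers reported for the 500 MB file. The second shows why the combiner helps here: four intermediate records shrink to two before anything is sent over the network.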