On Wed, May 6, 2009 at 5:29 PM, Todd Lipcon <t...@cloudera.com> wrote:
> Hi Tiago,
Hi there.

First of all, sorry for the late reply --- I was investigating the issue further before replying.

Just to make the whole thing clear(er), let me add some numbers and explain my problem. I have a ~80 GB sequence file holding entries for 3 million users, recording how many times those users accessed approx. 160 million distinct resources. In total there were approx. 5000 million accesses to those resources. Each entry in the sequence file is encoded using Google Protocol Buffers and compressed with gzip (don't ask...).

I have to extract some metrics from this data. To start, I chose to rank resources by number of accesses. I thought this would be a pretty simple MR application to write and run --- a beefed-up WordCount, if I may:

    Map:
        for each user:
            for each resource_id accessed by the current user:
                # resource.id is a LongWritable
                collect(resource.id, 1)

Combine and Reduce work just as in WordCount, except that keys and values are both LongWritables. The final step to calculate the ranking --- sorting the resources by their accumulated access counts --- is done with the Unix sort command. Nothing really fancy here.

This MapReduce job took approx. 2 hours to run --- I said 4 hours in the previous e-mail, sorry :-) IMBW, but that seems like quite a long time to compute a ranking. I coded a similar application in a filter-stream framework and it took less than half an hour to run --- even with most of its data being read over the network.

So, what I was wondering is: what am I doing wrong?
- Is it just a matter of fine-tuning my Hadoop cluster setup?
- This is a valid MR application, right?
- Is it just that I have too few "IO units"? I'm using 4 DataNodes and 3 TaskTrackers (dual octa-core machines).

Now, back to our regular programming...

> Here are a couple thoughts:
>
> 1) How much data are you outputting? Obviously there is a certain amount of
> IO involved in actually outputting data versus not ;-)

Well, the map phase is outputting 180 GB of data for approx. 1000 million intermediate keys. I know it is going to take some time to write this much data to disk, but still... this much time?

> 2) Are you using a reduce phase in this job? If so, since you're cutting off
> the data at map output time, you're also avoiding a whole sort computation
> which involves significant network IO, etc.

Yes, both Combine and Reduce phases. From the original 5000 million records, the combine outputs 1000 million and the reduce ends up outputting the (expected) 160 million keys.

> 3) What version of Hadoop are you running?

0.18.3

Cheers.

Tiago Alves Macambira
--
"I may be drunk, but in the morning I will be sober, while you will still be stupid and ugly." -Winston Churchill
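[For readers unfamiliar with the job structure discussed above: the map/combine/reduce flow can be sketched as a toy in-memory simulation in plain Python. This is not the actual Hadoop job (which is Java against SequenceFiles); the user/resource sample data is made up for illustration.]

```python
from collections import defaultdict

def map_phase(user_accesses):
    """Map: emit (resource_id, 1) for every access.
    user_accesses: dict mapping user -> list of resource_ids accessed."""
    for user, resources in user_accesses.items():
        for resource_id in resources:
            yield (resource_id, 1)

def combine(pairs):
    """Combiner: pre-aggregate counts within one map task's output,
    shrinking the intermediate data before the shuffle."""
    partial = defaultdict(int)
    for resource_id, count in pairs:
        partial[resource_id] += count
    return partial.items()

def reduce_phase(combined_outputs):
    """Reduce: sum the partial counts from all combiners."""
    totals = defaultdict(int)
    for pairs in combined_outputs:
        for resource_id, count in pairs:
            totals[resource_id] += count
    return totals

# Two "map tasks", each followed by its own combiner run (made-up data).
task1 = {"alice": [1, 2, 2], "bob": [2]}
task2 = {"carol": [1, 3]}
combined = [combine(map_phase(t)) for t in (task1, task2)]
totals = reduce_phase(combined)

# Final ranking step, done with Unix sort in the original setup:
# order resources by accumulated access count, descending.
ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)  # -> [(2, 3), (1, 2), (3, 1)]
```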