I dropped my webpage database and restarted with 5 seed URLs. The first fetch completed in a few seconds. The second run still shows 1 reduce running; although it shows as 100% complete, my thought is that it is writing out to disk, though it has now been 30+ minutes. Again, I had 80 reducers, and when I look at the logs of these reducers in the Hadoop JobTracker, I see

    0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues

in all of them, which leads me to think that the completed 79 reducers actually fetched nothing, which might explain why this 1 stuck reducer is working so hard. This may be expected, since I am crawling a single domain. This one reducer's log on the JobTracker, however, is empty. Don't know what to make of that.
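For what it's worth, here is a back-of-the-envelope sketch of why the skew itself may be expected, assuming the documented Nutch defaults (not verified against this particular build): fetch lists are partitioned by host, so a single-domain crawl puts every URL into one reducer's queue, and the politeness settings then serialize the requests to that host:

    # rough arithmetic, assuming default politeness settings:
    #   fetcher.server.delay      = 5.0  (seconds between requests to one host)
    #   fetcher.threads.per.queue = 1    (threads allowed per host queue)
    # one host queue => about 1 page every 5 seconds, so a fetch list of
    # ~5000 URLs would keep that single reducer busy for ~25000 s (~7 hours)
    #
    # the values actually in effect can be checked with, e.g.:
    grep -A 1 'fetcher.server.delay\|fetcher.threads.per.queue' \
        conf/nutch-default.xml conf/nutch-site.xml

So the one busy reducer may simply be rate-limited rather than hung; its empty task log would still be odd, though.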
On Tue, Jul 2, 2013 at 4:15 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> Hi,
>
> On Tue, Jul 2, 2013 at 3:53 PM, h b <hb6...@gmail.com> wrote:
>
> > So, I tried this with the generate.max.count property set to 5000,
> > rebuilt with ant; ant jar; ant job, and reran fetch.
> > It still appears the same: the first 79 reducers zip through and the
> > last one is crawling, literally...
>
> Sorry, I should have been more explicit. This property does not directly
> affect fetching. It is used when GENERATING fetch lists. Meaning that it
> needs to be present and acknowledged at the generate phase... before
> fetching is executed.
> Besides this, is there any progress being made at all on the last
> reduce? If you look at your CPU (and heap) for the box this is running
> on, it is usual to notice high levels for both of these respectively.
> Maybe this output writer is just taking a good while to write data down
> to HDFS... assuming you are using 1.x.
>
> > As for the logs, I mentioned in one of my earlier threads that when I
> > run from the deploy directory, I am not getting any logs generated.
> > I looked for the logs directory under local as well as under deploy,
> > and, just to make sure, also on the grid. I do not see the logs
> > directory. So I created it manually under deploy before starting
> > fetch, and still there is nothing in this directory.
>
> OK, so when you run Nutch as a deployed job, your logs are present
> within $HADOOP_LOG_DIR... You can check some logs on the JobTracker
> WebApp, e.g. you will be able to see the reduce tasks for the fetch job
> and you will also be able to see varying snippets or all of the log
> there.
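One more note on the empty-logs confusion, in case it helps: in deploy mode Nutch's log4j output goes through Hadoop's task logs on the TaskTracker nodes, not to a local logs/ directory, which would explain why a manually created deploy/logs stays empty. On a Hadoop 1.x node the per-attempt logs usually sit under $HADOOP_LOG_DIR/userlogs (the job and attempt IDs below are made up for illustration):

    ls $HADOOP_LOG_DIR/userlogs/
    # one directory per job, one per task attempt inside it, e.g.
    #   userlogs/job_201307021615_0007/attempt_201307021615_0007_r_000079_0/
    # each attempt directory holds stdout, stderr and syslog; the log4j
    # output lands in syslog:
    tail -f $HADOOP_LOG_DIR/userlogs/job_*/attempt_*_r_*/syslog

These are the same files the JobTracker WebApp (default port 50030) shows when you drill down into a job's reduce tasks.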