Spoke too soon: the fetch completed in 21 minutes.
On Wed, Jul 3, 2013 at 8:32 AM, h b <hb6...@gmail.com> wrote:

> oh and yes, generate.max.count is set to 5000
>
> On Wed, Jul 3, 2013 at 8:29 AM, h b <hb6...@gmail.com> wrote:
>
>> I dropped my webpage database and restarted with 5 seed URLs. The first
>> fetch completed in a few seconds. The second run still shows 1 reduce
>> running, although it shows as 100% complete, so my thought is that it is
>> writing out to disk, though it has been about 30+ minutes.
>> Again, I had 80 reducers. When I look at the logs of these reducers in
>> the Hadoop JobTracker, I see
>>
>> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
>> in 0 queues
>>
>> in all of them, which leads me to think that the 79 completed reducers
>> actually fetched nothing, which might explain why this 1 stuck reducer is
>> working so hard.
>>
>> This may be expected, since I am crawling a single domain. This one
>> reducer's log on the JobTracker, however, is empty. Don't know what to
>> make of that.
>>
>> On Tue, Jul 2, 2013 at 4:15 PM, Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> On Tue, Jul 2, 2013 at 3:53 PM, h b <hb6...@gmail.com> wrote:
>>>
>>>> So, I tried this with the generate.max.count property set to 5000,
>>>> rebuilt (ant; ant jar; ant job), and reran fetch.
>>>> It still appears the same: the first 79 reducers zip through and the
>>>> last one is crawling, literally...
>>>
>>> Sorry, I should have been more explicit. This property does not directly
>>> affect fetching. It is used when GENERATING fetch lists, meaning that it
>>> needs to be present and acknowledged at the generate phase, before
>>> fetching is executed.
>>> Besides this, is there any progress being made at all on the last
>>> reduce? If you look at the CPU (and heap) for the box this is running
>>> on, it is usual to see high utilization of both. Maybe this output
>>> writer is just taking a good while to write data down to HDFS...
>>> assuming you are using 1.x.
>>>
>>>> As for the logs, I mentioned on one of my earlier threads that when I
>>>> run from the deploy directory, I am not getting any logs generated.
>>>> I looked for the logs directory under local as well as under deploy,
>>>> and, just to make sure, also on the grid. I do not see the logs
>>>> directory. So I created it manually under deploy before starting
>>>> fetch, and still there is nothing in this directory.
>>>
>>> OK, so when you run Nutch as a deployed job, your logs are present
>>> within $HADOOP_LOG_DIR. You can check some logs on the JobTracker
>>> WebApp, e.g. you will be able to see the reduce tasks for the fetch
>>> job, and you will also be able to see varying snippets or all of the
>>> log there.
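For anyone finding this thread later: since generate.max.count is only consulted at the generate phase, it has to be in conf/nutch-site.xml (and rebuilt into the job jar with ant job) before you run generate, not fetch. A sketch of the relevant nutch-site.xml entries, assuming the property names from the stock nutch-default.xml; the value 5000 is just the figure discussed above:

```xml
<!-- conf/nutch-site.xml (sketch; rebuild the job jar after editing) -->
<configuration>
  <!-- Cap the number of URLs per host/domain in each generated fetch list -->
  <property>
    <name>generate.max.count</name>
    <value>5000</value>
  </property>
  <!-- How URLs are counted against generate.max.count: "host" or "domain" -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>
</configuration>
```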
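On the single-domain symptom: Nutch partitions URLs by host by default, so a crawl of one domain puts every URL into one partition and one polite fetch queue, which would explain 79 idle reducers and one busy one. If politeness toward that host is acceptable to relax, the fetcher queue settings below are the usual knobs. This is a sketch assuming the standard nutch-default.xml property names and defaults; tune the values to your own crawl policy:

```xml
<!-- conf/nutch-site.xml (sketch) -->
<configuration>
  <!-- Queue URLs per host (default); one host means one queue -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>
  </property>
  <!-- Allow more than one concurrent fetch thread against a single queue -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>
  <!-- Seconds to wait between successive requests to the same server -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>
</configuration>
```

Raising fetcher.threads.per.queue above 1 trades crawl politeness for speed against that one host, so check the target site's tolerance (and robots.txt Crawl-delay) first.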