I dropped my webpage database and restarted with 5 seed URLs. The first fetch completed in a few seconds. The second run still shows 1 reduce running; although it shows as 100% complete, my thought is that it is writing out to disk, though it has now been 30+ minutes. Again, I had 80 reducers, and when I look at the logs of these reducers in the Hadoop JobTracker, I see

    0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues

in all of them, which leads me to think that the completed 79 reducers actually fetched nothing, which might explain why this 1 stuck reducer is working so hard. This may be expected, since I am crawling a single domain. This one reducer's log on the JobTracker, however, is empty. Don't know what to make of that.
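For what it's worth, here is a back-of-the-envelope sketch of why the skew itself may be expected, assuming the documented Nutch defaults (not verified against this particular build): fetch lists are partitioned by host, so a single-domain crawl puts every URL into one reducer's queue, and the politeness settings then serialize the requests to that host:

    # rough arithmetic, assuming default politeness settings:
    #   fetcher.server.delay      = 5.0  (seconds between requests to one host)
    #   fetcher.threads.per.queue = 1    (threads allowed per host queue)
    # one host queue => about 1 page every 5 seconds, so a fetch list of
    # ~5000 URLs would keep that single reducer busy for ~25000 s (~7 hours)
    #
    # the values actually in effect can be checked with, e.g.:
    grep -A 1 'fetcher.server.delay\|fetcher.threads.per.queue' \
        conf/nutch-default.xml conf/nutch-site.xml

So the one busy reducer may simply be rate-limited rather than hung; its empty task log would still be odd, though.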
On Tue, Jul 2, 2013 at 4:15 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> Hi,
>
> On Tue, Jul 2, 2013 at 3:53 PM, h b <hb6...@gmail.com> wrote:
>
> > So, I tried this with the generate.max.count property set to 5000,
> > rebuilt with ant; ant jar; ant job, and reran fetch.
> > It still appears the same: the first 79 reducers zip through and the
> > last one is crawling, literally...
>
> Sorry, I should have been more explicit. This property does not directly
> affect fetching. It is used when GENERATING fetch lists. Meaning that it
> needs to be present and acknowledged at the generate phase... before
> fetching is executed.
> Besides this, is there any progress being made at all on the last
> reduce? If you look at your CPU (and heap) for the box this is running
> on, it is usual to notice high levels for both of these respectively.
> Maybe this output writer is just taking a good while to write data down
> to HDFS... assuming you are using 1.x.
>
> > As for the logs, I mentioned in one of my earlier threads that when I
> > run from the deploy directory, I am not getting any logs generated.
> > I looked for the logs directory under local as well as under deploy,
> > and, just to make sure, also on the grid. I do not see the logs
> > directory. So I created it manually under deploy before starting
> > fetch, and still there is nothing in this directory.
>
> OK, so when you run Nutch as a deployed job, your logs are present
> within $HADOOP_LOG_DIR... You can check some logs on the JobTracker
> WebApp, e.g. you will be able to see the reduce tasks for the fetch job
> and you will also be able to see varying snippets or all of the log
> there.
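One more note on the empty-logs confusion, in case it helps: in deploy mode Nutch's log4j output goes through Hadoop's task logs on the TaskTracker nodes, not to a local logs/ directory, which would explain why a manually created deploy/logs stays empty. On a Hadoop 1.x node the per-attempt logs usually sit under $HADOOP_LOG_DIR/userlogs (the job and attempt IDs below are made up for illustration):

    ls $HADOOP_LOG_DIR/userlogs/
    # one directory per job, one per task attempt inside it, e.g.
    #   userlogs/job_201307021615_0007/attempt_201307021615_0007_r_000079_0/
    # each attempt directory holds stdout, stderr and syslog; the log4j
    # output lands in syslog:
    tail -f $HADOOP_LOG_DIR/userlogs/job_*/attempt_*_r_*/syslog

These are the same files the JobTracker WebApp (default port 50030) shows when you drill down into a job's reduce tasks.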