The steps you performed are right.

Did you get the log for that one "hardworking" reducer? It should hint at
why the job took so long. Ideally you should get the logs for every job and
its task attempts. If you cannot get the log for that reducer, then I suspect
your cluster has a problem that needs to be addressed.
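
If the JobTracker web UI is not an option, then (assuming a Hadoop 1.x
cluster with the default layout) the raw task attempt logs usually live on
the individual TaskTracker nodes, roughly here:

  $HADOOP_LOG_DIR/userlogs/<job id>/<attempt id>/  (syslog, stdout, stderr)

The <job id> and <attempt id> placeholders are the ones shown on the
jobtracker page for the fetch job; the exact layout varies a bit between
versions. The fetcher's per-thread status messages from that one busy
reducer should end up in its syslog.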


On Wed, Jul 3, 2013 at 8:47 AM, h b <hb6...@gmail.com> wrote:

> Hi Tejas, looks like we were typing at the same time.
> Anyway, my job ended fine. Just to be sure that what I am doing is right, I
> have cleared the db and started another round. If I stumble again, I will
> respond on this thread.
>
>
> On Wed, Jul 3, 2013 at 8:43 AM, Tejas Patil <tejas.patil...@gmail.com> wrote:
>
> > > The second run still shows 1 reduce running, although it shows as 100%
> > > complete, so my thought is that it is writing out to the disk, though
> > > it has been about 30+ minutes.
> > > This one reducer's log on the jobtracker, however, is empty.
> >
> > This is weird. There can be an explanation for the first line: the data
> > crawled was large, so dumping it would take a lot of time. But as you
> > said there were very few URLs, so it should not take 30+ minutes unless
> > you crawled some super large files.
> > Have you checked the task attempts for the job? If there are no logs
> > there then there is something weird going on with your cluster.
> >
> >
> > On Wed, Jul 3, 2013 at 8:32 AM, h b <hb6...@gmail.com> wrote:
> >
> > > oh and yes, generate.max.count is set to 5000
> > >
> > >
> > > On Wed, Jul 3, 2013 at 8:29 AM, h b <hb6...@gmail.com> wrote:
> > >
> > > > I dropped my webpage database and restarted with 5 seed urls. The
> > > > first fetch completed in a few seconds. The second run still shows 1
> > > > reduce running, although it shows as 100% complete, so my thought is
> > > > that it is writing out to the disk, though it has been about 30+
> > > > minutes.
> > > > Again, I had 80 reducers; when I look at the log of these reducers in
> > > > the hadoop jobtracker, I see
> > > >
> > > > 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
> > > >
> > > > in all of them, which leads me to think that the completed 79 reducers
> > > > actually fetched nothing, which might explain why this one stuck
> > > > reducer is working so hard.
> > > >
> > > > This may be expected, since I am crawling a single domain. This one
> > > > reducer's log on the jobtracker, however, is empty. I don't know what
> > > > to make of that.
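> > > >
> > > > (If I understand the partitioning correctly, that part is expected:
> > > > the generate/fetch jobs partition URLs by host by default
> > > > (partition.url.mode, handled by URLPartitioner), so a single-domain
> > > > crawl puts all of its URLs into one reduce partition and leaves the
> > > > other reducers with nothing to do.)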
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Jul 2, 2013 at 4:15 PM, Lewis John Mcgibbney <
> > > > lewis.mcgibb...@gmail.com> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> On Tue, Jul 2, 2013 at 3:53 PM, h b <hb6...@gmail.com> wrote:
> > > >>
> > > >> > So, I tried this with the generate.max.count property set to 5000,
> > > >> > rebuilt (ant; ant jar; ant job) and reran fetch.
> > > >> > It still appears the same: the first 79 reducers zip through and the
> > > >> > last one is crawling, literally...
> > > >> >
> > > >>
> > > >> Sorry, I should have been more explicit. This property does not
> > > >> directly affect fetching; it is used when GENERATING fetch lists.
> > > >> That means it needs to be present and acknowledged at the generate
> > > >> phase... before fetching is executed.
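> > > >>
> > > >> For example (just a sketch; the property names are the ones from
> > > >> nutch-default.xml, and the count mode below is a guess at what you
> > > >> want to limit by), something like this in conf/nutch-site.xml has to
> > > >> be in place when the generate job runs:
> > > >>
> > > >>   <property>
> > > >>     <name>generate.max.count</name>
> > > >>     <value>5000</value>
> > > >>   </property>
> > > >>   <property>
> > > >>     <name>generate.count.mode</name>
> > > >>     <value>host</value>
> > > >>   </property>
> > > >>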
> > > >> Besides this, is there any progress being made at all on the last
> > > >> reduce? If you look at the CPU (and heap) on the box this is running
> > > >> on, it is usual to see high levels for both of them. Maybe this
> > > >> output writer is just taking a good while to write data down to
> > > >> HDFS... assuming you are using 1.x.
> > > >>
> > > >>
> > > >> >
> > > >> > As for the logs, I mentioned on one of my earlier threads that when
> > > >> > I run from the deploy directory, I am not getting any logs generated.
> > > >> > I looked for the logs directory under local as well as under deploy,
> > > >> > and just to make sure, also on the grid. I do not see the logs
> > > >> > directory. So I created it manually under deploy before starting
> > > >> > fetch, and there is still nothing in this directory.
> > > >> >
> > > >> >
> > > >> OK, so when you run Nutch as a deployed job, your logs are present
> > > >> within $HADOOP_LOG_DIR... you can check some logs on the JobTracker
> > > >> WebApp, e.g. you will be able to see the reduce tasks for the fetch
> > > >> job, and you will also be able to see varying snippets or all of the
> > > >> log there.
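> > > >> (In deploy mode the actual fetching happens inside the MapReduce
> > > >> tasks on the cluster, so, as far as I understand it, nothing gets
> > > >> written to a local "logs" directory under the deploy directory; the
> > > >> task output only shows up in the task logs under $HADOOP_LOG_DIR on
> > > >> the worker nodes and via the JobTracker pages.)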
> > > >>
> > > >
> > > >
> > >
> >
>
