On Mon, Dec 9, 2013 at 6:07 PM, S.L <simpleliving...@gmail.com> wrote:

> Thanks for a great reply!
>
> Right now I have 4 URLs in my seed file, with domains d1, d2, d3, d4.


> I see that when the Nutch job is run on Hadoop it is only picking up
> URLs for d4; there does not seem to be any parallelism.
>

I would recommend running all phases of Nutch INDIVIDUALLY and looking
into the logs for the generate and fetch phases. Set the log level for
generate to DEBUG.
One possible reason: all URLs of host 'd4' had higher scores than the
others. This is less likely to be the cause here, since your topN value
is large.
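
For example, in deploy mode you can launch each phase from the job file.
This is a sketch, assuming the Nutch 1.x main class names and your crawl
dir 'crawldirectory'; <segment> stands for the timestamped directory
that the generate step creates under crawldirectory/segments:

  bin/hadoop jar apache-nutch-1.8-SNAPSHOT.job \
    org.apache.nutch.crawl.Injector crawldirectory/crawldb urls
  bin/hadoop jar apache-nutch-1.8-SNAPSHOT.job \
    org.apache.nutch.crawl.Generator crawldirectory/crawldb \
    crawldirectory/segments -topN 30000
  bin/hadoop jar apache-nutch-1.8-SNAPSHOT.job \
    org.apache.nutch.fetcher.Fetcher crawldirectory/segments/<segment>
  bin/hadoop jar apache-nutch-1.8-SNAPSHOT.job \
    org.apache.nutch.parse.ParseSegment crawldirectory/segments/<segment>
  bin/hadoop jar apache-nutch-1.8-SNAPSHOT.job \
    org.apache.nutch.crawl.CrawlDb crawldirectory/crawldb \
    crawldirectory/segments/<segment>

For the DEBUG log level, one way (assuming the stock log4j setup that
ships with Nutch) is to add this line to conf/log4j.properties before
rebuilding the job file:

  log4j.logger.org.apache.nutch.crawl.Generator=DEBUG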


> I am running the Nutch job using the following command.
>
> bin/hadoop jar
> /home/general/workspace/nutch/runtime/deploy/apache-nutch-1.8-SNAPSHOT.job
> org.apache.nutch.crawl.Crawl urls -dir crawldirectory -depth 1000
> -topN 30000


I am not sure, but I think the Crawl command is deprecated. You might
have to use the 'bin/crawl' script instead.
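
Something like the following (a sketch; the exact arguments differ
between versions, so run bin/crawl without arguments to see the usage
message of your build, and treat the Solr URL as a placeholder for
wherever you index):

  bin/crawl urls crawldirectory http://localhost:8983/solr/ 2

where the final number is the number of crawl rounds.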

>
> On Mon, Dec 9, 2013 at 8:16 PM, Tejas Patil <tejas.patil...@gmail.com
> >wrote:
>
> > When you run Nutch over Hadoop, i.e. deploy mode, you use the job file
> > (apache-nutch-1.X.job). This is nothing but a big zip file containing
> > (you can unzip it and verify yourself):
> > (a) all the compiled Nutch classes,
> > (b) the config files, and
> > (c) the dependent jars.
> >
> > When Hadoop launches map-reduce jobs for Nutch:
> > 1. The Nutch job file is copied over to the node where your task
> > (say, a map task) is executed.
> > 2. It is unpacked.
> > 3. Nutch reads nutch-site.xml and nutch-default.xml and loads the
> > configs.
> > 4. By default, plugin.folders is set to "plugins", which is a
> > relative path: Nutch looks for the plugin classes on the classpath
> > under a directory named "plugins" (see the property snippet after
> > this list).
> > 5. The "plugins" directory sits under a directory named "classes"
> > that is on the classpath (inside the extracted job file). The
> > required plugin classes are loaded from there and everything runs
> > fine.
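> >
> > For reference, the default looks roughly like this in
> > conf/nutch-default.xml (a sketch; check the file shipped with your
> > version):
> >
> >   <property>
> >     <name>plugin.folders</name>
> >     <value>plugins</value>
> >   </property>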
> >
> > In short: Leave it as it is. It should work over Hadoop by default.
> >
> > Thanks,
> > Tejas
> >
> > On Mon, Dec 9, 2013 at 4:54 PM, S.L <simpleliving...@gmail.com> wrote:
> >
> > > What should the plugins property be set to when running Nutch as a
> > > Hadoop job?
> > >
> > > I just created a deploy-mode jar by running the ant script, and I
> > > see that the value of the plugins property is being copied from the
> > > configuration and used in the Hadoop job. While it seems to be
> > > finding the plugins directory because Hadoop is being run on the
> > > same machine, I am sure it will fail when moved to a different
> > > machine.
> > >
> > > How should I set the plugins property so that it is relative to the
> > > Hadoop job?
> > >
> > > Thanks
> > >
> >
>
