Hi,

I have followed the given link and updated 'db.max.outlinks.per.page' to -1
in the 'nutch-default' file.
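
For reference, the override looks roughly like this (Nutch configuration is
Hadoop-style XML and local overrides normally go in conf/nutch-site.xml, so
the file placement and the description wording below are just how I have it,
not necessarily the recommended form):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>Negative value: do not cap the number of outlinks taken
  from each page.</description>
</property>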

but I am facing the same issue while crawling
http://www.halliburton.com/en-US/default.page and cnn.com. Below is the last
line of the fetcher job, which shows 0 pages found on the 3rd or 4th
iteration:

0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
in 0 queues
-activeThreads=0
FetcherJob: done

Please note that when I crawl Amazon and other sites it works fine. Do you
think it is because of some restriction on the Halliburton site (robots.txt)
or some misconfiguration at my end?
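
One quick way I can think of to check the robots.txt theory (assuming curl is
available on the crawl machine) is to fetch the file directly and look for
Disallow or Crawl-delay rules that could apply to the Nutch agent:

curl -s http://www.halliburton.com/robots.txt
curl -s http://www.cnn.com/robots.txt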

Regards,
Jamshaid


On Fri, Jun 28, 2013 at 12:37 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi,
> Can you please try this
> http://s.apache.org/wIC
> Thanks
> Lewis
>
>
> On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf <[email protected]
> >wrote:
>
> > Hi,
> >
> > I'm using nutch 2.x with HBase and tried to crawl the
> > http://www.halliburton.com/en-US/default.page site for depth level 5.
> >
> > Following is the command:
> >
> > bin/crawl urls/seed.txt HB http://localhost:8080/solr/ 5
> >
> >
> > It worked well until the 3rd iteration, but in the remaining 4th and 5th
> > iterations nothing was fetched (the same happened with cnn.com). But if I
> > try to crawl other sites like Amazon with depth level 5, it works.
> >
> > Could you please advise what the reasons for the 4th and 5th iterations
> > failing could be?
> >
> >
> > Regards,
> > Jamshaid
> >
>
>
>
> --
> *Lewis*
>
