RE: Nutch cannot crawl entire website

Markus Jelsma Tue, 01 Mar 2016 03:47:18 -0800

Hi - I am not familiar with 2.x, but if those are your commands, then you are 
missing either the parse job or fetcher.parse=true, and you are not running an 
updatedb job to write the discovered records back to the DB.
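
A minimal sketch of the full cycle described above, assuming the same 
./nutch launcher and seed path as in the quoted message. The function names 
(crawl_round, crawl_site) and the NUTCH variable are hypothetical, and the 
-all arguments to parse and updatedb are assumed to mirror fetch -all; check 
the usage output of ./nutch for your version before relying on them.

```shell
#!/bin/sh
# Sketch only: NUTCH, crawl_round and crawl_site are illustrative names,
# not part of Nutch itself.
NUTCH="${NUTCH:-./nutch}"

# One crawl round: generate a batch, fetch it, parse it, then write the
# newly discovered links back to the DB so the next generate can see them.
crawl_round() {
    top_n="$1"
    $NUTCH generate -topN "$top_n" || return 1
    $NUTCH fetch -all || return 1
    $NUTCH parse -all || return 1      # assumed unnecessary if fetcher.parse=true
    $NUTCH updatedb -all || return 1   # assumption: -all is accepted here
}

# Inject the seeds once, then repeat the round a fixed number of times.
crawl_site() {
    seeds="$1"
    rounds="$2"
    top_n="$3"
    $NUTCH inject "$seeds" || return 1
    i=1
    while [ "$i" -le "$rounds" ]; do
        crawl_round "$top_n" || return 1
        i=$((i + 1))
    done
}
```

For example, crawl_site ../urls/seed.txt 3 2500 would inject the seeds and 
run three rounds. Each round can only fetch URLs that an earlier updatedb 
wrote to the DB, so covering a whole site generally takes several rounds.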


Markus 

-----Original message-----
> From:Tom Running <runningt...@gmail.com>
> Sent: Tuesday 1st March 2016 5:39
> To: user@nutch.apache.org
> Subject: Nutch cannot crawl entire website
> 
> Hello,
> 
> I am using nutch 2.3.1
> 
> I perform the commands:
> ./nutch inject ../urls/seed.txt
> ./nutch generate -topN 2500
> ./nutch fetch -all
> 
> The problem is, the data only displays the raw HTML from the first
> URL/page. All the other URLs that were accumulated by the generate command
> are not actually crawled.
> 
> I cannot get nutch to crawl the other generated urls...I also cannot get
> nutch to crawl the entire website. What are the options that I need to use
> to crawl an entire site?
> 
> Does anyone have any insights or recommendations?
> 
> Thank you so much for your help,
> -T
>
