Ah, forget about it; I read in the next message that you are on 2.x. But I
think 2.x also has a freegen tool.
Markus

 
 
-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Wednesday 24th February 2016 13:41
> To: user@nutch.apache.org
> Subject: RE: recrawling of specific URLS
> 
> Hi - the easiest method is to use the freegen tool. But if you really want
> homepages, not just domain roots, you can use the hostdb together with freegen.
> 
> # Update the hostdb
> bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/
> 
> # Get list of homepages for each host
> bin/nutch readhostdb crawl/hostdb/ output -dumpHomepages
> 
> Then use freegen.
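> For example (a sketch; 'output' is the readhostdb dump directory from the
> previous step, and crawl/segments is an assumed segments path in your crawl
> directory):
> 
> # Generate a fetch segment directly from the dumped homepage URLs,
> # applying the configured URL filters and normalizers
> bin/nutch freegen output crawl/segments -filter -normalize
> 
> Then fetch, parse and updatedb that segment as in a normal crawl cycle.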
> 
> Markus
>  
>  
> -----Original message-----
> > From:harsh <harsh.sha...@orkash.com>
> > Sent: Wednesday 24th February 2016 12:49
> > To: user@nutch.apache.org
> > Subject: recrawling of specific URLS
> > 
> > Hi All
> > 
> > Nutch is made to update ALL the URLs after a certain period of time.
> > But I want to recrawl only the home pages of my seed URLs so that I can
> > pick up new links from those home pages to crawl.
> > Currently I am relying on the bug "Inject command re-inject seed URLS."
> > to recrawl my seed URLs, but this is not the standard way.
> > Please give a suggestion. I have read articles/discussions on
> > re-crawling but could not find a solution.
> > Lewis, Tejas, please help!
> > 
> > Thanks
> > 
> 
