Hi
I could not find the freegen tool in Nutch 2.x. What should I do?
Thanks
On Thursday 25 February 2016 12:24 PM, harsh wrote:
Dear Markus
Thanks for your help. I hope it will solve my problem. Thanks a lot.


On Wednesday 24 February 2016 06:12 PM, Markus Jelsma wrote:
Ah, forget about it; you are on 2.x, as I read in the next message. But I think it also has a freegen tool.
Markus

    -----Original message-----
From:Markus Jelsma <markus.jel...@openindex.io>
Sent: Wednesday 24th February 2016 13:41
To: user@nutch.apache.org
Subject: RE: recrawling of specific URLS

Hi - the easiest method is to use the freegen tool. But if you really want homepages, not just domain roots, you can use the hostdb together with freegen.

# Update the hostdb
bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/

# Get list of homepages for each host
bin/nutch readhostdb crawl/hostdb/ output -dumpHomepages

Then use freegen on the resulting list of homepages.
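
A minimal sketch of that last step, assuming Nutch 1.x, the `output` directory written by readhostdb above, and the `crawl/` layout used in the commands above. The `run` helper is hypothetical and only there so the script prints each command and executes it only when `bin/nutch` is actually present. The fetch/parse/updatedb calls are the usual 1.x cycle and are an assumption about what should follow, not something stated in this thread.

```shell
# Sketch only: recrawl host homepages with Nutch 1.x (all paths are assumptions).
NUTCH=${NUTCH:-bin/nutch}        # path to the nutch launcher
CRAWL=${CRAWL:-crawl}            # crawl directory, as in the commands above
HOMEPAGES=${HOMEPAGES:-output}   # readhostdb -dumpHomepages output directory

# Hypothetical helper: print the command, run it only if the launcher exists.
run() {
  echo "+ $*"
  if [ -x "$NUTCH" ]; then "$@"; fi
}

# Build a fetch segment directly from the URL list, bypassing the crawldb schedule.
run "$NUTCH" freegen "$HOMEPAGES" "$CRAWL/segments"

# Pick the newest segment (segment names are timestamps) and process it.
SEGMENT=$(ls -d "$CRAWL"/segments/2* 2>/dev/null | tail -1)
if [ -n "$SEGMENT" ]; then
  run "$NUTCH" fetch "$SEGMENT"
  run "$NUTCH" parse "$SEGMENT"
  run "$NUTCH" updatedb "$CRAWL/crawldb" "$SEGMENT"
else
  echo "no segment found under $CRAWL/segments (dry run?)"
fi
```

freegen also accepts `-filter` and `-normalize` if the URL list should pass through the configured URL filters and normalizers first.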

Markus
    -----Original message-----
From:harsh <harsh.sha...@orkash.com>
Sent: Wednesday 24th February 2016 12:49
To: user@nutch.apache.org
Subject: recrawling of specific URLS

Hi All

Nutch is designed to refetch ALL the URLs after a certain period of time.
But I want to recrawl only the home pages of my seed URLs so that I can get
new links from those home pages to crawl.
Currently I am relying on the bug "Inject command re-inject seed URLS." to
recrawl my seed URLs. But this is not the standard way.
Please give a suggestion. I have read articles/discussions on
re-crawling, but could not find a solution.
Lewis, Tejas, please help!

Thanks


