Hi Sebastian,

Thank you for the reply, the steps mentioned in the previous email worked. Thanks.

One last question, about incremental crawl: my understanding is that when the crawler is run on a daily basis (as a cron job), it should check each URL in its fetch list for the last modified date and, if it does not match the previous date, re-fetch only that page and update the db. Is that possible? [Any way to do this?]

[Not by manually maintaining a flat list of modified URLs, feeding it to FreeGenerator and then updating the db: websites are modified at any time, that is not under our control, and I don't know how to get the list of modified URLs. Also not by changing db.fetch.interval.default to one day, which would re-fetch all the pages every day and take the same time as a fresh crawl, correct? (Our list is up to 5000 URLs, and the fresh crawl took 8 hours.) So if I set db.fetch.interval.default to 1 day, the job would take 8 hours every day.]
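In nutch-default.xml I came across db.fetch.schedule.class and the adaptive schedule. Just a guess on my side, but would something along these lines in nutch-site.xml be the right direction? (The max_interval value below is only a placeholder I picked.)

  <!-- shorten/lengthen each URL's fetch interval depending on whether
       the page actually changed on the last re-fetch -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>

  <!-- upper bound on the interval: never wait more than 7 days
       (value is in seconds) before trying a page again -->
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>604800</value>
  </property>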
Thanks very much - David.


On Tue, Feb 5, 2013 at 2:30 AM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> Hi David,
>
> the first steps are right, but maybe it's easier to run the Java classes
> via bin/nutch:
>
>   bin/nutch freegen urls2/ freegen_segments/
>   # generated: freegen_segments/123
>   bin/nutch fetch freegen_segments/123
>   bin/nutch parse freegen_segments/123   (if fetcher.parse == false)
>   bin/nutch updatedb ... freegen_segments/123 ...
>   bin/nutch invertlinks ... freegen_segments/123 ...
>
> Have a look at the command-line help for more details on the arguments;
> there is also a wiki page.
>
> But now you don't run the crawl command, just:
>
>   bin/nutch solrindex ... freegen_segments/123
>
> The central point is to process only the segment you just created via
> FreeGenerator. Either you place it in a new directory, or store its name
> in a variable.
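> With the variable approach it could look roughly like this (an untested
> sketch: the segment name under freegen_segments/ will be a timestamp,
> the crawldb/linkdb paths are only examples, and you should check the
> command-line help of solrindex for the exact arguments of your version):
>
>   bin/nutch freegen urls2/ freegen_segments/
>   segment=$(ls -d freegen_segments/* | tail -1)   # the segment just created
>   bin/nutch fetch $segment
>   bin/nutch parse $segment                        # if fetcher.parse == false
>   bin/nutch updatedb crawl/crawldb $segment
>   bin/nutch invertlinks crawl/linkdb $segment
>   bin/nutch solrindex http://localhost:8080/solrnutch crawl/crawldb \
>     -linkdb crawl/linkdb $segment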
>
> Cheers,
> Sebastian
>
> On 02/04/2013 10:38 AM, David Philip wrote:
>> Hi Sebastian,
>>
>> Thank you for the reply. I executed the following steps; please correct
>> me if I am wrong. I do not see the changes updated.
>>
>> Run:
>>
>>   - org.apache.nutch.tools.FreeGenerator, arguments: urls2 crawl/segments
>>     [urls2/seed.txt holds the URL of the page that was modified]
>>   - org.apache.nutch.fetcher.Fetcher, arguments: crawl/segments/*
>>   - org.apache.nutch.parse.ParseSegment, arguments: crawl/segments/*
>>   - org.apache.nutch.crawl.CrawlDb, arguments: crawlDBOld/crawldb
>>     crawl/segments/*
>>
>> After this I ran the crawl command:
>>
>>   - org.apache.nutch.crawl.Crawl, arguments: urls -dir crawlDB_Old
>>     -depth 10 -solr http://localhost:8080/solrnutch
>>
>> But I don't see the changes for that URL in the Solr indexes. Can you
>> please tell me which step was skipped or not executed properly? I wanted
>> to see the changes reflected in the Solr indexes for that document.
>>
>> Thanks - David
>>
>> On Sat, Feb 2, 2013 at 5:27 AM, Sebastian Nagel
>> <wastl.na...@googlemail.com> wrote:
>>
>>> Hi David,
>>>
>>>> So even if a modification is made on a fetched page before this
>>>> interval and the crawl job is run, it will still not be
>>>> re-fetched/updated unless this interval is crossed.
>>> Yes. That's correct.
>>>
>>>> is there any way to do an immediate update?
>>> Yes, provided that you know which documents have been changed, of
>>> course. Have a look at o.a.n.tools.FreeGenerator (Nutch 1.x). Start a
>>> segment for a list of URLs, fetch and parse it, and update CrawlDb.
>>>
>>> Sebastian
>>>
>>> On 02/01/2013 10:32 AM, David Philip wrote:
>>>> Hi Team,
>>>>
>>>> I have a question on nutch re-crawl. Please let me know if there are
>>>> any links explaining this.
>>>> The question is: does nutch re-crawl sites/pages by checking the last
>>>> date modified?
>>>>
>>>> I understand from this website
>>>> <http://nutch.wordpress.com/category/recrawling/> that at the time of
>>>> a fresh crawl, the db.default.fetch.interval parameter will be set for
>>>> each URL. This property decides whether or not a page should be
>>>> re-fetched on subsequent crawls. So even if a modification is made on
>>>> a fetched page before this interval and the crawl job is run, it will
>>>> still not be re-fetched/updated unless this interval is crossed. If
>>>> the above is the case, is there any way to do an immediate update?
>>>>
>>>> Example case:
>>>> Say we crawled a blog site that had x blogs, and the fetch interval
>>>> was set to 7 days. But the blog user added a new blog and also
>>>> modified an existing one within 2 days. In such a case, when we want
>>>> to update our crawl data immediately and include the newly added
>>>> changes [not waiting for 7 days], what needs to be done? [the nutch
>>>> merge concept?]
>>>>
>>>> Awaiting reply,
>>>> Thanks - David