Hi David, yes, it's a Tomcat web service cache.
On Linux you can open the dump file with the "less" command, or you can run
"bin/nutch readseg -get segments/20130121115214/ http://www.cnbeta.com/" to
dump the information for a specific URL.

On Tue, Mar 5, 2013 at 3:02 PM, feng lu <amuseme...@gmail.com> wrote:
>
> On Tue, Mar 5, 2013 at 2:49 PM, David Philip <davidphilipshe...@gmail.com> wrote:
>> Hi,
>>
>> By web server cache, do you mean /tomcat/work/, where Solr is running?
>> Is that the cache you meant?
>>
>> I tried the command {bin/nutch readseg -dump crawltest/segments/20130304185844/ crawltest/test}
>> and it produced a dump file whose format is GMC link (application/x-gmc-link).
>> I am not able to open it. How do I open this file?
>>
>> However, when I ran: bin/nutch readseg -list crawltest/segments/20130304185844/
>> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
>> 20130304185844  1          2013-03-04T18:58:53  2013-03-04T18:58:53  1        1
>>
>> - David
>>
>> On Tue, Mar 5, 2013 at 11:25 AM, feng lu <amuseme...@gmail.com> wrote:
>>> Hi David
>>>
>>> Did you clear the web server cache? Maybe the refetch is also crawling
>>> the old page.
>>>
>>> Maybe you can dump the URL content to check the modification, using the
>>> bin/nutch readseg command.
>>>
>>> Thanks
>>>
>>> On Tue, Mar 5, 2013 at 1:28 PM, David Philip <davidphilipshe...@gmail.com> wrote:
>>>> Hi Markus,
>>>>
>>>> I was trying the *db.injector.update* point that you mentioned;
>>>> please see my observations below.
>>>> Settings: I set *db.injector.update* to *true* and
>>>> *db.fetch.interval.default* to *1 hour*.
>>>>
>>>> *Observation:*
>>>>
>>>> On the first crawl[1], 14 URLs were successfully crawled and indexed
>>>> to Solr.
>>>> Case 1:
>>>> Among those 14 URLs, I modified the content and title of one URL (say
>>>> Aurl) and re-executed the crawl after one hour.
>>>> I see that this URL (Aurl) was re-fetched (it shows in the log), but
>>>> at the Solr level the content and title fields for that URL did not
>>>> get updated. Why? Do I need any configuration to make the Solr index
>>>> get updated?
>>>>
>>>> Case 2:
>>>> I added a new URL to the crawled site, and that URL got indexed - a
>>>> success. So I am interested to know why the case above failed. What
>>>> configuration needs to be made?
>>>>
>>>> Thanks - David
>>>>
>>>> *PS:*
>>>> Apologies that I am still asking questions on the same topic. I have
>>>> not been able to find a good way to do incremental crawls, so I am
>>>> trying different approaches. Once I am clear, I will blog this and
>>>> share it. Thanks a lot for the replies on the list.
>>>>
>>>> On Wed, Feb 27, 2013 at 4:06 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>> You can simply re-inject the records; you can overwrite and/or
>>>>> update the current record. See the db.injector.update and overwrite
>>>>> settings.
>>>>>
>>>>> -----Original message-----
>>>>>> From: David Philip <davidphilipshe...@gmail.com>
>>>>>> Sent: Wed 27-Feb-2013 11:23
>>>>>> To: user@nutch.apache.org
>>>>>> Subject: Re: Nutch Incremental Crawl
>>>>>>
>>>>>> Hi Markus, I meant overriding the injected interval. How do I
>>>>>> override the injected fetch interval? While crawling, the fetch
>>>>>> interval was set to 30 days (the default). Now I want to re-fetch
>>>>>> the same site (that is, to force a re-fetch) without waiting for
>>>>>> the 30-day fetch interval. How can we do that?
>>>>>>
>>>>>> Feng Lu: Thank you for the reference link.
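[Editor's note] The force-refetch question above can be sketched as a small shell session. This is a minimal illustration, not the thread's own commands: the seed directory `refetch_seeds` and the URL are made-up names, and the final inject call is commented out because it needs a real Nutch installation and CrawlDb.

```shell
# Hypothetical seed directory; the URL is only an example.
mkdir -p refetch_seeds

# Tab-separated metadata after the URL sets a fixed fetch interval of
# one day (86400 s) for this record instead of the 30-day default.
printf 'http://www.example.com/\tnutch.fixedFetchInterval=86400\n' \
  > refetch_seeds/urls.txt

cat refetch_seeds/urls.txt

# With db.injector.update=true set in conf/nutch-site.xml, re-injecting
# updates the existing CrawlDb entry rather than skipping it:
# bin/nutch inject crawl/crawldb refetch_seeds
```

The same metadata mechanism carries `nutch.score` as well, as the rest of the thread shows.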
>>>>>> Thanks - David
>>>>>>
>>>>>> On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>>>> The default or the injected interval? The default interval can be
>>>>>>> set in the config (see nutch-default for an example). Per-URL
>>>>>>> intervals can be set using the injector:
>>>>>>> <URL>\tnutch.fixedFetchInterval=86400
>>>>>>>
>>>>>>> -----Original message-----
>>>>>>>> From: David Philip <davidphilipshe...@gmail.com>
>>>>>>>> Sent: Wed 27-Feb-2013 06:21
>>>>>>>> To: user@nutch.apache.org
>>>>>>>> Subject: Re: Nutch Incremental Crawl
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Thank you very much for the replies - very useful information for
>>>>>>>> understanding how incremental crawling can be achieved.
>>>>>>>>
>>>>>>>> Dear Markus:
>>>>>>>> Can you please tell me how I can override this fetch interval, in
>>>>>>>> case I need to fetch a page before the time interval has passed?
>>>>>>>>
>>>>>>>> Thanks very much
>>>>>>>> - David
>>>>>>>>
>>>>>>>> On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>>>>>> If you want records to be fetched at a fixed interval, it is
>>>>>>>>> easier to inject them with a fixed fetch interval.
>>>>>>>>> nutch.fixedFetchInterval=86400
>>>>>>>>>
>>>>>>>>> -----Original message-----
>>>>>>>>>> From: kemical <mickael.lume...@gmail.com>
>>>>>>>>>> Sent: Thu 14-Feb-2013 10:15
>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>> Subject: Re: Nutch Incremental Crawl
>>>>>>>>>>
>>>>>>>>>> Hi David,
>>>>>>>>>>
>>>>>>>>>> You can also consider setting a shorter fetch interval with
>>>>>>>>>> nutch inject. This way you set a higher score (so the URL is
>>>>>>>>>> always taken with priority when you generate a segment) and a
>>>>>>>>>> fetch interval of one day.
>>>>>>>>>>
>>>>>>>>>> If your case is similar to mine, you will often want certain
>>>>>>>>>> homepages fetched every day but not their inlinks. What you can
>>>>>>>>>> do is inject all your seed URLs again (assuming those URLs are
>>>>>>>>>> only homepages).
>>>>>>>>>> # Change the Nutch option so existing URLs can be injected
>>>>>>>>>> # again, in conf/nutch-default.xml or conf/nutch-site.xml:
>>>>>>>>>> db.injector.update=true
>>>>>>>>>>
>>>>>>>>>> # Add metadata to update the score / fetch interval. The
>>>>>>>>>> # following line appends the new score and new interval to each
>>>>>>>>>> # line of your seed URL files:
>>>>>>>>>> perl -pi -e 's/^(.*)\n$/$1\tnutch.score=100\tnutch.fetchInterval=80000\n/' [your_seed_url_dir]/*
>>>>>>>>>>
>>>>>>>>>> # Run the inject command:
>>>>>>>>>> bin/nutch inject crawl/crawldb [your_seed_url_dir]
>>>>>>>>>>
>>>>>>>>>> Now the following crawls will take your URLs with top priority
>>>>>>>>>> and crawl them once a day. I have used my own situation to
>>>>>>>>>> illustrate the concept, but I guess you can tweak the
>>>>>>>>>> parameters to fit your needs.
>>>>>>>>>>
>>>>>>>>>> This approach is useful when you want a regular fetch of some
>>>>>>>>>> URLs; if that is needed only rarely, I guess freegen is the
>>>>>>>>>> right choice.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Mike
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> View this message in context:
>>>>>>>>>> http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
>>>>>>>>>> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Don't Grow Old, Grow Up... :-)
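[Editor's note] The seed-rewriting step quoted above can be exercised on a throwaway file. This is a minimal sketch: `demo_seeds/` and the example URLs are made-up names, and the inject call is commented out because it requires a Nutch installation and an existing CrawlDb. Note the substitution keeps its closing `/` delimiter and re-emits the trailing newline.

```shell
# Hypothetical throwaway seed directory, for illustration only.
mkdir -p demo_seeds
printf 'http://www.example.com/\nhttp://www.example.org/\n' > demo_seeds/urls.txt

# Append score and fetch-interval metadata to every seed line,
# tab-separated, in the form the Nutch injector reads.
perl -pi -e 's/^(.*)\n$/$1\tnutch.score=100\tnutch.fetchInterval=80000\n/' demo_seeds/*

cat demo_seeds/urls.txt

# The rewritten seeds would then be injected with something like:
# bin/nutch inject crawl/crawldb demo_seeds
```

After the one-liner, each line reads `URL<TAB>nutch.score=100<TAB>nutch.fetchInterval=80000`, which is what makes the re-injected records rank high in generation and become refetchable daily.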