The (aurl) url is re-fetched (it shows in the log), but at the Solr level the
content field and title field for that url (aurl) didn't get updated. Why?
Should I do any configuration for this to make Solr update them?

I added a new url to the crawling site. The url got indexed - this is a
success. So I am interested to know why the above case failed. What
configuration needs to be made?
Thanks - David

*PS:* Apologies, I still haven't found a good way to do incremental crawls,
so I am trying different approaches. Once I am clear I will blog this and
share it. Thanks a lot for the replies from the mailing list.
On Wed, Feb 27, 2013 at 4:06 PM, Markus Jelsma wrote:

> You can simply reinject the records. You can overwrite and/or update
> the current record. See the db.injector.update and overwrite settings.
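A minimal sketch of what those injector settings might look like in nutch-site.xml. The property names below follow Markus's wording, but treat the exact names, defaults, and descriptions as assumptions to verify against the nutch-default.xml shipped with your Nutch version:

```xml
<!-- Sketch only: check names and defaults in your version's nutch-default.xml. -->
<property>
  <name>db.injector.update</name>
  <value>true</value>
  <description>If true, merge metadata and score from re-injected seeds
  into the existing CrawlDb record instead of ignoring them.</description>
</property>

<property>
  <name>db.injector.overwrite</name>
  <value>false</value>
  <description>If true, a re-injected seed replaces the existing CrawlDb
  record entirely.</description>
</property>
```

With update enabled, re-running `bin/nutch inject` on an already-known url is how you push new per-url settings (score, fetch interval) into the CrawlDb.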
> -Original message-
> From: David Philip
> Sent: Wed 27-Feb-2013 11:23
> To: user@nutch.apache.org
> Subject: Re: Nutch Incremental Crawl
> -Original message-
> From: David Philip
> Sent: Wed 27-Feb-2013 06:21
> To: user@nutch.apache.org
> Subject: Re: Nutch Incremental Crawl
>
> Hi all,
>
> Thank you very much for the replies. Very useful information to
> understand how incremental crawling can be achieved.
>
> Dear Markus:
> Can you please tell me how do I override this fetch interval?
If you want records to be fetched at a fixed interval, it's easier to inject
them with a fixed fetch interval:

nutch.fixedFetchInterval=86400
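The per-url settings can be supplied in the seed file itself: the Nutch injector reads tab-separated name=value pairs after each url. A minimal sketch - the url and values are invented, and the metadata keys (`nutch.score`, `nutch.fixedFetchInterval`) are taken from this thread, so verify them against your Nutch version before relying on them:

```shell
# Sketch: seed file carrying per-URL inject metadata (tab-separated).
# Keys are taken from this thread; verify against your Nutch version.
mkdir -p urls
printf 'http://www.example.com/\tnutch.score=10\tnutch.fixedFetchInterval=86400\n' \
  > urls/seed.txt
cat urls/seed.txt
# then, on a real install: bin/nutch inject crawl/crawldb urls/   (not run here)
```

Re-injecting the same seed later (with the injector update setting enabled) is how the new interval reaches an already-crawled record.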
-Original message-
> From: kemical
> Sent: Thu 14-Feb-2013 10:15
> To: user@nutch.apache.org
> Subject: Re: Nutch Incremental Crawl
Hi David,

You can also consider setting a shorter fetch interval time with nutch
inject. This way you'll set a higher score (so the url is always taken in
priority when you generate a segment) and a fetch interval of 1 day. If you
have a case similar to mine, you'll often want some homepages fetched each
day.
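Why the higher score matters: the generator ranks the urls that are due by score and takes the top N into the next segment, so a boosted url gets selected on every cycle. A toy illustration of that selection logic - this is not Nutch code, and the urls and scores are invented:

```python
# Toy model of "higher score => always generated first" (not Nutch code).

def generate_segment(crawldb, top_n):
    """Pick the top_n due urls by descending score, like a generate step."""
    due = [rec for rec in crawldb if rec["due"]]
    return [rec["url"] for rec in sorted(due, key=lambda r: -r["score"])[:top_n]]

crawldb = [
    {"url": "http://example.com/",       "score": 10.0, "due": True},  # boosted homepage
    {"url": "http://example.com/a.html", "score": 0.5,  "due": True},
    {"url": "http://example.com/b.html", "score": 0.2,  "due": True},
]

print(generate_segment(crawldb, top_n=2))
# → ['http://example.com/', 'http://example.com/a.html']
```

However small top_n is, the injected high-score homepage stays in the segment, which is kemical's point.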
Hi Sebastian,

Thank you for the reply; the steps mentioned in the previous email worked.
Thanks.

One last question about incremental crawl: my understanding is that when the
crawler is run on a daily basis (cron job), it should check each url in its
fetch list for the last date modified and, if it is modified, re-fetch and
update it.
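For the daily cron job mentioned above, a sketch of how the schedule might be installed - the script path and log path are hypothetical, standing in for whatever script runs your generate/fetch/parse/updatedb cycle:

```shell
# Sketch: stage a crontab entry for a daily 02:00 crawl.
# /opt/nutch/bin/daily-crawl.sh is a hypothetical wrapper script.
CRON_LINE='0 2 * * * /opt/nutch/bin/daily-crawl.sh >> /var/log/nutch-cron.log 2>&1'
echo "$CRON_LINE" > mycron.txt
cat mycron.txt
# install with: crontab mycron.txt   (not run here)
```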
Hi David,

the first steps are right, but maybe it's easier to run the Java classes via
bin/nutch:

bin/nutch freegen urls2/ freegen_segments/
# generated: freegen_segments/123
bin/nutch fetch freegen_segments/123
bin/nutch parse freegen_segments/123   (if fetcher.parse == false)
bin/nutch updatedb
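Sebastian's steps can be sketched as one script. This is a dry run: `$NUTCH` is set to `echo` so the commands are only printed, and the segment name (`freegen_segments/123`) and the crawldb path in the updatedb step are assumptions drawn from this thread, not verified output:

```shell
# Dry-run sketch of the freegen cycle from this thread.
# NUTCH echoes instead of executing; drop the echo on a real Nutch install.
NUTCH="echo bin/nutch"

$NUTCH freegen urls2/ freegen_segments/
SEGMENT=freegen_segments/123        # in reality: the segment freegen created
$NUTCH fetch "$SEGMENT"
$NUTCH parse "$SEGMENT"             # only needed if fetcher.parse == false
$NUTCH updatedb crawl/crawldb "$SEGMENT"   # crawldb path is an assumption
```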
Hi Sebastian,

Thank you for the reply. I executed the following steps; please correct me
if I am wrong - I do not see the changes updated.

Run:
- org.apache.nutch.tools.FreeGenerator *arguments*: urls2 crawl/segments
  [urls2/seed.txt - url of the page that was modified]
- org.apache.nut
Hi David,

> So even if there is any modification made on a fetched
> page before this interval and the crawl job is run, it will still not be
> re-fetched/updated unless this interval is crossed.

Yes. That's correct.

> is there any way to do immediate update?

Yes, provided that you know which documents were modified.
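Putting Sebastian's answer together with the earlier freegen steps, an immediate update of known-modified pages might look like this sketch. Again a dry run: `$NUTCH` echoes instead of executing, and the url, segment name, and crawldb path are assumptions from the thread:

```shell
# Sketch: immediately re-fetch pages you know have changed (dry run).
NUTCH="echo bin/nutch"          # remove the echo on a real install

# 1. List only the modified urls in a fresh seed dir.
mkdir -p urls2
printf 'http://www.example.com/changed-page.html\n' > urls2/seed.txt

# 2. Generate a segment just for them, then fetch/parse/updatedb.
$NUTCH freegen urls2/ freegen_segments/
$NUTCH fetch freegen_segments/123       # segment name is an assumption
$NUTCH parse freegen_segments/123
$NUTCH updatedb crawl/crawldb freegen_segments/123
```

This bypasses the fetch-interval check entirely, because freegen builds the segment directly from the seed list instead of from due CrawlDb records.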