Per-host fetch-interval
Hi,

I was wondering what the best way to configure per-host re-crawl intervals would be. The default db.fetch.interval applies to all URLs, but I'd like some hosts to be recrawled more frequently. Is there a JIRA ticket open for this? I haven't been able to find one.

Sandeep
Re: Per-host fetch-interval
Sandeep Tata wrote:
> I was wondering what would be the best way to configure per-host
> re-crawl intervals. The default db.fetch.interval applies to all URLs,
> but I'd like for some hosts to be recrawled more frequently. Is there
> a JIRA ticket open on this? I haven't been able to find one

Fetch interval can be set on individual CrawlDatum-s in crawldb, at least technically speaking. In practice, there is no command-line tool to do this, and I don't think there is a JIRA issue for it.

One idea would be to modify the Injector to accept a list of URLs with matching metadata and, among other things, to recognize a predefined metadata key such as fetchInterval. On the initial injection, all values in CrawlDatum would be set according to the metadata (or set to defaults). On subsequent injections, if a URL already exists in CrawlDb, its metadata would be reset to the values supplied in the injector file.

This should be easy to implement, and I think it would support your use case.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
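To make the injection idea above concrete, here is a minimal sketch in Python. It is not Nutch code and does not use any Nutch API: the crawldb is modeled as a plain dict, and the tab-separated `key=value` seed format, the names `parse_seed_line` and `inject`, and the default interval value are all assumptions made for illustration.

```python
# Hypothetical model: crawldb is a dict mapping URL -> {"fetchInterval": seconds}.
# The default below is a placeholder, not a value taken from any Nutch release.
DEFAULT_FETCH_INTERVAL = 30 * 24 * 3600


def parse_seed_line(line):
    """Split 'url<TAB>key=value<TAB>...' into (url, metadata dict)."""
    parts = line.strip().split("\t")
    url = parts[0]
    meta = {}
    for field in parts[1:]:
        key, sep, value = field.partition("=")
        if sep:  # ignore fields that are not key=value pairs
            meta[key.strip()] = value.strip()
    return url, meta


def inject(crawldb, seed_lines):
    """Add new URLs with defaults; reset supplied values on existing URLs."""
    for line in seed_lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        url, meta = parse_seed_line(line)
        # Initial injection: the entry starts from the default interval.
        entry = crawldb.setdefault(url, {"fetchInterval": DEFAULT_FETCH_INTERVAL})
        # Initial or subsequent injection: a supplied value overrides the stored one.
        if "fetchInterval" in meta:
            entry["fetchInterval"] = int(meta["fetchInterval"])
    return crawldb
```

In this sketch, re-injecting a URL without a fetchInterval field leaves its stored interval untouched; only values actually supplied in the seed file are reset.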
Re: Per-host fetch-interval
Thanks Andrzej. I'm planning to modify the update tool to reset the fetchInterval in the crawldb for hosts specified in a separate file.

On Wed, Jun 24, 2009 at 1:39 AM, Andrzej Bialecki wrote:
> One idea would be to modify the Injector to accept a list of URL-s with
> matching metadata, and among others use a predefined metadata like
> fetchInterval. [...]
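The per-host reset described in this message could look roughly like the following Python sketch. Again, this is an illustration only and not the Nutch update tool: the crawldb is a plain dict, the hosts file format (one `host interval_in_seconds` pair per line) is an assumption, and `load_host_intervals` and `reset_intervals` are hypothetical names.

```python
from urllib.parse import urlsplit


def load_host_intervals(lines):
    """Parse 'host interval_seconds' lines into a dict (assumed file format)."""
    intervals = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, interval = line.split()
        intervals[host.lower()] = int(interval)
    return intervals


def reset_intervals(crawldb, host_intervals):
    """Overwrite fetchInterval for every URL whose host is listed.

    crawldb is a hypothetical dict: URL -> {"fetchInterval": seconds}.
    Returns the number of entries whose interval actually changed.
    """
    changed = 0
    for url, entry in crawldb.items():
        host = (urlsplit(url).hostname or "").lower()
        new_interval = host_intervals.get(host)
        if new_interval is not None and entry.get("fetchInterval") != new_interval:
            entry["fetchInterval"] = new_interval
            changed += 1
    return changed
```

Matching here is on the exact host name, so `news.example.com` and `example.com` are treated as different hosts; matching by domain suffix would be a separate design choice.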