Per-host fetch-interval

2009-06-23 Thread Sandeep Tata
Hi,

I was wondering what would be the best way to configure per-host
re-crawl intervals. The default db.fetch.interval applies to all URLs,
but I'd like for some hosts to be recrawled more frequently. Is there
a JIRA ticket open on this? I haven't been able to find one.

Sandeep


Re: Per-host fetch-interval

2009-06-24 Thread Andrzej Bialecki

Sandeep Tata wrote:

> Hi,
>
> I was wondering what would be the best way to configure per-host
> re-crawl intervals. The default db.fetch.interval applies to all URLs,
> but I'd like for some hosts to be recrawled more frequently. Is there
> a JIRA ticket open on this? I haven't been able to find one.


Technically speaking, the fetch interval can be set on individual
CrawlDatum-s in the crawldb. In practice, there is no command-line tool
to do this, and I don't think there is a JIRA issue for it.


One idea would be to modify the Injector to accept a list of URL-s with
matching metadata, including predefined metadata keys such as
fetchInterval. On the initial injection, all values in CrawlDatum would
be set according to the metadata (or set to defaults). On subsequent
injections, if a URL already exists in CrawlDb, its metadata would be
reset to the values supplied in the injector file.


This should be easy to implement, and I think it would support your use 
case.
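
A minimal sketch of the injection logic described above, written in
Python rather than Nutch's Java and not using any Nutch classes. The
tab-separated "url<TAB>key=value" seed format, the Entry record, and the
30-day default are all assumptions made for illustration; only the
fetchInterval metadata name comes from the message above.

```python
from dataclasses import dataclass, field

# Assumed default interval in seconds; hypothetical, not read from any Nutch config.
DEFAULT_FETCH_INTERVAL = 30 * 24 * 3600


@dataclass
class Entry:
    """Stand-in for a crawldb record; this is not the real CrawlDatum class."""
    fetch_interval: int = DEFAULT_FETCH_INTERVAL
    metadata: dict = field(default_factory=dict)


def parse_seed_line(line):
    """Split 'url<TAB>key=value<TAB>key=value' into (url, metadata dict)."""
    url, *pairs = line.rstrip("\n").split("\t")
    metadata = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
    return url.strip(), metadata


def inject(crawldb, seed_lines):
    """Add new URLs with defaults; overwrite supplied values on existing URLs."""
    for line in seed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        url, metadata = parse_seed_line(line)
        # New URLs start from defaults; existing URLs keep their record.
        entry = crawldb.setdefault(url, Entry())
        if "fetchInterval" in metadata:
            entry.fetch_interval = int(metadata.pop("fetchInterval"))
        # Remaining keys overwrite same-named metadata already stored.
        entry.metadata.update(metadata)
    return crawldb
```

In this sketch a URL that is re-injected without a fetchInterval value
keeps whatever interval it already had; resetting it to the default
instead would be an equally plausible reading of the proposal.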


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Per-host fetch-interval

2009-06-24 Thread Sandeep Tata
Thanks Andrzej.
I'm planning to modify the update tool to reset the fetchInterval in the
crawldb for hosts specified in a separate file.
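
Roughly, the per-host reset could look like the sketch below. It is
plain Python, not Nutch code: the "host seconds" file format, the
function names, and the use of a simple dict mapping URL to interval in
place of the crawldb are all assumptions for illustration.

```python
from urllib.parse import urlsplit


def load_host_intervals(lines):
    """Parse 'host<whitespace>seconds' lines into a dict; '#' starts a comment."""
    intervals = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        host, seconds = line.split()
        intervals[host.lower()] = int(seconds)
    return intervals


def reset_intervals(crawldb, host_intervals):
    """Overwrite the interval of every URL whose host is listed; return how many matched."""
    matched = 0
    for url in crawldb:
        host = urlsplit(url).hostname  # lower-cased by urlsplit; None if the URL has no host
        if host in host_intervals:
            crawldb[url] = host_intervals[host]
            matched += 1
    return matched
```

Matching here is on the exact host name, so subdomains would need their
own lines in the file unless suffix matching is added.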


