Hi,

I'm prototyping a web crawler that uses Ignite as the crawl-db. I'd
like to ensure the crawler obeys the Crawl-delay set in a site's
robots.txt file. The way I have this set up now is by submitting
"candidates" to an Ignite cache. A local listener is set up to receive
successfully persisted items and submit them to a queue for a
fetcher to pull from.

Goal: Support a delay time + maximum fetch concurrency, per-host, per-item.

Put another way: "for each fetch item, ensure that requests made to the
associated host are delayed as required, and that no more than n requests
are made during each delayed run".

This could be modeled as a Map<Host,DelayQueue>, or maybe even by using a
ScheduledExecutorService where each task represents a host and is repeated
according to that host's delay time.
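
For illustration, the ScheduledExecutorService variant could look roughly
like the sketch below. It is plain Java with no Ignite involved, and every
name in it (HostThrottle, register, offer, release) is hypothetical. It
assumes hosts and items are plain strings, and that "n requests per delayed
run" means releasing at most n items to the work queue on each tick.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one repeating task per host releases at most
// maxPerRun pending items into the shared work queue on every tick.
class HostThrottle {
    private final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(1);
    // Items that are known but not yet allowed to be fetched, keyed by host.
    private final Map<String, Queue<String>> pending = new ConcurrentHashMap<>();
    // The queue the fetcher pulls from; only "ready" items end up here.
    private final BlockingQueue<String> workQueue;

    HostThrottle(BlockingQueue<String> workQueue) {
        this.workQueue = workQueue;
    }

    // Registers a host once. The first release happens after one full
    // delay (a conservative assumption), then repeats with the same delay.
    void register(String host, long delayMillis, int maxPerRun) {
        if (pending.putIfAbsent(host, new ConcurrentLinkedQueue<>()) != null) {
            return; // host already has a repeating task
        }
        scheduler.scheduleWithFixedDelay(
                () -> release(host, maxPerRun),
                delayMillis, delayMillis, TimeUnit.MILLISECONDS);
    }

    // Parks an item until its host's next tick; false if the host is unknown.
    boolean offer(String host, String url) {
        Queue<String> q = pending.get(host);
        return q != null && q.offer(url);
    }

    // Moves up to maxPerRun items for the host into the work queue and
    // returns how many were moved.
    int release(String host, int maxPerRun) {
        Queue<String> q = pending.get(host);
        if (q == null) {
            return 0;
        }
        int released = 0;
        String url;
        while (released < maxPerRun && (url = q.poll()) != null) {
            workQueue.add(url);
            released++;
        }
        return released;
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

With this shape, items never reach the work queue before their host's tick,
but it still keeps one map entry and one scheduled task per host in memory.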

I'd like to prevent items from being put into the Java work queue if they
are not yet ready to be fetched, and I'm slightly worried about the
potential number of hosts (and therefore the size of the Java
Map<Host,...> data structure).

So my question is: is there something that Ignite can provide for making
this all work?

- Matt



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
