On 26/12/2018 18:30, Richard Damon wrote:
On 12/26/18 10:35 AM, Simon Connah wrote:
Hi,

I want to build a simple web crawler. I know how I am going to do it
but I have one problem.

Obviously I don't want to negatively impact any of the websites that I
am crawling so I want to implement some form of rate limiting of HTTP
requests to specific domain names.

What I'd like is some form of timer which calls a piece of code say
every 5 seconds or something and that code is what goes off and crawls
the website.

I'm just not sure of the best way to call code based on a timer.

Could anyone offer some advice on the best way to do this? It will be
running on Linux and using the python-daemon library to run it as a
service and will be using at least Python 3.6.

Thanks for any help.
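
For reference, a minimal sketch of the "call a piece of code every 5 seconds" idea, using only the standard library. It assumes a single-threaded daemon loop; the names run_every, task, iterations, sleep and clock are illustrative and not from the original message. The deadline is advanced by a fixed interval rather than sleeping a fixed amount, so the time spent inside the task does not accumulate as drift.

```python
import time


def run_every(interval, task, iterations=None,
              sleep=time.sleep, clock=time.monotonic):
    """Call task() roughly every `interval` seconds.

    `iterations=None` means run forever. `sleep` and `clock` are
    injectable so the loop can be exercised without real waiting.
    """
    next_run = clock()
    count = 0
    while iterations is None or count < iterations:
        task()
        count += 1
        # Schedule relative to the previous deadline, not to "now",
        # so a slow task does not push every later run back.
        next_run += interval
        delay = next_run - clock()
        if delay > 0:
            sleep(delay)
        else:
            # The task overran its slot; restart the schedule from now.
            next_run = clock()
```

In a daemon this might be started as run_every(5, crawl_once), where crawl_once is whatever function performs one crawl step (a hypothetical name).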

One big piece of information that would help in replies would be an
indication of scale. Is your application crawling just a few sites, so
that you need to pause between accesses to keep the hit rate down, or
are you calling a number of sites, so that if you are going to delay
crawling a page from one site, you can go off and crawl another in the
meantime?
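
The second case, crawling another site while one site is still in its delay window, can be sketched with a priority queue keyed on the earliest time each host may be hit again. This is only an illustration under stated assumptions: crawl_politely, fetch, delay, sleep and clock are hypothetical names, URLs are assumed to be absolute and well formed, and everything runs in one thread.

```python
import heapq
import time
from collections import deque
from urllib.parse import urlsplit


def crawl_politely(urls, fetch, delay=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Fetch every URL, keeping requests to one host `delay` seconds apart.

    While one host is waiting out its delay, URLs for other hosts
    are fetched instead of sleeping.
    """
    # Group the pending URLs by host name.
    queues = {}
    for url in urls:
        host = urlsplit(url).hostname
        queues.setdefault(host, deque()).append(url)

    # Min-heap of (earliest allowed fetch time, host).
    start = clock()
    ready = [(start, host) for host in queues]
    heapq.heapify(ready)

    while ready:
        when, host = heapq.heappop(ready)
        wait = when - clock()
        if wait > 0:
            # Every remaining host is still inside its delay window.
            sleep(wait)
        fetch(queues[host].popleft())
        if queues[host]:
            # More URLs for this host: allow it again after `delay`.
            heapq.heappush(ready, (clock() + delay, host))
```

With only two or three domains the loop will spend most of its time sleeping; with many domains the heap keeps the crawler busy on whichever host is next allowed.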


Sorry. I should have stated that.

This is for a minimum viable product, so crawling, say, two or three domain names would be enough to start with, but I'd want to grow in the future.

I'm building this on AWS, and my idea was to have each web crawler instance query a database (DynamoDB), get, say, 10 URLs, and recrawl them if they hadn't been crawled in the previous 12 to 24 hours. If they have been crawled in the last 12 to 24 hours, then skip that URL. Once a URL has been crawled, I would then save the crawl date and time in the database.
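
The recrawl check itself is independent of the database and can be sketched as a pure function. The record layout below (dicts with "url" and "last_crawled" keys), the 12-hour threshold, and the names is_due and select_due are assumptions for illustration; the DynamoDB query that would produce the records is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold; the message mentions "12 to 24 hours".
RECRAWL_AFTER = timedelta(hours=12)


def is_due(last_crawled, now=None, recrawl_after=RECRAWL_AFTER):
    """Return True if a URL was never crawled or its last crawl is too old."""
    if last_crawled is None:
        return True
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_crawled >= recrawl_after


def select_due(records, now=None, limit=10):
    """Pick up to `limit` URLs that are due for a crawl.

    `records` is an iterable of dicts with a "url" key and an optional
    timezone-aware "last_crawled" datetime (a hypothetical layout).
    """
    due = [r["url"] for r in records if is_due(r.get("last_crawled"), now)]
    return due[:limit]
```

After a successful crawl, the stored timestamp for that URL would be set to the current UTC time so the next pass skips it.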

Doing it that way, I could skip the whole timing thing on the daemon end and just use database queries to control whether a URL is crawled or not. Of course, that would mean that one web crawler would have to "lock" a domain name so that multiple instances do not query the same domain name in parallel, which would be bad.
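
The "lock a domain name" idea is commonly done as a lease: a record naming the owner plus an expiry time, so a crashed crawler cannot hold a domain forever. The sketch below is an in-memory stand-in, not DynamoDB code; DomainLeases, acquire, release, owner and ttl are hypothetical names. With a shared database, the check-and-set inside acquire would need to be a single atomic operation, for example a conditional write.

```python
import threading
import time


class DomainLeases:
    """In-memory stand-in for a shared per-domain lock table."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._leases = {}  # domain -> (owner, expires_at)
        self._mutex = threading.Lock()

    def acquire(self, domain, owner, ttl=60.0):
        """Try to lease `domain` for `ttl` seconds; return True on success."""
        now = self._clock()
        with self._mutex:
            holder = self._leases.get(domain)
            # Refuse only if someone else holds an unexpired lease.
            if holder is not None and holder[1] > now and holder[0] != owner:
                return False
            self._leases[domain] = (owner, now + ttl)
            return True

    def release(self, domain, owner):
        """Drop the lease, but only if `owner` is the current holder."""
        with self._mutex:
            holder = self._leases.get(domain)
            if holder is not None and holder[0] == owner:
                del self._leases[domain]
```

A crawler instance would call acquire before fetching anything from a domain, skip the domain if that returns False, and call release when it is done.
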
--
https://mail.python.org/mailman/listinfo/python-list
