On 12/26/18 10:35 AM, Simon Connah wrote:
> Hi,
>
> I want to build a simple web crawler. I know how I am going to do it,
> but I have one problem.
>
> Obviously I don't want to negatively impact any of the websites that I
> am crawling, so I want to implement some form of rate limiting of HTTP
> requests to specific domain names.
>
> What I'd like is some form of timer which calls a piece of code, say,
> every 5 seconds, and that code is what goes off and crawls the website.
>
> I'm just not sure of the best way to call code based on a timer.
>
> Could anyone offer some advice on the best way to do this? It will be
> running on Linux, using the python-daemon library to run it as a
> service, and will be using at least Python 3.6.
>
> Thanks for any help.
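For the literal "run a piece of code every 5 seconds" part of the question, one standard-library option is a re-arming threading.Timer. This is only a minimal sketch; the callback passed in stands for whatever crawl function you actually write:

```python
import threading

def run_periodically(func, interval, stop_event):
    """Call func() immediately, then every `interval` seconds,
    until stop_event is set. Uses a re-armed one-shot Timer."""
    def _tick():
        if stop_event.is_set():
            return
        func()
        timer = threading.Timer(interval, _tick)
        timer.daemon = True  # don't keep the interpreter alive on exit
        timer.start()
    _tick()
```

Usage would look like `stop = threading.Event(); run_periodically(crawl_site, 5.0, stop)`, with `stop.set()` to shut it down cleanly; a plain `while` loop with `time.sleep(5)` works just as well if the crawler has nothing else to do between runs.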
One big piece of information that would help in replies would be an indication of scale. Is your application crawling just a few sites, so that you need to pause between accesses to keep the hit rate down, or are you crawling a number of sites, so that if you are going to delay crawling a page from one site, you can go off and crawl another in the meantime?

-- 
Richard Damon
-- 
https://mail.python.org/mailman/listinfo/python-list
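In the second case (many sites), per-domain rate limiting matters more than a global timer: each domain should see at most one hit every N seconds, while different domains can be fetched freely. A minimal sketch of that idea, using only the standard library (the class name and the 5-second default are assumptions, not anything from the thread):

```python
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_interval=5.0):
        self.min_interval = min_interval  # seconds between hits per domain
        self.last_hit = {}                # domain -> monotonic timestamp

    def wait(self, url):
        """Block until it is polite to fetch `url`, then record the hit."""
        domain = urlparse(url).netloc
        now = time.monotonic()
        earliest = self.last_hit.get(domain, 0.0) + self.min_interval
        if now < earliest:
            time.sleep(earliest - now)
        self.last_hit[domain] = time.monotonic()
```

The crawl loop then becomes `limiter.wait(url)` followed by the actual fetch; requests to domains not seen recently go through immediately, and only same-domain requests are delayed. `time.monotonic()` is used instead of `time.time()` so the delays are unaffected by system clock changes.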