On Wed, Apr 03, 2013 at 10:34:17AM -0400, David Larochelle wrote:
> I'm trying to optimize a database-driven web crawler and I was
> wondering if anyone could offer any recommendations for interprocess
> communication.
>
> Currently, the driver process periodically queries a database to get a
> list of URLs to crawl. It then stores these URLs to be downloaded in a
> complex in-memory structure and pipes them to separate processes that
> do the actual downloading. The problem is that the database queries are
> slow and block the driver process.
>
> I'd like to rearchitect the system so that the database queries occur
> in the background. However, I'm unsure what mechanism to use. In
> theory, threads would be ideally suited to this case, but I'm not sure
> how stable they are (I'm running Perl 5.14.1 with 200+ CPAN modules).
>
> Does anyone have any recommendations?
Here's my 5 cents worth [1].

I would imagine that the database query, though slow, produces results
faster than the crawlers consume them. So the problem sounds like one
about maintaining a minimum queue size. An actual queue would be the
obvious solution, but that brings in threads, which you want to avoid.

I can think of a couple of other solutions; since you need parallelism,
both depend on moving the database querying to another process. The data
transfer from the db (producer) process to the driver (consumer) process
could use files or a socketpair between parent and child. Rough sketches
of both follow in the postscripts.

File solution:

The producer program reads the database and writes the data into files
in chunks of size N. The consumer program, when it runs low, reads one
of the files and renames it. To simplify access, I would use the maildir
model with three directories:

  .../tmp  .../ready  .../done

The producer writes each file into tmp and uses rename to move the file
from tmp to ready. Luckily, perl's rename is a wrapper around
rename(2) [2], so you gain the critical benefit of atomicity here.
Similarly, the consumer reads a file in ready and renames it into done.
Periodically, you'll want to clean out done.

SocketPair:

Make the producer a parent process of the consumer. The producer creates
a socketpair and forks the consumer. The producer prefetches the data
and writes it to the socket when the consumer signals a need for more
data. Both would need to use select; otherwise one or the other could
block, defeating the goal of the exercise.

I should point out that, technically, the processes don't have to be
related: you really only need sockets. The producer only needs to
prefetch and buffer data and provide it as soon as possible. A simple
server can do that in a tight loop all day long (I don't recommend it,
though, unless you have a feedback mechanism into the database to ensure
that it always gets the correct results even in the face of an earlier
error).

[1] Inflation.

[2] I am making the big assumption that you are on a unixish system
here. If that's true, then you can add notifications as well, if you
care.

-Gyepi
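P.S. To make the file solution concrete, here's a minimal sketch of both
halves in Perl. The spool path, the chunk size of 500, and
get_next_urls() are placeholders of mine, not anything from your setup;
get_next_urls() stands in for the slow database query. Note the consumer
renames *before* reading, so the rename doubles as a claim on the file.

  # producer.pl -- write URL batches into tmp/, rename(2) them into ready/
  use strict;
  use warnings;
  use File::Path qw(make_path);

  my $spool = '/var/spool/crawler';           # placeholder path
  make_path(map { "$spool/$_" } qw(tmp ready done));

  my $seq = 0;
  while (my @urls = get_next_urls(500)) {     # the slow query, now isolated
      my $name = join '.', time, $$, $seq++;  # unique name per chunk
      open my $fh, '>', "$spool/tmp/$name" or die "open: $!";
      print {$fh} "$_\n" for @urls;
      close $fh or die "close: $!";
      # rename(2) is atomic within one filesystem, so the consumer
      # never sees a half-written file.
      rename "$spool/tmp/$name", "$spool/ready/$name" or die "rename: $!";
  }

  sub get_next_urls { return () }   # placeholder: a DBI query would go here

  # In the driver: when the in-memory list runs low, claim a file from
  # ready/ by renaming it into done/, then read it.
  sub refill {
      my $spool = '/var/spool/crawler';
      opendir my $dh, "$spool/ready" or die "opendir: $!";
      my @files = grep { !/^\./ } readdir $dh;
      closedir $dh;
      for my $name (@files) {
          rename "$spool/ready/$name", "$spool/done/$name" or next;
          open my $fh, '<', "$spool/done/$name" or die "open: $!";
          chomp(my @urls = <$fh>);
          close $fh;
          return @urls;
      }
      return;    # nothing ready yet; try again later
  }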
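P.P.S. And a bare-bones sketch of the socketpair approach. The
MORE/./END line protocol is my own invention for the sketch, and
get_next_urls() is the same placeholder as above. For brevity this
version just blocks on reads; as noted, a real one should multiplex with
select (e.g. the core IO::Select module) so neither side can stall the
other.

  use strict;
  use warnings;
  use Socket;
  use IO::Handle;

  # One full-duplex link between the two processes.
  socketpair(my $csock, my $psock, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
      or die "socketpair: $!";
  $_->autoflush(1) for $csock, $psock;

  my $pid = fork() // die "fork: $!";

  if ($pid == 0) {                        # child: the driver/consumer
      close $csock;
      print {$psock} "MORE\n";            # signal a need for data
      my @queue;
      while (defined(my $line = <$psock>)) {
          chomp $line;
          last if $line eq 'END';         # producer has run dry
          if ($line eq '.') {             # end of one batch
              # ... hand @queue to the downloader processes here ...
              @queue = ();
              print {$psock} "MORE\n";    # ask for the next batch
              next;
          }
          push @queue, $line;
      }
      exit 0;
  }

  close $psock;                           # parent: the producer
  while (defined(my $req = <$csock>)) {   # wait for a request
      my @urls = get_next_urls(100);      # prefetched batch
      if (@urls) {
          print {$csock} "$_\n" for @urls;
          print {$csock} ".\n";           # batch terminator
      } else {
          print {$csock} "END\n";
          last;
      }
  }
  waitpid $pid, 0;

  sub get_next_urls { return () }         # placeholder, as above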