On Wed, Apr 03, 2013 at 10:34:17AM -0400, David Larochelle wrote:
> I'm trying to optimize a database driven web crawler and I was wondering if
> anyone could offer any recommendations for interprocess communications.
> 
> Currently, the driver process periodically queries a database to get a
> list of URLs to crawl. It then stores these URLs to be downloaded in a
> complex in-memory structure and pipes them to separate processes that do
> the actual downloading. The problem is that the database queries are slow
> and block the driver process.
> 
> I'd like to rearchitect the system so that the database queries occur in
> the background. However, I'm unsure what mechanism to use. In theory,
> threads would be ideally suited to this case, but I'm not sure how stable
> they are (I'm running Perl 5.14.1 with 200+ CPAN modules).
> 
> Does anyone have any recommendations?

Here's my 5 cents' worth [1].

I would imagine that the database query, though slow, produces results
faster than the crawlers consume them.
So the problem sounds like one about maintaining a minimum queue size. 

An actual queue (e.g. Thread::Queue) would be the obvious solution, but that
brings in threads, which you want to avoid.

I can think of a couple of other solutions:

Since you need parallelism, both depend on moving the database querying to
another process.

The data transfer from the database (producer) process to the driver
(consumer) process could use files, or a socketpair between parent and child.

File solution:

The producer program reads the database and writes the data
into files in chunks of size N. The consumer program, when it runs
low, reads one of the files and renames it.

To simplify access, I would follow the maildir model, with three directories:

    .../tmp
    .../ready
    .../done

The producer writes each file into tmp and uses rename to move the file
from tmp to ready. Luckily, Perl's rename is a wrapper around rename(2)[2],
so you gain the critical benefit of atomicity here.

Similarly, the consumer reads a file in ready and
renames it into done. Periodically, you'll want to clean out done.
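
To make the handoff concrete, here's a minimal sketch of both sides in
Perl. The spool path, chunk size, and file naming scheme are all my own
assumptions; adjust to taste. One deliberate tweak: the consumer renames
a file into done before reading it, so the rename doubles as a claim and
a restart (or a second consumer) can never process the same chunk twice.

    use strict;
    use warnings;
    use File::Spec;

    my $spool = '/var/spool/crawler';    # assumed layout: tmp/ ready/ done/
    my $seq   = 0;

    # Producer side: write a chunk into tmp, then rename it into ready.
    # rename(2) is atomic within a filesystem, so the consumer never sees
    # a half-written file.
    sub write_chunk {
        my @urls  = @_;
        my $name  = join '.', time(), $$, ++$seq;    # unique per process
        my $tmp   = File::Spec->catfile($spool, 'tmp',   $name);
        my $ready = File::Spec->catfile($spool, 'ready', $name);
        open my $fh, '>', $tmp or die "open $tmp: $!";
        print {$fh} "$_\n" for @urls;
        close $fh or die "close $tmp: $!";
        rename $tmp, $ready or die "rename $tmp: $!";
    }

    # Consumer side: claim a file by renaming it into done *first*,
    # then read it at leisure.
    sub next_chunk {
        for my $ready (glob File::Spec->catfile($spool, 'ready', '*')) {
            (my $done = $ready) =~ s{/ready/}{/done/};
            next unless rename $ready, $done;    # lost a race; try the next
            open my $fh, '<', $done or die "open $done: $!";
            chomp(my @urls = <$fh>);
            close $fh;
            return @urls;
        }
        return;    # ready/ is empty
    }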

SocketPair:

Make the producer the parent process of the consumer. The producer creates a
socketpair and forks the consumer. The producer prefetches the data and writes
it to the socket when the consumer signals a need for more data. Both would
need to use select; otherwise one or the other could block, defeating the
goal of the exercise.
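
Here's a minimal sketch of that arrangement, using only core modules.
fetch_next_urls() is a hypothetical stand-in for your database query, and
the one-URL-per-request protocol is just the simplest thing that shows the
shape; an empty reply line means "nothing ready yet".

    use strict;
    use warnings;
    use Socket;
    use IO::Handle;
    use IO::Select;

    # Hypothetical stand-in for the slow database query.
    sub fetch_next_urls { return map { "http://example.com/page/$_" } 1 .. 10 }

    socketpair(my $child, my $parent, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair: $!";
    $child->autoflush(1);
    $parent->autoflush(1);

    my $pid = fork() // die "fork: $!";

    if ($pid == 0) {                    # consumer (the driver)
        close $child;
        while (1) {
            print {$parent} "MORE\n";   # signal a need for data
            my $url = <$parent>;
            last unless defined $url;   # producer closed its end
            chomp $url;
            unless (length $url) {      # empty line: nothing ready yet
                sleep 1;
                next;
            }
            # ... hand $url to a downloader process ...
        }
        exit 0;
    }

    # Producer: prefetch from the database and answer requests. select
    # (via IO::Select) is what keeps a quiet consumer from blocking us;
    # the idle time goes to refilling the buffer.
    close $parent;
    my $sel = IO::Select->new($child);
    my @buffer;
    while (1) {
        push @buffer, fetch_next_urls() if @buffer < 100;   # prefetch
        next unless $sel->can_read(0.1);                    # poll for a request
        my $req = <$child>;
        last unless defined $req;                           # consumer exited
        print {$child} (@buffer ? shift @buffer : ''), "\n";
    }
    waitpid $pid, 0;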

I should point out that, technically, the processes don't have to be related;
you really only need sockets. The producer only needs to prefetch and buffer
the data and provide it as soon as possible. A simple server can do that in a
tight loop all day long (though I don't recommend it unless you have a
feedback mechanism into the database to ensure that it always gets the
correct results, even in the face of an earlier error).
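
The socket-only variant is the same idea with the socketpair swapped for an
ordinary TCP listener. A minimal sketch; the port is an arbitrary choice,
and this deliberately omits the error-feedback mechanism just mentioned:

    use strict;
    use warnings;
    use IO::Socket::INET;

    # Hypothetical stand-in for the slow database query.
    sub fetch_next_urls { return map { "http://example.com/page/$_" } 1 .. 10 }

    my $server = IO::Socket::INET->new(
        LocalPort => 7000,    # arbitrary choice
        Listen    => 5,
        ReuseAddr => 1,
    ) or die "listen: $!";

    my @buffer;
    # One consumer at a time; the prefetch buffer survives reconnects.
    while (my $client = $server->accept) {
        while (my $req = <$client>) {
            push @buffer, fetch_next_urls() if @buffer < 100;
            # As before, an empty line means "nothing yet; ask again".
            print {$client} (@buffer ? shift @buffer : ''), "\n";
        }
        close $client;
    }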

[1] Inflation.
[2] I am making the big assumption that you are on a unixish system here.
If that's true, then you can add filesystem notifications (e.g. inotify) as
well, if you care.

-Gyepi
