Another option that I've used in similar situations:

1.  have a process hit the database and generate a storable of the data
2.  have multiple crawlers execute and "unfreeze" the storable into memory
3.  do what you need to do with the data, pushing back to the database when
necessary.

Instead of forking, I run simultaneous instances of the scripts.  They all
pull in the storable data, which lives on the file system (so no call
back to the database is required).  Periodically, my storable generator
kicks off and makes a new storable (which could take a couple of minutes
if necessary), and the crawler scripts have a routine that checks the
date/time of the current storable.  If it's older than 10 minutes, they
fetch the new file from the server; if a newer file isn't available, they
hit the database directly.
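
In rough terms, the two sides look something like this (a minimal
sketch: the path, the 10-minute cutoff, and get_urls_from_db() are
placeholders, and it assumes the crawlers can see the generator's file
system):

    use strict;
    use warnings;
    use Storable qw(nstore retrieve);

    my $file = '/var/spool/crawler/urls.stor';    # made-up path

    # Generator: dump the query results to a temp file, then rename it
    # so readers never see a half-written storable.
    sub regenerate {
        my $urls = get_urls_from_db();    # placeholder for the slow query
        nstore($urls, "$file.tmp");
        rename "$file.tmp", $file or die "rename: $!";
    }

    # Crawler: "unfreeze" the storable if it's fresh enough, otherwise
    # fall back to hitting the database directly.
    sub get_urls {
        if (-e $file && time - (stat _)[9] < 10 * 60) {
            return retrieve($file);
        }
        return get_urls_from_db();
    }

    sub get_urls_from_db { ... }    # your DBI code goes here

As a bonus, nstore writes in network byte order, which matters if the
crawlers don't share the generator's architecture.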

It's not the most elegant solution, but it's a fairly easy one to
implement (and it avoids the treacherous pitfalls of thread maintenance).



On Wed, Apr 3, 2013 at 1:30 PM, Gyepi SAM <gy...@praxis-sw.com> wrote:

>
> On Wed, Apr 03, 2013 at 10:34:17AM -0400, David Larochelle wrote:
> > I'm trying to optimize a database-driven web crawler, and I was
> > wondering if anyone could offer any recommendations for interprocess
> > communication.
> >
> > Currently, the driver process periodically queries a database to get a
> > list of URLs to crawl. It then stores these URLs to be downloaded in a
> > complex in-memory structure and pipes them to separate processes that
> > do the actual downloading. The problem is that the database queries are
> > slow and block the driver process.
> >
> > I'd like to re-architect the system so that the database queries occur
> > in the background. However, I'm unsure what mechanism to use. In
> > theory, threads would be ideally suited to this case, but I'm not sure
> > how stable they are (I'm running Perl 5.14.1 with 200+ CPAN modules).
> >
> > Does anyone have any recommendations?
>
> Here's my 5 cents worth [1].
>
> I would imagine that the database query, though slow, produces results
> faster than the crawlers consume them.
> So the problem sounds like one about maintaining a minimum queue size.
>
> An actual queue would be the obvious solution, but that brings in Threads,
> which you want to avoid.
>
> I can think of a couple of other solutions:
>
> Since you need parallelism, both depend on moving the database querying to
> another process.
>
> The data transfer from the db (producer) process to the driver
> (consumer) process could use files or a socketpair between parent and
> child.
>
> File solution:
>
> The producer program reads the database and writes the data
> into files in chunks of size N. The consumer program, when it runs
> low, reads one of the files and renames it.
>
> To simplify access, I would use the maildir model, with three
> directories:
>
>     .../tmp
>     .../ready
>     .../done
>
> The producer writes each file into tmp and uses rename to move the file
> from tmp to ready. Luckily, Perl's rename is a wrapper around
> rename(2)[2], so you gain the critical benefit of atomicity here.
>
> Similarly, the consumer reads a file in ready and
> renames it into done. Periodically, you'll want to clean out done.
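>
> A minimal sketch of both sides (the base directory, the chunk naming,
> and the source of URLs are placeholders here, not anything prescribed):
>
>     use strict;
>     use warnings;
>
>     my $base = '/path/to/spool';    # the ... base directory above
>
>     # Producer: write a chunk of N URLs into tmp, then rename(2) it
>     # into ready so consumers only ever see complete files.
>     my $seq = 0;
>     sub write_chunk {
>         my ($urls) = @_;            # arrayref of URLs from the db query
>         my $name = sprintf 'chunk-%d-%d', $$, $seq++;
>         open my $fh, '>', "$base/tmp/$name" or die "open: $!";
>         print {$fh} "$_\n" for @$urls;
>         close $fh or die "close: $!";
>         rename "$base/tmp/$name", "$base/ready/$name"
>             or die "rename: $!";
>     }
>
>     # Consumer: claim a file by renaming it into done, then read it.
>     # Renaming first means two consumers can never grab the same chunk;
>     # the loser of the rename race just tries the next file.
>     sub claim_chunk {
>         opendir my $dh, "$base/ready" or die "opendir: $!";
>         for my $name (grep { !/^\./ } readdir $dh) {
>             next unless rename "$base/ready/$name", "$base/done/$name";
>             open my $fh, '<', "$base/done/$name" or die "open: $!";
>             chomp(my @urls = <$fh>);
>             return \@urls;
>         }
>         return;                     # nothing ready yet
>     }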
>
> SocketPair:
>
> Make the producer the parent process of the consumer. The producer
> creates a socketpair and forks the consumer. The producer prefetches the
> data and writes it to the socket when the consumer signals a need for
> more data. Both would need to use select, otherwise one or the other
> could block, defeating the goal of the exercise.
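>
> A rough sketch of that shape (fetch_more_urls() stands in for the slow
> query, and the buffer size and poll interval are arbitrary):
>
>     use strict;
>     use warnings;
>     use Socket;
>     use IO::Handle;
>     use IO::Select;
>
>     socketpair(my $parent_sock, my $child_sock,
>                AF_UNIX, SOCK_STREAM, PF_UNSPEC)
>         or die "socketpair: $!";
>     $_->autoflush(1) for $parent_sock, $child_sock;
>
>     sub fetch_more_urls { ... }     # placeholder: the slow db query
>
>     my $pid = fork // die "fork: $!";
>     if ($pid) {                     # producer (parent)
>         close $child_sock;
>         my $sel = IO::Select->new($parent_sock);
>         my @buffer;
>         while (1) {
>             push @buffer, fetch_more_urls() if @buffer < 100;
>             # Only take a request when we can answer it, and use select
>             # so a slow or dead child never blocks the producer.
>             next unless @buffer && $sel->can_read(0.25);
>             last unless sysread $parent_sock, my $req, 1;
>             print {$parent_sock} shift(@buffer), "\n";
>         }
>         waitpid $pid, 0;
>     } else {                        # consumer (child)
>         close $parent_sock;
>         while (1) {
>             syswrite $child_sock, 'm' or last;  # ask for one more URL
>             my $url = <$child_sock> // last;
>             chomp $url;
>             # ... hand $url to a downloader ...
>         }
>         exit 0;
>     }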
>
> I should point out that, technically, the processes don't have to be
> related; you really only need sockets. The producer only needs to
> prefetch and buffer data and provide it as soon as possible. A simple
> server can do that in a tight loop all day long (though I don't
> recommend it unless you have a feedback mechanism into the database to
> ensure that it always gets the correct results, even in the face of an
> earlier error).
>
> [1] Inflation.
> [2] I am making the big assumption that you are on a unixish system here.
> If that's true, then you can add notifications as well, if you care.
>
> -Gyepi
>

_______________________________________________
Boston-pm mailing list
Boston-pm@mail.pm.org
http://mail.pm.org/mailman/listinfo/boston-pm
