On Apr 3, 2013, at 10:34 AM, David Larochelle da...@larochelle.name wrote:
Currently, the driver process periodically queries a database to get a
list of URLs to crawl. It then stores these URLs in a complex in-memory
structure and pipes them to separate processes that do the actual
downloading.
Thanks for all the feedback. I left out a lot of details about the system
because I didn't want to complicate things.
The purpose of the system is to comprehensively study online media. We need
the system to run 24 hours a day to download news articles from media sources
such as the New York Times. We
This sounds like a perfect fit for a queuing service like RabbitMQ.
Logstash uses Redis lists for this as they're simple to set up and pretty
reliable, but there are many such applications available. The queues
would allow multiple backend processes to check for and take items as they
became available.
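The handoff that pattern gives you can be sketched with Python's standard-library `queue` standing in for a broker like RabbitMQ or a Redis list (the worker count and URLs below are purely illustrative):

```python
import queue
import threading

# Bounded queue standing in for RabbitMQ / a Redis list.
url_queue = queue.Queue(maxsize=100)

downloaded = []
lock = threading.Lock()

def fetcher(worker_id):
    # Each backend worker blocks until an item becomes available,
    # takes it, and records the "download".
    while True:
        url = url_queue.get()
        if url is None:  # sentinel: no more work
            url_queue.task_done()
            break
        with lock:
            downloaded.append((worker_id, url))
        url_queue.task_done()

workers = [threading.Thread(target=fetcher, args=(i,)) for i in range(4)]
for w in workers:
    w.start()

# Driver side: enqueue URLs, then one sentinel per worker.
for n in range(20):
    url_queue.put(f"http://example.com/article/{n}")
for _ in workers:
    url_queue.put(None)

for w in workers:
    w.join()

print(len(downloaded))  # 20
```

With a real broker the only change on the driver side is swapping `put` for a publish call; the fetchers keep the same blocking-consume loop.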
On Thu, Apr 04, 2013 at 04:21:54PM -0400, David Larochelle wrote:
My hope is to split the engine process into two pieces that run in
parallel: one to query the database and another to send downloads to the
fetchers. This way it won't matter how long the DB query takes, as long as
we can get URLs
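That split might look like the following sketch: a polling stage feeds a bounded buffer while a dispatch stage drains it concurrently, so a slow query only stalls its own stage. Here `get_pending_urls` is a hypothetical stand-in for the real database query, with `time.sleep` simulating its latency:

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=1000)  # decouples the two stages

def get_pending_urls(batch):
    # Hypothetical stand-in for the (possibly slow) database query.
    time.sleep(0.05)  # simulated query latency
    return [f"http://example.com/{batch}/{i}" for i in range(10)]

sent = []

def db_poller(batches):
    # Stage 1: query the database and buffer the results.
    for b in range(batches):
        for url in get_pending_urls(b):
            buffer.put(url)  # blocks only if the buffer is full
    buffer.put(None)         # signal end of work

def dispatcher():
    # Stage 2: hand buffered URLs to fetchers as fast as they arrive.
    while True:
        url = buffer.get()
        if url is None:
            break
        sent.append(url)

poller = threading.Thread(target=db_poller, args=(3,))
disp = threading.Thread(target=dispatcher)
poller.start(); disp.start()
poller.join(); disp.join()
print(len(sent))  # 30
```

The bounded `maxsize` also gives you backpressure for free: if the fetchers fall behind, the poller blocks instead of ballooning memory.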
David Larochelle wrote:
[...]
We're using PostgreSQL 8.4 and running on Ubuntu. Almost all data is
stored in the database. The system contains a list of media sources with
associated RSS feeds. We have a downloads table that has all of the URLs
that we want to download or have downloaded in
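A minimal sketch of what such a downloads table and the driver's periodic poll could look like, using Python's `sqlite3` as a stand-in for PostgreSQL 8.4 (the column names and the `state` values are assumptions, not the actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE downloads (
        downloads_id INTEGER PRIMARY KEY,
        url          TEXT NOT NULL,
        state        TEXT NOT NULL DEFAULT 'pending'  -- assumed: 'pending' or 'success'
    )
""")
conn.executemany(
    "INSERT INTO downloads (url, state) VALUES (?, ?)",
    [("http://example.com/a", "pending"),
     ("http://example.com/b", "success"),
     ("http://example.com/c", "pending")],
)

# Driver's periodic poll: grab a batch of URLs still waiting to be fetched.
pending = conn.execute(
    "SELECT downloads_id, url FROM downloads WHERE state = 'pending' LIMIT 100"
).fetchall()
print(pending)  # [(1, 'http://example.com/a'), (3, 'http://example.com/c')]
```

In the real system this poll would run against Postgres and the resulting batch would be pushed onto the queue rather than held in an in-memory structure.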