Thanks for all the feedback. I left out a lot of details about the system because I didn't want to complicate things.
The purpose of the system is to comprehensively study online media. We need the system to run 24 hours a day to download news articles from media sources such as the New York Times. We don't need to download articles instantly, but we do need to make sure we don't miss articles and that we're able to download articles before they're taken down. We're using PostgreSQL 8.4 and running on Ubuntu, and almost all data is stored in the database.

The system contains a list of media sources with associated RSS feeds. We have a downloads table that holds all of the URLs that we want to download or have downloaded in the past; it currently has ~200 million rows. We add downloads for the RSS feed of each source to the downloads table a few times a day. When these RSS feeds are downloaded, they are analyzed for new stories, and we add a download entry for each new story that we encounter. An indexed enum field in the database tells us whether a download has already been downloaded, needs to be downloaded, is queued in memory, or is currently being downloaded. We have ~1.5 million URLs that need to be downloaded at any one time.

We have an engine process that forks off a bunch of fetcher processes. The engine queries the database for downloads and stores them in an in-memory queue. It works through that queue and pipes the primary keys of downloads to the fetcher processes, which download and process them. When the in-memory queue drops below a certain point, the engine queries the database for more downloads. (There's a stripped-down sketch of this arrangement below, before the quoted message.) This system worked fine for many years.

However, we've recently increased the number of media sources and are running into problems. The problem is that the engine is a single process, and while it's querying the database, no downloads can be passed to fetchers. We spent about a day trying to optimize the database queries, but they're still slower than we would like. The exact reasons why the db queries are slow are complicated; part of the problem is that some of the logic that prevents us from excessively downloading from any one site is encapsulated in SQL queries.

My hope is to split the engine process into two pieces that run in parallel: one to query the database and another to send downloads to fetchers. That way it won't matter how long the db query takes, as long as we can get URLs from the database faster than we can download them. (The second sketch below shows roughly what I have in mind.)

There's a case to be made for switching this system to a more distributed architecture. However, it's a large code base and we're very tied to PostgreSQL; the code that consumes the downloads also uses PostgreSQL. We've had some performance issues aggregating the data, for which we've been considering NoSQL solutions, but that would be a major rewrite and a version of the crawler needs to be available constantly.
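To make the current setup concrete, here's a stripped-down sketch of the engine and fetchers. The table and column names (downloads, downloads_id, state), the 'pending' state value, the database name, and fetch_download() are placeholders rather than our actual schema or code, and the real refill query also enforces the per-site download limits:

#!/usr/bin/env perl
# Sketch of the current engine: one process queries the database into an
# in-memory queue and pipes primary keys to forked fetchers.
use strict;
use warnings;
use DBI;
use IO::Handle;

my $NUM_FETCHERS = 8;
my @fetcher_pipes;

# Fork the fetchers; each one reads downloads_id values from its own pipe.
for ( 1 .. $NUM_FETCHERS )
{
    pipe( my $read, my $write ) or die $!;
    my $pid = fork();
    die $! unless defined $pid;
    if ( $pid == 0 )
    {
        close $write;
        while ( my $downloads_id = <$read> )
        {
            chomp $downloads_id;
            fetch_download( $downloads_id );    # placeholder for the real fetch + processing
        }
        exit 0;
    }
    close $read;
    $write->autoflush( 1 );
    push @fetcher_pipes, $write;
}

my $dbh = DBI->connect( 'dbi:Pg:dbname=mediacloud', '', '', { RaiseError => 1 } );    # placeholder DSN
my @queue;

while ( 1 )
{
    # The problem in a nutshell: while this refill query runs, nothing at
    # all gets handed to the fetchers.  (The real query also flips the
    # enum state and applies the per-site limits; this placeholder doesn't.)
    if ( @queue < 10_000 )
    {
        push @queue, @{ $dbh->selectcol_arrayref(
            "SELECT downloads_id FROM downloads WHERE state = 'pending' LIMIT 10000" ) };
    }
    if ( !@queue ) { sleep 5; next; }
    my $downloads_id = shift @queue;
    print { $fetcher_pipes[ $downloads_id % $NUM_FETCHERS ] } "$downloads_id\n";
}

sub fetch_download { }    # stub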
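And here's roughly the split I have in mind, with the same caveats about placeholder names: a "provider" child owns the slow query and streams primary keys over a pipe, so the dispatch loop only ever blocks on the pipe. The UPDATE ... RETURNING is just one way the provider could claim a batch by flipping the enum state; the real version would keep the per-site throttling logic in that query:

#!/usr/bin/env perl
# Sketch of the proposed split: a provider child runs the slow query and
# streams downloads_id values to the dispatcher over a pipe.
use strict;
use warnings;
use DBI;
use IO::Handle;

pipe( my $from_provider, my $to_dispatcher ) or die $!;
$to_dispatcher->autoflush( 1 );

my $pid = fork();
die $! unless defined $pid;

if ( $pid == 0 )
{
    # Provider: claim a batch by flipping the enum state so the same rows
    # aren't handed out twice, then push the ids down the pipe.
    close $from_provider;
    my $dbh = DBI->connect( 'dbi:Pg:dbname=mediacloud', '', '', { RaiseError => 1 } );    # placeholder DSN
    my $sth = $dbh->prepare( <<SQL );
UPDATE downloads
   SET state = 'queued'
 WHERE downloads_id IN
       ( SELECT downloads_id FROM downloads WHERE state = 'pending' LIMIT 1000 )
RETURNING downloads_id
SQL
    while ( 1 )
    {
        $sth->execute();
        my $handed_out = 0;
        while ( my ( $downloads_id ) = $sth->fetchrow_array() )
        {
            print {$to_dispatcher} "$downloads_id\n";
            $handed_out++;
        }
        sleep 5 unless $handed_out;    # nothing pending; back off before retrying
    }
}

# Dispatcher: never touches the database, so a slow query no longer
# stalls the crawl; it just drains the pipe and feeds the fetchers.
close $to_dispatcher;
while ( my $downloads_id = <$from_provider> )
{
    chomp $downloads_id;
    dispatch_to_fetcher( $downloads_id );    # hand off to a fetcher pipe as in the sketch above
}

sub dispatch_to_fetcher { }    # stub

As long as the provider can claim rows faster than the fetchers drain them, the query time stops being the bottleneck.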
On Wed, Apr 3, 2013 at 5:43 PM, Tom Metro <tmetro+boston...@gmail.com> wrote:

> David Larochelle wrote:
> > Currently, the driver process periodically queries a database to get
> > a list of URLs to [crawl]. It then stores these url's to be
> > downloaded in a complex in memory [structure?] and pipes them to
> > separate processes that do the actual downloading.
> >
> > The problem is that the database queries are slow and block
> > the driver process.
>
> Your description leaves out a lot of details about what sort of data is
> being passed back and forth between the different processing stages.
>
> So the first stage is querying a database to get a list of URLs. You
> then say it stores the URL in a complex memory structure? Why? Why isn't
> the input to the next stage simply a URL?
>
> If you were running a collection of crawler processes, and each had its
> own code to retrieve a URL from the database, it wouldn't matter if it
> blocked. Is there enough meta-data to allow independent crawler
> processes to each pick a URL without overlapping with the other processes?
>
> Another possibility is to create some middleware. A server that queries
> the database, builds a queue in memory, then accepts connections from
> the crawler processes and hands out a URL from the queue to each.
>
> Without knowing what sort of data you are exchanging, and how frequently,
> I can't say whether in-memory IPC and threads are good/necessary
> solutions, or if you'd be better off just running a bunch of independent
> processes.
>
> It's hard to say from what you've described so far, but this is sounding
> like a map-reduce problem. If you followed that algorithm, the first
> stage builds the list of URLs to crawl. The second stage spawns a pile
> of children to crawl the URLs individually or in batches, and produces
> intermediary results. The final stage aggregates the results.
>
> (There are some existing modules and tools you could use to implement
> this. Hadoop, for example.
> http://en.wikipedia.org/wiki/Map_reduce
> http://search.cpan.org/~drrho/Parallel-MapReduce-0.09/lib/Parallel/MapReduce.pm
> http://www.slideshare.net/philwhln/map-reduce-using-perl
> http://hadoop.apache.org/
> )
>
> Whatever you choose for IPC, I'd give consideration to how that
> mechanism could be used across a cluster of machines, so you have the
> option to scale up.
>
>  -Tom
>
> --
> Tom Metro
> Venture Logic, Newton, MA, USA
> "Enterprise solutions through open source."
> Professional Profile: http://tmetro.venturelogic.com/
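For what it's worth, the middleware Tom describes above would be roughly this shape: a minimal, single-threaded sketch only, with a made-up port, a made-up url column, a line-based protocol, and no error handling or background refill:

#!/usr/bin/env perl
# Sketch of the middleware idea: a small server that queries the database,
# keeps a queue in memory, and hands one URL to each crawler connection.
use strict;
use warnings;
use DBI;
use IO::Socket::INET;

my $server = IO::Socket::INET->new(
    LocalPort => 9999,    # made-up port
    Listen    => 10,
    Reuse     => 1,
) or die "can't listen: $!";

my $dbh = DBI->connect( 'dbi:Pg:dbname=mediacloud', '', '', { RaiseError => 1 } );    # placeholder DSN
my @queue;

while ( my $client = $server->accept() )
{
    # Refill lazily; a real version would refill in the background and
    # mark the rows as queued so they aren't handed out twice.
    if ( !@queue )
    {
        @queue = @{ $dbh->selectcol_arrayref(
            "SELECT url FROM downloads WHERE state = 'pending' LIMIT 1000" ) };
    }
    my $url = @queue ? shift @queue : '';
    print {$client} "$url\n";
    close $client;
}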