Thanks for all the feedback. I left out a lot of details about the system because I didn't want to complicate things.
The purpose of the system is to comprehensively study online media. We need the system to run 24 hours a day to download news articles from media sources such as the New York Times. We don't need to download articles instantly, but we do need to make sure we don't miss articles and that we're able to download articles before they're taken down. We're using PostgreSQL 8.4 and running on Ubuntu, and almost all data is stored in the database.

The system contains a list of media sources with associated RSS feeds. We have a downloads table that holds all of the URLs that we want to download or have downloaded in the past; it currently has ~200 million rows. We add downloads for the RSS feed of each source to the downloads table a few times a day. When these RSS feeds are downloaded, they are analyzed for new stories, and we add a download entry for each new story that we encounter. An indexed enum field in the database tells us whether a download has already been downloaded, needs to be downloaded, is queued in memory, or is currently being downloaded. We have ~1.5 million URLs that need to be downloaded at any one time.

We have an engine process that forks off a bunch of fetcher processes. The engine queries the database for downloads and stores them in an in-memory queue. It works through that queue and pipes the primary keys of downloads to the fetcher processes, which download and process them. When the in-memory queue drops below a certain point, the engine queries the database for more downloads. (There's a stripped-down sketch of this arrangement below, before the quoted message.) This system worked fine for many years.

However, we've recently increased the number of media sources and are running into problems. The problem is that the engine is a single process, and while it's querying the database, no downloads can be passed to fetchers. We spent about a day trying to optimize the database queries, but they're still slower than we would like. The exact reasons why the db queries are slow are complicated; part of the problem is that some of the logic that prevents us from excessively downloading from any one site is encapsulated in SQL queries.

My hope is to split the engine process into two pieces that run in parallel: one to query the database and another to send downloads to fetchers. That way it won't matter how long the db query takes, as long as we can get URLs from the database faster than we can download them. (The second sketch below shows roughly what I have in mind.)

There's a case to be made for switching this system to a more distributed architecture. However, it's a large code base and we're very tied to PostgreSQL; the code that consumes the downloads also uses PostgreSQL. We've had some performance issues aggregating the data, for which we've been considering NoSQL solutions, but that would be a major rewrite and a version of the crawler needs to be available constantly.
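To make the current setup concrete, here's a stripped-down sketch of the engine and fetchers. The table and column names (downloads, downloads_id, state), the 'pending' state value, the database name, and fetch_download() are placeholders rather than our actual schema or code, and the real refill query also enforces the per-site download limits:

#!/usr/bin/env perl
# Sketch of the current engine: one process queries the database into an
# in-memory queue and pipes primary keys to forked fetchers.
use strict;
use warnings;
use DBI;
use IO::Handle;

my $NUM_FETCHERS = 8;
my @fetcher_pipes;

# Fork the fetchers; each one reads downloads_id values from its own pipe.
for ( 1 .. $NUM_FETCHERS )
{
    pipe( my $read, my $write ) or die $!;
    my $pid = fork();
    die $! unless defined $pid;
    if ( $pid == 0 )
    {
        close $write;
        while ( my $downloads_id = <$read> )
        {
            chomp $downloads_id;
            fetch_download( $downloads_id );    # placeholder for the real fetch + processing
        }
        exit 0;
    }
    close $read;
    $write->autoflush( 1 );
    push @fetcher_pipes, $write;
}

my $dbh = DBI->connect( 'dbi:Pg:dbname=mediacloud', '', '', { RaiseError => 1 } );    # placeholder DSN
my @queue;

while ( 1 )
{
    # The problem in a nutshell: while this refill query runs, nothing at
    # all gets handed to the fetchers.  (The real query also flips the
    # enum state and applies the per-site limits; this placeholder doesn't.)
    if ( @queue < 10_000 )
    {
        push @queue, @{ $dbh->selectcol_arrayref(
            "SELECT downloads_id FROM downloads WHERE state = 'pending' LIMIT 10000" ) };
    }
    if ( !@queue ) { sleep 5; next; }
    my $downloads_id = shift @queue;
    print { $fetcher_pipes[ $downloads_id % $NUM_FETCHERS ] } "$downloads_id\n";
}

sub fetch_download { }    # stub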
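And here's roughly the split I have in mind, with the same caveats about placeholder names: a "provider" child owns the slow query and streams primary keys over a pipe, so the dispatch loop only ever blocks on the pipe. The UPDATE ... RETURNING is just one way the provider could claim a batch by flipping the enum state; the real version would keep the per-site throttling logic in that query:

#!/usr/bin/env perl
# Sketch of the proposed split: a provider child runs the slow query and
# streams downloads_id values to the dispatcher over a pipe.
use strict;
use warnings;
use DBI;
use IO::Handle;

pipe( my $from_provider, my $to_dispatcher ) or die $!;
$to_dispatcher->autoflush( 1 );

my $pid = fork();
die $! unless defined $pid;

if ( $pid == 0 )
{
    # Provider: claim a batch by flipping the enum state so the same rows
    # aren't handed out twice, then push the ids down the pipe.
    close $from_provider;
    my $dbh = DBI->connect( 'dbi:Pg:dbname=mediacloud', '', '', { RaiseError => 1 } );    # placeholder DSN
    my $sth = $dbh->prepare( <<SQL );
UPDATE downloads
   SET state = 'queued'
 WHERE downloads_id IN
       ( SELECT downloads_id FROM downloads WHERE state = 'pending' LIMIT 1000 )
RETURNING downloads_id
SQL
    while ( 1 )
    {
        $sth->execute();
        my $handed_out = 0;
        while ( my ( $downloads_id ) = $sth->fetchrow_array() )
        {
            print {$to_dispatcher} "$downloads_id\n";
            $handed_out++;
        }
        sleep 5 unless $handed_out;    # nothing pending; back off before retrying
    }
}

# Dispatcher: never touches the database, so a slow query no longer
# stalls the crawl; it just drains the pipe and feeds the fetchers.
close $to_dispatcher;
while ( my $downloads_id = <$from_provider> )
{
    chomp $downloads_id;
    dispatch_to_fetcher( $downloads_id );    # hand off to a fetcher pipe as in the sketch above
}

sub dispatch_to_fetcher { }    # stub

As long as the provider can claim rows faster than the fetchers drain them, the query time stops being the bottleneck.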
On Wed, Apr 3, 2013 at 5:43 PM, Tom Metro <tmetro+boston...@gmail.com> wrote:

> David Larochelle wrote:
> > Currently, the driver process periodically queries a database to get
> > a list of URLs to [crawl]. It then stores these url's to be
> > downloaded in a complex in memory [structure?] and pipes them to
> > separate processes that do the actual downloading.
> >
> > The problem is that the database queries are slow and block
> > the driver process.
>
> Your description leaves out a lot of details about what sort of data is
> being passed back and forth between the different processing stages.
>
> So the first stage is querying a database to get a list of URLs. You
> then say it stores the URL in a complex memory structure? Why? Why isn't
> the input to the next stage simply a URL?
>
> If you were running a collection of crawler processes, and each had its
> own code to retrieve a URL from the database, it wouldn't matter if it
> blocked. Is there enough meta-data to allow independent crawler
> processes to each pick a URL without overlapping with the other processes?
>
> Another possibility is to create some middleware. A server that queries
> the database, builds a queue in memory, then accepts connections from
> the crawler processes and hands out a URL from the queue to each.
>
> Without knowing what sort of data you are exchanging, and how frequently,
> I can't say whether in-memory IPC and threads are good/necessary
> solutions, or if you'd be better off just running a bunch of independent
> processes.
>
> It's hard to say from what you've described so far, but this is sounding
> like a map-reduce problem. If you followed that algorithm, the first
> stage builds the list of URLs to crawl. The second stage spawns a pile
> of children to crawl the URLs individually or in batches, and produces
> intermediary results. The final stage aggregates the results.
>
> (There are some existing modules and tools you could use to implement
> this. Hadoop, for example.
> http://en.wikipedia.org/wiki/Map_reduce
> http://search.cpan.org/~drrho/Parallel-MapReduce-0.09/lib/Parallel/MapReduce.pm
> http://www.slideshare.net/philwhln/map-reduce-using-perl
> http://hadoop.apache.org/
> )
>
> Whatever you choose for IPC, I'd give consideration to how that
> mechanism could be used across a cluster of machines, so you have the
> option to scale up.
>
>  -Tom
>
> --
> Tom Metro
> Venture Logic, Newton, MA, USA
> "Enterprise solutions through open source."
> Professional Profile: http://tmetro.venturelogic.com/
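For what it's worth, the middleware Tom describes above would be roughly this shape: a minimal, single-threaded sketch only, with a made-up port, a made-up url column, a line-based protocol, and no error handling or background refill:

#!/usr/bin/env perl
# Sketch of the middleware idea: a small server that queries the database,
# keeps a queue in memory, and hands one URL to each crawler connection.
use strict;
use warnings;
use DBI;
use IO::Socket::INET;

my $server = IO::Socket::INET->new(
    LocalPort => 9999,    # made-up port
    Listen    => 10,
    Reuse     => 1,
) or die "can't listen: $!";

my $dbh = DBI->connect( 'dbi:Pg:dbname=mediacloud', '', '', { RaiseError => 1 } );    # placeholder DSN
my @queue;

while ( my $client = $server->accept() )
{
    # Refill lazily; a real version would refill in the background and
    # mark the rows as queued so they aren't handed out twice.
    if ( !@queue )
    {
        @queue = @{ $dbh->selectcol_arrayref(
            "SELECT url FROM downloads WHERE state = 'pending' LIMIT 1000" ) };
    }
    my $url = @queue ? shift @queue : '';
    print {$client} "$url\n";
    close $client;
}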