Re: [Boston.pm] Passing large complex data structures between process

2013-04-04 Thread Morse, Richard E.MGH
On Apr 3, 2013, at 10:34 AM, David Larochelle da...@larochelle.name wrote:

 Currently, the driver process periodically queries a database to get a
 list of URLs to crawl. It then stores these URLs to be downloaded in a
 complex in-memory structure and pipes them to separate processes that do
 the actual downloading. The problem is that the database queries are slow
 and block the driver process.

Hi! The first thing I would do is profile the database queries. If they are
very slow, there are likely steps that could speed them up. Possibly an index
could help. Maybe do something with a materialized view? Or possibly look at
adjusting the database definition? How does the data get in -- is there a
simple step you could take when adding data to the database that would put
the URLs in some other table somewhere, so that your query can change from a
complex multi-join query to a simpler SELECT * FROM table?
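
For instance, something along these lines -- just a sketch, with made-up table
and column names (urls_to_crawl, downloads), not a drop-in fix -- would keep
the driver's hot-path query trivial:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=crawler', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    # When a URL is added, also drop it into a small side table, so the
    # driver never has to run the expensive multi-join query on the hot path.
    sub add_url {
        my ($url) = @_;
        $dbh->do('INSERT INTO downloads (url) VALUES (?)',     undef, $url);
        $dbh->do('INSERT INTO urls_to_crawl (url) VALUES (?)', undef, $url);
        $dbh->commit;
    }

    # The driver's query then becomes a plain scan of the small table:
    my $urls = $dbh->selectcol_arrayref(
        'SELECT url FROM urls_to_crawl LIMIT 1000'
    );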

Ricky


Re: [Boston.pm] Passing large complex data structures between process

2013-04-04 Thread David Larochelle
Thanks for all the feedback. I left out a lot of details about the system
because I didn't want to complicate things.

The purpose of the system is to comprehensively study online media. We need
the system to run 24 hours a day downloading news articles from media sources
such as the New York Times. We don't need to download articles instantly,
but we do need to make sure we don't miss articles and that we're able to
download them before they're taken down.

We're using PostgreSQL 8.4 and running on Ubuntu. Almost all data is stored
in the database. The system contains a list of media sources with
associated RSS feeds. We have a downloads table that has all of the URLs
that we want to download or have downloaded in the past. This table
currently has ~200 million rows. We add downloads for the RSS feed of each
source to the downloads table a few times a day. When these RSS feeds are
downloaded, they are analyzed for new stories. We add download entries for
each new story that we encounter.  We have an indexed enum field in the
database that lets us know if a download in the downloads table has already
been downloaded, needs to be downloaded, is queued in memory, or is
currently being downloaded. We have ~1.5 million URLs that need to be
downloaded at any one time.
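
In rough terms the relevant part looks something like the following -- a
simplified sketch, not our actual schema (the names here are invented):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=mediacloud', '', '',
                           { RaiseError => 1 });

    $dbh->do(q{
        CREATE TYPE download_state AS ENUM
            ('pending', 'queued', 'fetching', 'success', 'error')
    });

    $dbh->do(q{
        CREATE TABLE downloads (
            downloads_id  bigserial PRIMARY KEY,
            url           text NOT NULL,
            state         download_state NOT NULL DEFAULT 'pending'
        )
    });

    # The indexed enum field mentioned above: candidates are found by state.
    $dbh->do(q{
        CREATE INDEX downloads_state_idx ON downloads (state)
    });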

We have an engine process that forks off a bunch of fetcher processes. The
engine process queries the database for downloads and stores them in an
in-memory queue. The engine works through this in-memory queue and pipes
the primary keys of the downloads to fetcher processes, which download and
process them. When the in-memory queue drops below a certain point, the
engine queries the database for more downloads.
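
Stripped way down, the shape of it is something like this sketch (not our
real code -- fetch_and_store and get_downloads_from_db stand in for the real
routines):

    use strict;
    use warnings;
    use IO::Handle;

    my $num_fetchers = 4;
    my @fetcher_fhs;

    for my $i (1 .. $num_fetchers) {
        pipe(my $reader, my $writer) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;

        if ($pid == 0) {                    # child: fetcher
            close $writer;
            while (my $downloads_id = <$reader>) {
                chomp $downloads_id;
                fetch_and_store($downloads_id);
            }
            exit 0;
        }

        close $reader;                      # parent: engine keeps the write end
        $writer->autoflush(1);
        push @fetcher_fhs, $writer;
    }

    # Engine loop: drain the in-memory queue, round-robining IDs to the
    # fetchers (the "top up the queue when it runs low" part is omitted).
    my @queue = get_downloads_from_db();
    my $next  = 0;
    for my $downloads_id (@queue) {
        my $fh = $fetcher_fhs[ $next++ % @fetcher_fhs ];
        print {$fh} "$downloads_id\n";
    }

    close $_ for @fetcher_fhs;
    wait() for 1 .. $num_fetchers;

    sub fetch_and_store       { my ($id) = @_; print "fetcher $$ got download $id\n" }
    sub get_downloads_from_db { return (1 .. 20) }     # stand-in for the real query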

This system worked fine for many years. However, we've recently increased
the number of media sources and are running into problems. The problem is
that the engine is a single process and while it's querying the database,
no downloads can be passed to fetchers.  We spent about a day trying to
optimize the database queries but they're still slower than we would like.
The exact reasons why the db queries are slow are complicated. Part of the
problem is that some of the logic that prevents us from excessively
downloading from any one site is encapsulated in SQL queries.

My hope is to split the engine process into two pieces that run in
parallel: one to query the database and another to send downloads to
fetchers. This way it won't matter how long the db query takes as long as
we can get URLs from the database faster than we can download them.
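
Something like the following sketch is what I have in mind -- placeholder
subs stand in for the real query and for the existing hand-off to fetchers:

    use strict;
    use warnings;
    use IO::Handle;

    pipe(my $from_provider, my $to_dispatcher) or die "pipe: $!";

    my $pid = fork();
    die "fork: $!" unless defined $pid;

    if ($pid == 0) {                     # provider: talks to PostgreSQL
        close $from_provider;
        $to_dispatcher->autoflush(1);
        while (1) {
            my @ids = query_pending_downloads();    # the slow query lives here
            print {$to_dispatcher} "$_\n" for @ids;
            sleep 5 unless @ids;                    # back off when nothing is pending
        }
        exit 0;
    }

    close $to_dispatcher;                # dispatcher: never waits on the database
    while (my $downloads_id = <$from_provider>) {
        chomp $downloads_id;
        send_to_fetcher($downloads_id);             # hand off exactly as before
    }

    sub query_pending_downloads { return () }       # placeholder for the real query
    sub send_to_fetcher { my ($id) = @_; print "dispatching $id\n" }

As long as the pipe has IDs buffered in it, a slow refill query only delays
future work rather than the dispatch of what's already queued.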

There's a case to be made for switching this system to a more distributed
architecture. However, it's a large code base and we're very tied to
PostgreSQL. The code that consumes the downloads also uses PostgreSQL.

We've had some performance issues aggregating the data, for which we've
considered NoSQL solutions. But that would be a major rewrite, and a
version of the crawler needs to be available constantly.


On Wed, Apr 3, 2013 at 5:43 PM, Tom Metro tmetro+boston...@gmail.com wrote:

 David Larochelle wrote:
  Currently, the driver process periodically queries a database to get
  a list of URLs to [crawl]. It then stores these url's to be
  downloaded in a complex in memory [structure?] and pipes them to
  separate processes that do the actual downloading.
 
  The problem is that the database queries are slow and block
  the driver process.

 Your description leaves out a lot of details about what sort of data is
 being passed back and forth between the different processing stages.

 So the first stage is querying a database to get a list of URLs. You
 then say it stores the URL in a complex memory structure? Why? Why isn't
 the input to the next stage simply a URL?

 If you were running a collection of crawler processes, and each had its
 own code to retrieve a URL from the database, it wouldn't matter if it
 blocked. Is there enough meta-data to allow independent crawler
 processes to each pick a URL without overlapping with the other processes?

 Another possibility is to create some middleware. A server that queries
 the database, builds a queue in memory, then accepts connections from
 the crawler processes and hands out a URL from the queue to each.

 Without knowing what sort of data you are exchanging, and how frequently, I
 can't say whether in-memory IPC and threads are good/necessary
 solutions, or if you'd be better off just running a bunch of independent
 processes.

 It's hard to say from what you've described so far, but this is sounding
 like a map-reduce problem. If you followed that algorithm, the first
 stage builds the list of URLs to crawl. The second stage spawns a pile
 of children to crawl the URLs individually or in batches, and produces
 intermediary results. The final stage aggregates the results.

 (There are some existing modules and tools you could use to implement
 this. Hadoop, for example.
 http://en.wikipedia.org/wiki/Map_reduce

 

Re: [Boston.pm] Passing large complex data structures between process

2013-04-04 Thread Anthony Caravello
This sounds like a perfect fit for a queuing service like RabbitMQ.
Logstash uses Redis lists for this, as they're simple to set up and pretty
reliable, but there are many such applications available.  The queues
would allow multiple backend processes to check for and take items as they
become available.

There are Perl modules available for communicating with Redis (if you don't
want to learn STOMP)...very easy to implement.
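
For example, with the CPAN Redis module (just a sketch -- the queue name is
made up and this assumes a redis-server on localhost):

    use strict;
    use warnings;
    use Redis;

    my $redis = Redis->new(server => 'localhost:6379');

    # Producer side (the process that talks to PostgreSQL):
    $redis->rpush('downloads:pending', $_) for 1001, 1002, 1003;

    # Consumer side (a fetcher): block until an item is available.
    my ($queue, $downloads_id) = $redis->blpop('downloads:pending', 0);
    print "got download $downloads_id from $queue\n";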

Tony C

On Thu, Apr 4, 2013 at 4:21 PM, David Larochelle da...@larochelle.name wrote:

 [...]


Re: [Boston.pm] Passing large complex data structures between process

2013-04-04 Thread Gyepi SAM
On Thu, Apr 04, 2013 at 04:21:54PM -0400, David Larochelle wrote:
 My hope is to split the engine process into two pieces that ran in
 parallel: one to query the database and another to send downloads to
 fetchers. This way it won't matter how long the db query takes as long as
 we can get URLs from the database faster than we can download them.

If this is indeed the bottleneck, then I would think that splitting the
engine into two communicating processes would solve the problem.

Using files, as I mentioned in my previous email, may be considered old
school, but it is probably the simplest communication method.
Unfortunately, it won't work once your system scales to multiple machines.
As described, it sounds like the current system runs on a single machine,
so this may not be an immediate problem. If you are planning to scale across
machines, then I'd second the recommendation to use a message queue instead.
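
For what it's worth, the file-based version can be as simple as this sketch
(the spool path and naming scheme are invented): write each batch to a temp
file and rename() it into a spool directory. rename is atomic within a
filesystem, so a fetcher never sees a half-written batch.

    use strict;
    use warnings;

    my $spool = '/var/spool/crawler';    # made-up location

    # Engine side: write a batch of download IDs, then rename it into place.
    sub write_batch {
        my (@ids) = @_;
        my $name = "batch-$$-" . time();
        open my $fh, '>', "$spool/.$name.tmp" or die "open: $!";
        print {$fh} "$_\n" for @ids;
        close $fh or die "close: $!";
        rename "$spool/.$name.tmp", "$spool/$name.work" or die "rename: $!";
    }

    # Fetcher side: claim a batch by renaming it, so no two fetchers get it.
    sub claim_batch {
        my ($file) = glob("$spool/*.work") or return;
        my $claimed = "$file.$$";
        rename $file, $claimed or return;    # lost the race; try again later
        open my $fh, '<', $claimed or die "open: $!";
        chomp(my @ids = <$fh>);
        close $fh;
        return @ids;
    }

    write_batch(1001, 1002, 1003);
    my @work = claim_batch();
    print "claimed: @work\n";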

-Gyepi


Re: [Boston.pm] Passing large complex data structures between process

2013-04-04 Thread John Redford

David Larochelle wrote:
 
[...]
 We're using PostgreSQL 8.4 and running on Ubuntu. Almost all data is stored
 in the database. The system contains a list of media sources with associated
 RSS feeds. We have a downloads table that has all of the URLs that we want to
 download or have downloaded in the past. This table currently has ~200
 million rows. We add downloads for the RSS feed of each source to the
 downloads table a few times a day. When these RSS feeds are downloaded, they
 are analyzed for new stories. We add download entries for each new story that
 we encounter. We have an indexed enum field in the database that lets us know
 if a download in the downloads table has already been downloaded, needs to be
 downloaded, is queued in memory, or is currently being downloaded. We have
 ~1.5 million URLs that need to be downloaded at any one time.

This does sound like it will not scale. Which is only to say what you have
said.

Keeping all that data in one table means you will have to lock it for all
your operations, preventing parallelism.  Keeping all that data in one table
means it will always get larger and slower.  Keeping all that data in one
table means conflicting optimization goals, and thus little optimization.

I would suggest breaking your data up into a number of tables, such that the
purpose of each would be more focused and amenable to optimizing for reading
or writing -- for example, you could have one table of "needs", one of
"queued", one of "downloading" and one of "downloaded", moving the data
along from table to table in batches.  Thus, at any given moment, your
"needs" table would only contain 1.5 million rows, rather than 200 million
-- it will scale with your current workload rather than having to scale
with all the work you've ever done.
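
In sketch form (table and column names invented here), the hand-off from one
stage to the next is just a batched move inside a transaction:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=mediacloud', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    # Move one batch of work from the "needs" table to the "queued" table.
    $dbh->do(q{
        INSERT INTO downloads_queued (downloads_id, url)
        SELECT downloads_id, url
          FROM downloads_needs
         ORDER BY downloads_id
         LIMIT 1000
    });
    $dbh->do(q{
        DELETE FROM downloads_needs
         WHERE downloads_id IN (SELECT downloads_id FROM downloads_queued)
    });
    $dbh->commit;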

One could suggest having separate queue tables set up for each of your
downloading systems.  Thus your main "find work to do" query, which has the
logic in it to avoid having too many URLs for a single target, would query
the 1.5 million row "needs" table and move rows into the queue table
associated with a downloader -- the downloader would simply need to perform
a trivial select against those few-hundred/few-thousand rows,
nigh-instantly getting its fresh load of work.  As data is downloaded, each
downloader could move rows from its own queue table to the ~200 million row
"downloaded" table.
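
The downloader's side of that is then almost nothing -- again a sketch with
invented names, where downloads_queue_3 is fetcher number 3's private table:

    use strict;
    use warnings;
    use DBI;

    my $fetcher_id = 3;
    my $queue_tbl  = "downloads_queue_$fetcher_id";

    my $dbh = DBI->connect('dbi:Pg:dbname=mediacloud', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    # The fetcher's "find work" query is a trivial scan of its own tiny table.
    my $rows = $dbh->selectall_arrayref(
        "SELECT downloads_id, url FROM $queue_tbl", { Slice => {} }
    );

    fetch_and_store( $_->{downloads_id}, $_->{url} ) for @$rows;

    # When the batch is done, archive it and clear the queue in one transaction.
    $dbh->do("INSERT INTO downloads_done SELECT * FROM $queue_tbl");
    $dbh->do("DELETE FROM $queue_tbl");
    $dbh->commit;

    sub fetch_and_store { }    # placeholder for the real download/process code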

Since every downloader would have its own table, they would not conflict on
locks there.  The write locks to insert into the "downloaded" table would not
conflict with any read locks, as you wouldn't read from that table to find
work to do.  Indeed, you should avoid having any read-friendly indexes on
that table -- by having it be non-indexed, inserting data into it will be as
fast as possible.

All of these steps could be made even more efficient by using batched
queries and appropriate units of work -- for example, if a queue table held
a hundred URLs, the downloading system could refrain from moving them into
the "downloaded" table until they were all complete -- thus it would only
need a single insert-from-select plus delete to move those hundred records
in one transaction -- and it would not have to actually send any URLs over
the network back to the database.  A further efficiency could be gained by
allotting two queue tables to each downloader, such that at any given
moment, the downloader was working to process one of them, while the
queue-master was working to find work to fill the other.

If you already have a significant investment in logic within this database,
you could leverage that by using procedures & cursors to fill queue tables
and switch queue tables while processing the flow of results from your "find
work to do" logic -- cursors will generally get you access to the results as
soon as possible, before the query is complete.
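
With DBI and DBD::Pg that pattern might look roughly like this (the query and
names are invented; the point is only the FETCH-in-batches loop):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=mediacloud', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do(q{
        DECLARE find_work CURSOR FOR
            SELECT downloads_id, url
              FROM downloads_needs
             ORDER BY downloads_id
    });

    while (1) {
        my $batch = $dbh->selectall_arrayref('FETCH 100 FROM find_work');
        last unless @$batch;
        queue_batch(@$batch);          # hypothetical hand-off to a queue table
    }

    $dbh->do('CLOSE find_work');
    $dbh->commit;

    sub queue_batch { }                # placeholder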

By using separate tables, you could also migrate to such a system without
impacting the current one, at the slight costs of some brief redundancy or
the use of views -- depending on whether you would rather consume time or space --
but essentially your database would work both ways simultaneously for the
crossover.

You could also split your "downloaded" table up into multiple
per-month/per-downloader/per-something tables -- if you have per-downloader
tables here, that would avoid write-lock conflicts entirely -- and at any
point where you want to select data out, a union across all tables, or
periodic movement of the data into a single master table, would suffice.
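
For example (a sketch with invented names): the per-downloader tables take
the inserts, and a view stitches them back together for ad-hoc reads.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=mediacloud', '', '',
                           { RaiseError => 1 });

    $dbh->do(q{
        CREATE OR REPLACE VIEW downloads_done_all AS
            SELECT * FROM downloads_done_fetcher_1
            UNION ALL SELECT * FROM downloads_done_fetcher_2
            UNION ALL SELECT * FROM downloads_done_fetcher_3
    });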

Fundamentally, you want to put the data into tables which have indexes (or
not) and access patterns which will be efficient for the role of the data at
that point in time, and not simply leave it all in one pile of conflicting
indexes & locks.

 [...]

Various other comments have mentioned ways to distribute the workload once
you get it out of the database.  In short, that would be a mistake.  What
that is basically suggesting is that you have a separate table of your
active work, but that