David Larochelle wrote:
> [...]
> We're using PostgreSQL 8.4 and running on Ubuntu. Almost all data is
> stored in the database. The system contains a list of media sources with
> associated RSS feeds. We have a downloads table that has all of the URLs
> that we want to download or have downloaded in the past. This table
> currently has ~200 million rows. We add downloads for the RSS feed of
> each source to the downloads table a few times a day. When these RSS
> feeds are downloaded, they are analyzed for new stories. We add download
> entries for each new story that we encounter. We have an indexed enum
> field in the database that lets us know if a download in the downloads
> table has already been downloaded, needs to be downloaded, is queued in
> memory, or is currently being downloaded. We have ~1.5 million URLs that
> need to be downloaded at any one time.
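(For orientation, the table described above presumably looks roughly like the sketch below -- the column names and enum labels are my guesses, not the actual schema.)

    -- A guess at the current single-table layout; all names are hypothetical.
    CREATE TYPE download_state AS ENUM
        ('needs_download', 'queued', 'downloading', 'downloaded');

    CREATE TABLE downloads (
        downloads_id    bigserial PRIMARY KEY,
        media_source_id integer,
        url             text NOT NULL,
        state           download_state NOT NULL DEFAULT 'needs_download'
    );

    -- The indexed enum field: every reader and every writer contends here.
    CREATE INDEX downloads_state_idx ON downloads (state);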
This does sound like it will not scale -- which is only to say what you have already said yourself. Keeping all that data in one table means you will have to lock it for all your operations, preventing parallelism. Keeping all that data in one table means it will always get larger and slower. Keeping all that data in one table means conflicting optimization goals, and thus little optimization.

I would suggest breaking your data up into a number of tables, such that the purpose of each is more focused and amenable to optimizing for reading or writing. For example, you could have one table of "needs", one of "queued", one of "downloading", and one of "downloaded", moving the data along from table to table in batches. Thus, at any given moment, your "needs" table would contain only 1.5 million rows rather than 200 million -- it scales with your current workload rather than having to scale with all the work you've ever done.

Going further, you could set up a separate queue table for each of your downloading systems. Your main "find work to do" query -- the one carrying the logic that avoids queuing too many URLs for a single target -- would then query the 1.5-million-row "needs" table and move rows into the queue table associated with a downloader. The downloader would simply perform a trivial "select" against those few hundred or few thousand rows, getting its fresh load of work nearly instantly.

As data is downloaded, each downloader would move rows from its own queue table to the ~200-million-row "downloaded" table. Since every downloader has its own queue table, they would not conflict on locks there, and the write locks taken to insert into the downloaded table would not conflict with any read locks, because you would never read from that table to find work to do. Indeed, you should avoid putting any read-friendly indexes on that table at all -- by keeping it non-indexed, inserting data into it will be as fast as possible.

All of these steps can be made more efficient still by using batched queries and appropriate units of work. For example, if a queue table held a hundred URLs, the downloading system could refrain from moving them into the downloaded table until they were all complete; it would then need only one "insert ... select" plus one "delete" to move those hundred records in a single transaction, and it would never have to send any URLs back over the network to the database.

A further efficiency could be gained by allotting two queue tables to each downloader, such that at any given moment the downloader is processing one of them while the queue-master is finding work to fill the other.

If you already have a significant investment in logic within this database, you could leverage that by using procedures and cursors to fill the queue tables, switching queue tables while processing the flow of results from your "find work to do" logic -- cursors will generally get you access to the results as soon as possible, before the query is complete.

By using separate tables, you could also migrate to such a system without impacting the current one, at the slight cost of some brief redundancy or the use of views -- depending on whether you would rather spend time or space -- with your database essentially working both ways simultaneously during the crossover.
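To make that concrete, here is a minimal sketch of the queue-table approach in SQL. All table and column names are invented for illustration, the "don't overload a single target" condition is reduced to a placeholder, and the batch size of 100 is arbitrary; it avoids writable CTEs, so it should work on 8.4.

    -- Work not yet assigned to any downloader (~1.5 million rows).
    CREATE TABLE needs_download (
        downloads_id    bigint PRIMARY KEY,
        media_source_id integer,
        url             text NOT NULL
    );

    -- One small queue table per downloader (a few hundred/thousand rows).
    CREATE TABLE queue_downloader_1 (LIKE needs_download);

    -- Completed work (~200 million rows). No read-friendly indexes,
    -- so inserting into it stays as cheap as possible.
    CREATE TABLE downloaded (
        downloads_id    bigint,
        media_source_id integer,
        url             text,
        downloaded_at   timestamp
    );

    -- Queue-master: move one batch of work into downloader 1's queue.
    BEGIN;
    INSERT INTO queue_downloader_1
        SELECT downloads_id, media_source_id, url
        FROM   needs_download
        WHERE  media_source_id NOT IN          -- stand-in for your real
               (SELECT media_source_id         -- "find work to do" logic
                FROM   queue_downloader_1)
        LIMIT  100;
    DELETE FROM needs_download
        WHERE downloads_id IN (SELECT downloads_id FROM queue_downloader_1);
    COMMIT;

    -- Downloader 1: one trivial select fetches the whole batch.
    SELECT downloads_id, url FROM queue_downloader_1;

    -- When the whole batch is finished, move it in a single transaction;
    -- no URLs ever travel back over the network.
    BEGIN;
    INSERT INTO downloaded
        SELECT downloads_id, media_source_id, url, now()
        FROM   queue_downloader_1;
    DELETE FROM queue_downloader_1;
    COMMIT;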
You could also split your "downloaded" table up into multiple per-month, per-downloader, or per-something-else tables -- per-downloader tables would avoid write-lock conflicts entirely -- and at any point where you want to select data back out, either a "union" across all of the tables or a periodic movement of the data into a single master table would suffice (a sketch of the union approach follows at the end of this message). Fundamentally, you want to put the data into tables whose indexes (or lack of them) and access patterns are efficient for the role the data plays at that point in time, rather than leaving it all in one pile of conflicting indexes and locks.

> [...]

Various other comments have mentioned ways to distribute the workload once you get it out of the database. In short, that would be a mistake. What that basically suggests is keeping a separate "table" of your active work, but rather than keeping it in the database, where it can be extremely efficient, you slog it around over slow network connections and convert it into heavy Perl data structures. It amounts to accepting abysmal performance of the database as an unalterable facet of the natural world. While SQL servers have numerous flaws, they are, if used remotely properly, brilliant at facilitating parallel access to data -- you should be able to make this work very efficiently within just such a database if you structure your data for efficiency.
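To illustrate the split-up "downloaded" table, building on the "downloaded" layout from the sketch above (the per-downloader table names are again invented):

    -- One archive table per downloader, each written by only one process,
    -- so there is no write-lock contention at all.
    CREATE TABLE downloaded_downloader_1 (LIKE downloaded);
    CREATE TABLE downloaded_downloader_2 (LIKE downloaded);

    -- When you do need to read everything back, a union view suffices.
    CREATE VIEW downloaded_all AS
        SELECT * FROM downloaded_downloader_1
        UNION ALL
        SELECT * FROM downloaded_downloader_2;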