David Larochelle wrote:
> [...]
> We're using PostgreSQL 8.4 and running on Ubuntu. Almost all data is
> stored in the database. The system contains a list of media sources with
> associated RSS feeds. We have a downloads table that has all of the URLs
> that we want to download or have downloaded in the past. This table
> currently has ~200 million rows. We add downloads for the RSS feed of
> each source to the downloads table a few times a day. When these RSS
> feeds are downloaded, they are analyzed for new stories. We add download
> entries for each new story that we encounter. We have an indexed enum
> field in the database that lets us know if a download in the downloads
> table has already been downloaded, needs to be downloaded, is queued in
> memory, or is currently being downloaded. We have ~1.5 million URLs that
> need to be downloaded at any one time.
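(For orientation, the table described above presumably looks roughly like the sketch below -- the column names and enum labels are my guesses, not the actual schema.)

    -- A guess at the current single-table layout; all names are hypothetical.
    CREATE TYPE download_state AS ENUM
        ('needs_download', 'queued', 'downloading', 'downloaded');

    CREATE TABLE downloads (
        downloads_id    bigserial PRIMARY KEY,
        media_source_id integer,
        url             text NOT NULL,
        state           download_state NOT NULL DEFAULT 'needs_download'
    );

    -- The indexed enum field: every reader and every writer contends here.
    CREATE INDEX downloads_state_idx ON downloads (state);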
This does sound like it will not scale -- which is only to say what you have already said yourself. Keeping all that data in one table means you will have to lock it for all your operations, preventing parallelism. Keeping all that data in one table means it will always get larger and slower. Keeping all that data in one table means conflicting optimization goals, and thus little optimization.

I would suggest breaking your data up into a number of tables, such that the purpose of each is more focused and amenable to optimizing for reading or writing. For example, you could have one table of "needs", one of "queued", one of "downloading", and one of "downloaded", moving the data along from table to table in batches. Thus, at any given moment, your "needs" table would contain only 1.5 million rows rather than 200 million -- it scales with your current workload rather than having to scale with all the work you've ever done.

Going further, you could set up a separate queue table for each of your downloading systems. Your main "find work to do" query -- the one carrying the logic that avoids queuing too many URLs for a single target -- would then query the 1.5-million-row "needs" table and move rows into the queue table associated with a downloader. The downloader would simply perform a trivial "select" against those few hundred or few thousand rows, getting its fresh load of work nearly instantly.

As data is downloaded, each downloader would move rows from its own queue table to the ~200-million-row "downloaded" table. Since every downloader has its own queue table, they would not conflict on locks there, and the write locks taken to insert into the downloaded table would not conflict with any read locks, because you would never read from that table to find work to do. Indeed, you should avoid putting any read-friendly indexes on that table at all -- by keeping it non-indexed, inserting data into it will be as fast as possible.

All of these steps can be made more efficient still by using batched queries and appropriate units of work. For example, if a queue table held a hundred URLs, the downloading system could refrain from moving them into the downloaded table until they were all complete; it would then need only one "insert ... select" plus one "delete" to move those hundred records in a single transaction, and it would never have to send any URLs back over the network to the database.

A further efficiency could be gained by allotting two queue tables to each downloader, such that at any given moment the downloader is processing one of them while the queue-master is finding work to fill the other.

If you already have a significant investment in logic within this database, you could leverage that by using procedures and cursors to fill the queue tables, switching queue tables while processing the flow of results from your "find work to do" logic -- cursors will generally get you access to the results as soon as possible, before the query is complete.

By using separate tables, you could also migrate to such a system without impacting the current one, at the slight cost of some brief redundancy or the use of views -- depending on whether you would rather spend time or space -- with your database essentially working both ways simultaneously during the crossover.
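To make that concrete, here is a minimal sketch of the queue-table approach in SQL. All table and column names are invented for illustration, the "don't overload a single target" condition is reduced to a placeholder, and the batch size of 100 is arbitrary; it avoids writable CTEs, so it should work on 8.4.

    -- Work not yet assigned to any downloader (~1.5 million rows).
    CREATE TABLE needs_download (
        downloads_id    bigint PRIMARY KEY,
        media_source_id integer,
        url             text NOT NULL
    );

    -- One small queue table per downloader (a few hundred/thousand rows).
    CREATE TABLE queue_downloader_1 (LIKE needs_download);

    -- Completed work (~200 million rows). No read-friendly indexes,
    -- so inserting into it stays as cheap as possible.
    CREATE TABLE downloaded (
        downloads_id    bigint,
        media_source_id integer,
        url             text,
        downloaded_at   timestamp
    );

    -- Queue-master: move one batch of work into downloader 1's queue.
    BEGIN;
    INSERT INTO queue_downloader_1
        SELECT downloads_id, media_source_id, url
        FROM   needs_download
        WHERE  media_source_id NOT IN          -- stand-in for your real
               (SELECT media_source_id         -- "find work to do" logic
                FROM   queue_downloader_1)
        LIMIT  100;
    DELETE FROM needs_download
        WHERE downloads_id IN (SELECT downloads_id FROM queue_downloader_1);
    COMMIT;

    -- Downloader 1: one trivial select fetches the whole batch.
    SELECT downloads_id, url FROM queue_downloader_1;

    -- When the whole batch is finished, move it in a single transaction;
    -- no URLs ever travel back over the network.
    BEGIN;
    INSERT INTO downloaded
        SELECT downloads_id, media_source_id, url, now()
        FROM   queue_downloader_1;
    DELETE FROM queue_downloader_1;
    COMMIT;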
You could also split your "downloaded" table up into multiple per-month, per-downloader, or per-something-else tables -- per-downloader tables would avoid write-lock conflicts entirely -- and at any point where you want to select data back out, either a "union" across all of the tables or a periodic movement of the data into a single master table would suffice (a sketch of the union approach follows at the end of this message). Fundamentally, you want to put the data into tables whose indexes (or lack of them) and access patterns are efficient for the role the data plays at that point in time, rather than leaving it all in one pile of conflicting indexes and locks.

> [...]

Various other comments have mentioned ways to distribute the workload once you get it out of the database. In short, that would be a mistake. What that basically suggests is keeping a separate "table" of your active work, but rather than keeping it in the database, where it can be extremely efficient, you slog it around over slow network connections and convert it into heavy Perl data structures. It amounts to accepting abysmal performance of the database as an unalterable facet of the natural world. While SQL servers have numerous flaws, they are, if used remotely properly, brilliant at facilitating parallel access to data -- you should be able to make this work very efficiently within just such a database if you structure your data for efficiency.
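To illustrate the split-up "downloaded" table, building on the "downloaded" layout from the sketch above (the per-downloader table names are again invented):

    -- One archive table per downloader, each written by only one process,
    -- so there is no write-lock contention at all.
    CREATE TABLE downloaded_downloader_1 (LIKE downloaded);
    CREATE TABLE downloaded_downloader_2 (LIKE downloaded);

    -- When you do need to read everything back, a union view suffices.
    CREATE VIEW downloaded_all AS
        SELECT * FROM downloaded_downloader_1
        UNION ALL
        SELECT * FROM downloaded_downloader_2;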