Re: [Boston.pm] Passing large complex data structures between processes
On Apr 3, 2013, at 10:34 AM, David Larochelle da...@larochelle.name wrote:

Currently, the driver process periodically queries a database to get a list of URLs to crawl. It then stores these URLs to be downloaded in a complex in-memory structure and pipes them to separate processes that do the actual downloading. The problem is that the database queries are slow and block the driver process.

Hi!

The first thing I would do is to profile the database queries. If you have very slow database queries, it is likely there are steps that could speed them up. Possibly an index could help. Maybe do something with a materialized view? Or possibly look at rejiggering the database definition?

How does the data get in -- is there a simple step you could take when adding data to the database that would put the URLs in some other table somewhere, so that your query can change from a complex multi-join query to a simpler SELECT * FROM table?

Ricky

___ Boston-pm mailing list Boston-pm@mail.pm.org http://mail.pm.org/mailman/listinfo/boston-pm
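Ricky's side-table suggestion might look something like the following sketch -- SQL held in a Perl heredoc purely for illustration. The `pending_urls` table and its columns are invented here, not taken from the original system:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A sketch of the "simpler table" idea: maintain a small side table of
# URLs still to crawl, filled at insert time, so the crawler's query is
# a plain select rather than a complex multi-join.  All names invented.
my $pending_sql = <<'SQL';
-- created once:
CREATE TABLE pending_urls (downloads_id bigint PRIMARY KEY, url text);

-- run whenever a new download row is added (application code or a trigger):
INSERT INTO pending_urls (downloads_id, url) VALUES (?, ?);

-- the crawler's query then shrinks to:
SELECT downloads_id, url FROM pending_urls LIMIT 1000;
SQL

print $pending_sql;
```

The point is that the expensive join work is paid once, at write time, instead of on every crawler poll.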
Re: [Boston.pm] Passing large complex data structures between processes
Thanks for all the feedback. I left out a lot of details about the system because I didn't want to complicate things.

The purpose of the system is to comprehensively study online media. We need the system to run 24 hours a day to download news articles from media sources such as the New York Times. We don't need to download articles instantly, but we do need to make sure we don't miss articles and that we're able to download articles before they're taken down.

We're using PostgreSQL 8.4 and running on Ubuntu. Almost all data is stored in the database. The system contains a list of media sources with associated RSS feeds. We have a downloads table that has all of the URLs that we want to download or have downloaded in the past. This table currently has ~200 million rows. We add downloads for the RSS feed of each source to the downloads table a few times a day. When these RSS feeds are downloaded, they are analyzed for new stories, and we add download entries for each new story that we encounter. We have an indexed enum field in the database that lets us know if a download in the downloads table has already been downloaded, needs to be downloaded, is queued in memory, or is currently being downloaded. We have ~1.5 million URLs that need to be downloaded at any one time.

We have an engine process that forks off a bunch of fetcher processes. The engine process queries the database for downloads and stores them in an in-memory queue. The engine goes through its in-memory queue and pipes the primary keys of downloads to fetcher processes that download and process them. When the in-memory queue drops below a certain point, the engine queries the database for more downloads.

This system worked fine for many years. However, we've recently increased the number of media sources and are running into problems. The problem is that the engine is a single process, and while it's querying the database, no downloads can be passed to fetchers.
We spent about a day trying to optimize the database queries, but they're still slower than we would like. The exact reasons why the db queries are slow are complicated. Part of the problem is that some of the logic that prevents us from excessively downloading from any one site is encapsulated in SQL queries.

My hope is to split the engine process into two pieces that run in parallel: one to query the database and another to send downloads to fetchers. This way it won't matter how long the db query takes, as long as we can get URLs from the database faster than we can download them.

There's a case to be made for switching this system to a more distributed architecture. However, it's a large code base and we're very tied to PostgreSQL. The code that consumes the downloads also uses PostgreSQL. We've had some performance issues aggregating the data, for which we've considered NoSQL solutions. But this would be a major rewrite, and a version of the crawler needs to be available constantly.

On Wed, Apr 3, 2013 at 5:43 PM, Tom Metro tmetro+boston...@gmail.com wrote:

David Larochelle wrote: Currently, the driver process periodically queries a database to get a list of URLs to [crawl]. It then stores these url's to be downloaded in a complex in memory [structure?] and pipes them to separate processes that do the actual downloading. The problem is that the database queries are slow and block the driver process.

Your description leaves out a lot of details about what sort of data is being passed back and forth between the different processing stages. So the first stage is querying a database to get a list of URLs. You then say it stores the URL in a complex memory structure? Why? Why isn't the input to the next stage simply a URL? If you were running a collection of crawler processes, and each had its own code to retrieve a URL from the database, it wouldn't matter if it blocked.
Is there enough meta-data to allow independent crawler processes to each pick a URL without overlapping with the other processes?

Another possibility is to create some middleware: a server that queries the database, builds a queue in memory, then accepts connections from the crawler processes and hands out a URL from the queue to each.

Without knowing what sort of data you are exchanging, and how frequently, I can't say whether in-memory IPC and threads are good/necessary solutions, or if you'd be better off just running a bunch of independent processes.

It's hard to say from what you've described so far, but this is sounding like a map-reduce problem. If you followed that algorithm, the first stage builds the list of URLs to crawl. The second stage spawns a pile of children to crawl the URLs individually or in batches, and produces intermediary results. The final stage aggregates the results. (There are some existing modules and tools you could use to implement this. Hadoop, for example. http://en.wikipedia.org/wiki/Map_reduce)
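Tom's map-reduce outline can be sketched with core Perl's fork and pipes. The URL list and the fetch_url body below are placeholders (a real crawler would use an HTTP client and real URLs); the shape of the map and reduce steps is the point:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy map-reduce: split the URL list into batches, fork a child per
# batch, and aggregate each child's results in the parent.  The
# fetch_url() body is a stand-in for a real HTTP fetch.
sub fetch_url {
    my ($url) = @_;
    return length $url;    # pretend the "result" is a fetched byte count
}

my @urls    = map { "http://example.com/story/$_" } 1 .. 10;
my $workers = 3;

# Deal URLs round-robin into one batch per worker.
my @batches;
push @{ $batches[ $_ % $workers ] }, $urls[$_] for 0 .. $#urls;

my @readers;
for my $batch (@batches) {
    pipe my $r, my $w or die "pipe: $!";
    my $pid = fork;
    die "fork: $!" unless defined $pid;
    if ( $pid == 0 ) {    # child: the map step
        close $r;
        printf {$w} "%s\t%d\n", $_, fetch_url($_) for @$batch;
        close $w;
        exit 0;
    }
    close $w;             # parent keeps the read end
    push @readers, $r;
}

# Reduce step: the parent aggregates the intermediate results.
my $total = 0;
for my $r (@readers) {
    while (<$r>) {
        my ( $url, $bytes ) = split /\t/;
        $total += $bytes;
    }
    close $r;
}
wait for 1 .. $workers;
print "fetched $total bytes from ", scalar @urls, " urls\n";
```

For real workloads the same structure holds; only fetch_url and the aggregation change.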
Re: [Boston.pm] Passing large complex data structures between processes
This sounds like a perfect fit for a queuing service like RabbitMQ. Logstash uses Redis lists for this, as it's simple to set up and pretty reliable, but there are many such applications available. The queues would allow multiple backend processes to check for and take items as they became available. There are Perl modules available for communicating with Redis (if you don't want to learn STOMP)... very easy to implement.

Tony C

On Thu, Apr 4, 2013 at 4:21 PM, David Larochelle da...@larochelle.name wrote: [...]
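Tony's Redis-list approach might look like the sketch below. It assumes the CPAN `Redis` module and a server on localhost -- both assumptions, not part of the original posts -- and falls back to a tiny in-memory stand-in with the same two calls so the shape is visible even without Redis installed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of a Redis list as the download queue: the engine RPUSHes
# download ids, fetchers LPOP them.  InMemoryQueue is a minimal mock
# used only when no Redis module/server is available.
package InMemoryQueue;
sub new   { bless { queues => {} }, shift }
sub rpush { my ( $s, $key, $v ) = @_; push @{ $s->{queues}{$key} }, $v }
sub lpop  { my ( $s, $key ) = @_; shift @{ $s->{queues}{$key} } }

package main;

my $conn = eval { require Redis; Redis->new( server => '127.0.0.1:6379' ) }
    || InMemoryQueue->new;

# Engine side: push the primary keys of pending downloads.
$conn->rpush( 'downloads:pending', $_ ) for 101, 102, 103;

# Fetcher side: pop ids until the queue is empty.  Multiple fetcher
# processes could do this concurrently; the list pop is atomic.
my @got;
while ( defined( my $id = $conn->lpop('downloads:pending') ) ) {
    push @got, $id;    # a real fetcher would download and process here
}
print "fetched: @got\n";
```

A production fetcher would more likely block on BLPOP rather than poll, but the queue discipline is the same.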
Re: [Boston.pm] Passing large complex data structures between processes
On Thu, Apr 04, 2013 at 04:21:54PM -0400, David Larochelle wrote: My hope is to split the engine process into two pieces that run in parallel: one to query the database and another to send downloads to fetchers. This way it won't matter how long the db query takes as long as we can get URLs from the database faster than we can download them.

If this is, indeed, the bottleneck, then I would think that splitting them into two communicating processes would solve the problem. Using files, as I mentioned in my previous email, is probably considered old school but is probably the simplest communication method. Unfortunately, it won't work once your system scales to multiple machines. As described, it sounds like the current system runs on a single machine, so this may not be an immediate problem. If you are planning to scale across machines, then I'd second the recommendation to use a message queue instead.

-Gyepi
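The two-process split Gyepi describes can be sketched with core Perl's pipe and fork. Here the slow database query is simulated with a subsecond sleep and the ids are made up; the point is that the dispatching half consumes ids as they arrive instead of blocking for the whole query:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;    # for autoflush on the pipe's write end

# Minimal sketch of the engine split: a "query" child feeds download
# ids down a pipe while the parent dispatches them to fetchers.
pipe my $reader, my $writer or die "pipe: $!";

my $pid = fork;
die "fork: $!" unless defined $pid;

if ( $pid == 0 ) {    # child: the database-facing half
    close $reader;
    $writer->autoflush(1);
    for my $batch ( 1 .. 3 ) {
        select undef, undef, undef, 0.1;    # stand-in for a slow query
        print {$writer} "$_\n" for $batch * 10 .. $batch * 10 + 2;
    }
    close $writer;
    exit 0;
}

close $writer;        # parent: the dispatching half
my @dispatched;
while ( my $id = <$reader> ) {
    chomp $id;
    push @dispatched, $id;    # a real engine would hand $id to a fetcher
}
close $reader;
waitpid $pid, 0;
print "dispatched ", scalar @dispatched, " ids\n";
```

Because each half runs in its own process, a query that takes seconds no longer stalls dispatch -- exactly the decoupling David is after.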
Re: [Boston.pm] Passing large complex data structures between processes
David Larochelle wrote: [...] We're using PostgreSQL 8.4 and running on Ubuntu. Almost all data is stored in the database. [...] We have a downloads table that has all of the URLs that we want to download or have downloaded in the past. This table currently has ~200 million rows. [...] We have ~1.5 million URLs that need to be downloaded at any one time.

This does sound like it will not scale. Which is only to say what you have said. Keeping all that data in one table means you will have to lock it for all your operations, preventing parallelism. Keeping all that data in one table means it will always get larger and slower. Keeping all that data in one table means conflicting optimization goals, and thus little optimization.

I would suggest breaking your data up into a number of tables, such that the purpose of each would be more focused and amenable to optimizing for reading or writing -- for example, you could have one table of needs, one of queued, one of downloading, and one of downloaded, moving the data along from table to table in batches. Thus, at any given moment, your needs table would only contain 1.5 million rows, rather than 200 million -- it will scale with your current workload rather than having to scale with all the work you've ever done.

One could suggest having separate queue tables set up for each of your downloading systems.
Thus your main "find work to do" query, which has the logic in it to avoid having too many URLs for a single target, would query the 1.5 million row needs table and move rows into the queue table associated with a downloader -- the downloader would simply need to perform a trivial select against those few-hundred/few-thousand rows, nigh-instantly getting its fresh load of work. As data is downloaded, each downloader could move rows from its own queue table to the ~200 million row downloaded table.

Since every downloader would have its own table, they would not conflict on locks there. The write locks to insert into the downloaded table would not conflict with any read locks, as you wouldn't read from that table to find work to do. Indeed, you should avoid having any read-friendly indexes on that table -- by having it be non-indexed, inserting data into it will be as fast as possible.

All of these steps could be made even more efficient by using batched queries and appropriate units of work -- for example, if a queue table held a hundred URLs, the downloading system could refrain from moving them into the downloaded table until they were all complete -- thus it would only need one insert/delete pair to move those hundred records in one transaction -- and it would not have to actually send any URLs over the network back to the database.

A further efficiency could be gained by allotting two queue tables to each downloader, such that at any given moment, the downloader was working to process one of them, while the queue-master was finding work to fill the other.

If you already have a significant investment in logic within this database, you could leverage that by using stored procedures and cursors to fill queue tables and switch queue tables while processing the flow of results from your "find work to do" logic -- cursors will generally get you access to the results as soon as possible, before the query is complete.
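The batched move described above could look roughly like this in SQL (held in a Perl string purely for illustration; the table names, the state column, and the per-downloader naming scheme are all invented, not from the original posts):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One transaction moves a completed batch from a per-downloader queue
# table into the big downloaded table -- a single insert/delete pair
# instead of sending URLs back to the database row by row.
my $move_batch_sql = <<'SQL';
BEGIN;
INSERT INTO downloaded SELECT * FROM queue_downloader_1 WHERE state = 'done';
DELETE FROM queue_downloader_1 WHERE state = 'done';
COMMIT;
SQL

print $move_batch_sql;
```

Since the rows never leave the database, the move costs two statements regardless of batch size.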
By using separate tables, you could also migrate to such a system without impacting the current one, at the slight cost of some brief redundancy or the use of views -- depends on whether you would rather consume time or space -- but essentially your database would work both ways simultaneously for the crossover.

You could also split your downloaded table up into multiple per-month/per-downloader/per-something tables -- if you have per-downloader tables here, that would avoid write-lock conflicts entirely -- and at any point where you want to select data out, a union across all tables, or periodic movement of the data into a single master table, would suffice.

Fundamentally, you want to put the data into tables which have indexes (or not) and access patterns which will be efficient for the role of the data at that point in time, and not simply leave it all in one pile of conflicting indexes and locks.

[...] Various other comments have mentioned ways to distribute the workload once you get it out of the database. In short, that would be a mistake. What that is basically suggesting is that you have a separate table of your active work, but that