Mnesia is a library rather than a DB server in itself. It also seems to be better suited to storing smaller amounts of config/state data. I can definitely recommend nginx.
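A rough Python sketch of the broker/storage split Seb describes further down the thread, with the small routing data kept apart from the large blobs. Everything in it is an assumption made for illustration: the StorageNode and Broker classes, the put/get/replicate_hot methods and the threshold are hypothetical names, the round-robin choice between holders is my own simplification, and plain in-memory dicts stand in for Mnesia and for whatever would run on the storage nodes. It only shows the routing and hot-object replication idea and is not a client for any of those systems.

```python
class StorageNode:
    """Hypothetical stand-in for a storage node holding large blobs."""

    def __init__(self, name):
        self.name = name
        self.blobs = {}  # blob key -> bytes


class Broker:
    """Hypothetical stand-in for a broker node holding the routing table."""

    def __init__(self, nodes):
        self.nodes = {node.name: node for node in nodes}
        self.routes = {}  # blob key -> list of node names holding a copy
        self.hits = {}    # blob key -> access counter

    def put(self, key, data, node_name):
        # Store the blob on one node and record where it lives.
        self.nodes[node_name].blobs[key] = data
        holders = self.routes.setdefault(key, [])
        if node_name not in holders:
            holders.append(node_name)

    def get(self, key):
        # Count the access, then pass the blob through from one holder.
        # Round-robin over holders is an assumption, not from the thread.
        holders = self.routes[key]
        self.hits[key] = self.hits.get(key, 0) + 1
        node = self.nodes[holders[self.hits[key] % len(holders)]]
        return node.blobs[key]

    def replicate_hot(self, threshold):
        # The "async job": copy often-requested blobs to one more node.
        for key, count in self.hits.items():
            holders = self.routes[key]
            spare = [name for name in self.nodes if name not in holders]
            if count >= threshold and spare:
                data = self.nodes[holders[0]].blobs[key]
                self.put(key, data, spare[0])


# Usage: one tile stored on node "a", requested three times, then replicated.
a, b = StorageNode("a"), StorageNode("b")
broker = Broker([a, b])
broker.put("tile/1", b"PNG bytes", "a")
for _ in range(3):
    broker.get("tile/1")
broker.replicate_hot(threshold=3)
```

After the last call the routing table lists both nodes for "tile/1", so later requests can be spread across them.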
Of course I forgot to mention Amazon S3/SimpleDB, which are Amazon's equivalents of Google's storage APIs. Might be worth looking at those.

Russ

> -----Original Message-----
> From: [email protected] [mailto:developers-[email protected]] On Behalf Of Francis Irving
> Sent: 17 August 2009 12:00
> To: [email protected]; mySociety public, general purpose discussion list
> Subject: Re: [mySociety:public] Distributed data storage and queueing for Mapumental
>
> Thanks - have added Mnesia to my list of things to check. And nginx
> does sound so much better than pound or haproxy - both of which I've
> tried to use (with little success when under load) in the past.
>
> I would love to be able to use something like Google App Engine.
> I think this manually configured virtual machines stage we're
> all currently at is temporary - in the future our apps won't have a
> clue what they're running on, they'll use an API like Google App
> Engine.
>
> However, I don't think we can use it for Mapumental. We use
> GDAL (http://gdal.org/) as a C library for rendering the tiles,
> and our own C++ code for public transport route finding (see
> https://secure.mysociety.org/cvstrac/rlog?f=mysociety/iso/bin/fastplan-coopt.cpp)
>
> Neither can be run on Google App Engine.
>
> Francis
>
> On Mon, Aug 17, 2009 at 09:44:43AM +0100, Seb Bacon wrote:
> > Hi Francis,
> >
> > I was talking with someone at work about Mnesia, which sounds like
> > it's worth considering. It is distributed among N nodes, so it's good
> > for problems that require good cache locality, i.e. do a lot with the
> > data (because all data is on every node and replicates everywhere
> > quickly). For some types of data sets that breaks down quite soon of
> > course (you pretty much want to only have a dataset up to RAM size,
> > e.g. up to 64 GB). Mnesia takes care of replicating changes all
> > around, of failed nodes, netsplits and syncing back from them etc.
> >
> > I don't know much about MongoDB or CouchDB. Maybe you have to manage
> > syncing yourself on the application layer, but they probably scale
> > much further (depending on what you do in your application). But you
> > could also have smaller clusters of Mnesia nodes, with application
> > code replicating between them and multiplying the presence across
> > the clusters of buckets that are requested often, or something like
> > that. Another, global Mnesia would hold the routing information
> > (which bucket is where).
> >
> > So a combination might also make sense: Mnesia for the routing
> > information on broker nodes, and CouchDB or Memcached or MongoDB on
> > the storage nodes with the large blobs of tile and other precomputed
> > data. So your application servers would pick a broker node at random,
> > ask it where some blob is and pass through the blob from the storage
> > node to the client. The brokers could also increment per-object
> > access counters and run some async jobs to have frequently accessed
> > objects copied to more storage nodes etc.
> >
> > Instead of NFS for distributing tiles, you could consider a web
> > service running off an httpd server like nginx.
> >
> > Another possibility for the entire infrastructure is Google App
> > Engine, which utilises BigTable for fast, distributed data indexing
> > and querying, and serves apps from a Python or Java runtime. There is
> > a queue API, a memcached API, a simple image manipulation API, and a
> > very good pricing model, which works out considerably cheaper than
> > AWS for all models I've considered; for example, CPU time is
> > theoretically billed at the same rate in AWS and GAE, but in GAE you
> > just pay for real CPU time, compared with AWS where you pay for
> > instance uptime.
> > Of course, the price you pay for the cheapness and free scaling in
> > GAE is lack of control, lack of customer service, and no choice of
> > where the data is stored (but I don't think Mapumental has data
> > privacy concerns...?). The flip side of the lack of control is that
> > the complexity is constrained. Personally I'm impressed by GAE and
> > will be continuing to use it on new projects where I can, but I've
> > not used it on a massively resource-intensive job yet. The only part
> > of a GAE app that isn't easily portable to a new architecture is the
> > datastore access, which can be abstracted away easily enough, so you
> > could always choose to migrate from GAE to AWS at a later date.
> >
> > Seb
> >
> > 2009/8/14 Francis Irving <[email protected]>:
> > > Mapumental is a website which shows contour maps of public
> > > transport travel times, house prices and other data. It's in
> > > closed beta.
> > >
> > > http://mapumental.channel4.com/
> > >
> > > It uses lots of CPU running the transport route finding for each
> > > postcode, and rendering the tiles as they are served.
> > >
> > > Before we can openly release it, we need to make it scale easily
> > > (say, on Amazon Web Services).
> > >
> > > Currently it is using
> > > * A PostgreSQL database to store the points behind the static
> > >   datasets such as scenicness and house prices.
> > > * Binary files on NFS to store the generated datasets of travel
> > >   times. PostgreSQL was too slow, and used too much memory, to load
> > >   in the large number of rows that would be required (300,000 for
> > >   each user-entered postcode).
> > > * A rendered tile cache, containing PNG files on the NFS
> > >   filesystem.
> > > * PostgreSQL for queueing the jobs for the transport route finder.
> > >
> > > We now want to:
> > > * make the site scale easily (on Amazon Web Services),
> > > * make it easy to add more data sets.
> > >
> > > We had problems with NFS, so I need something to replace the binary
> > > files in NFS and the tile cache. It might also be prudent to use
> > > something easier to scale than a PostgreSQL database, although I
> > > suspect the load on it would be low so perhaps it isn't a problem.
> > >
> > > So the new version of Mapumental that I'm currently planning has to
> > > store:
> > > a) cache of tiles rendered (some generated fairly rarely
> > >    and frequently accessed e.g. house prices, some not accessed
> > >    often compared to generation times, e.g. public transport route)
> > > b) coordinates and values of arbitrary point datasets (e.g.
> > >    school quality, asthma air quality, wind speed, route by
> > >    car to a particular postcode etc. etc.)
> > >
> > > I'm looking for good, open source, alternatives to NFS and
> > > PostgreSQL to do this. Distributed data stores and queueing
> > > systems.
> > >
> > > What should I look at? What can I trust?
> > >
> > > I've already surveyed the field, and have my own ideas about what
> > > to do, but would be interested if anyone here has some experience
> > > or views on any of the obvious technologies.
> > >
> > > I'd like it to be stable and mature, and realistically it would
> > > already be in a Debian package.
> > >
> > > Francis
> > >
> > > _______________________________________________
> > > Mailing list [email protected]
> > > Archive, settings, or unsubscribe:
> > > https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public
> >
> > --
> > skype: seb.bacon
> > mobile: 07790 939224
