Re: [mySociety:public] Distributed data storage and queueing for Mapumental

Russ Garrett Mon, 17 Aug 2009 04:42:40 -0700

Mnesia is a library and not a DB server in itself. It also seems to be more
optimised for storing smaller amounts of config/state data. I can definitely
recommend nginx.


Of course I forgot to mention Amazon S3/SimpleDB, which are Amazon's
equivalents of Google's storage APIs. They might be worth looking at.

Russ

> -----Original Message-----
> From: [email protected] [mailto:developers-
> [email protected]] On Behalf Of Francis Irving
> Sent: 17 August 2009 12:00
> To: [email protected]; mySociety public, general purpose discussion
> list
> Subject: Re: [mySociety:public] Distributed data storage and queueing
> for Mapumental
> 
> Thanks - have added Mnesia to my list of things to check. And nginx
> does sound so much better than pound or haproxy - both of which I've
> tried to use (with little success when under load) in the past.
> 
> I would love to be able to use something like Google App Engine.
> I think this manually configured virtual machines stage we're
> all currently at is temporary - in the future our apps won't have a
> clue what they're running on, they'll use an API like Google App
> Engine.
> 
> However, I don't think we can use it for Mapumental. We use
> GDAL (http://gdal.org/) as a C library for rendering the tiles,
> and our own C++ code for public transport route finding (see
> https://secure.mysociety.org/cvstrac/rlog?f=mysociety/iso/bin/fastplan-coopt.cpp)
> 
> Neither can be run on Google App Engine.
> 
> Francis
> 
> On Mon, Aug 17, 2009 at 09:44:43AM +0100, Seb Bacon wrote:
> > Hi Francis,
> >
> > I was talking with someone at work about Mnesia, which sounds like
> > it's worth considering. It is distributed among N nodes, so it's good
> > for problems that require good cache locality, i.e. do a lot with the
> > data (because all data is on every node and replicates everywhere
> > quickly). For some types of data sets that breaks down quite soon, of
> > course (you pretty much want the dataset to be no larger than RAM,
> > e.g. up to 64 GB). Mnesia takes care of replicating changes all
> > around, of failed nodes, netsplits and syncing back from them
> > etc.
> >
> > I don't know much about MongoDB or CouchDB. Maybe you have to manage
> > syncing yourself on the application layer, but they probably scale
> > much further (depending on what you do in your application). But you
> > could also have smaller clusters of Mnesia nodes, with application
> > code replicating between them and placing frequently requested
> > buckets on more of the clusters, or something like that. Another,
> > global Mnesia would hold the routing information (which bucket is
> > where).
> >
> > So a combination might also make sense: Mnesia for the routing
> > information on broker nodes, and CouchDB or Memcached or MongoDB on
> > the storage nodes with the large blobs of tile and other precomputed
> > data. So your application servers would pick a broker node at random,
> > ask it where some blob is and pass through the blob from the storage
> > node to the client. The brokers could also increment per-object
> > access counters and run some async jobs to have frequently accessed
> > objects copied to more storage nodes etc.
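
For illustration, the broker scheme described above could be sketched in
Python roughly as follows. This is only a minimal in-memory sketch: the
names RoutingTable, Broker, fetch and replicate_hot are hypothetical, plain
dicts stand in for both the Mnesia routing data and the storage nodes, and
the "async job" is just a function you would call periodically.

```python
import random
from collections import defaultdict


class RoutingTable:
    """Stand-in for the replicated routing data shared by all brokers."""

    def __init__(self):
        self.locations = defaultdict(set)  # blob key -> storage node names
        self.hits = defaultdict(int)       # blob key -> access count


class Broker:
    """Answers 'which storage node has this blob?' and counts accesses."""

    def __init__(self, table):
        self.table = table

    def locate(self, key):
        nodes = self.table.locations.get(key)
        if not nodes:
            raise KeyError(key)
        self.table.hits[key] += 1
        return random.choice(sorted(nodes))


def fetch(brokers, storage, key):
    """Pick a broker at random, ask it for a node, pass the blob through."""
    broker = random.choice(brokers)
    node = broker.locate(key)
    return storage[node][key]


def replicate_hot(table, storage, threshold):
    """Stand-in for the async job: copy often-requested blobs to every node."""
    for key, count in list(table.hits.items()):
        if count < threshold:
            continue
        source = next(iter(table.locations[key]))
        blob = storage[source][key]
        for name, blobs in storage.items():
            if key not in blobs:
                blobs[key] = blob
                table.locations[key].add(name)
```

Here storage is assumed to be a dict mapping a node name to that node's
dict of blobs. A real deployment would replace the dicts with network calls
and would probably copy hot blobs to a few extra nodes rather than to all
of them.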
> >
> > Instead of NFS for distributing tiles, you could consider a web
> > service running off an httpd server like nginx.
> >
> > Another possibility for the entire infrastructure is Google App
> > Engine, which utilises BigTable for fast, distributed data indexing
> > and querying, and serves apps from a Python or Java runtime. There is
> > a queue API, a memcached API, a simple image manipulation API, and a
> > very good pricing model, which works out considerably cheaper than
> > AWS for all models I've considered; for example, CPU time is
> > theoretically billed at the same rate in AWS and GAE, but in GAE you
> > just pay for real CPU time, compared with AWS where you pay for
> > instance uptime. Of course, the price you pay for the cheapness and
> > free scaling in GAE is lack of control, lack of customer service,
> > and no choice of where the data is stored (but I don't think
> > Mapumental has data privacy concerns...?). The flip side to the lack
> > of control is that the complexity is constrained. Personally I'm
> > impressed by GAE and will be continuing to use it on new projects
> > where I can, but I've not used it on a massively resource-intensive
> > job yet. The only part of a GAE app that isn't easily portable to a
> > new architecture is the datastore access, which can be abstracted
> > away easily enough, so you could always choose to migrate from GAE
> > to AWS at a later date.
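
To make the point about abstracting datastore access concrete, here is a
minimal Python sketch. The names BlobStore, InMemoryStore and cached_tile
are hypothetical, and only an in-memory backend is shown; a GAE or AWS
backend would be another subclass with the same two methods, which I have
left out rather than guess at the real API calls.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """The only storage interface the application code is allowed to use."""

    @abstractmethod
    def get(self, key):
        """Return the stored value, or None if the key is absent."""

    @abstractmethod
    def put(self, key, value):
        """Store value under key, replacing any previous value."""


class InMemoryStore(BlobStore):
    """Trivial backend, useful for tests and as a template for real ones."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


def cached_tile(store, key, render):
    """Return the tile for key, calling render() only on a cache miss."""
    tile = store.get(key)
    if tile is None:
        tile = render()
        store.put(key, tile)
    return tile
```

Because cached_tile only ever sees the BlobStore interface, moving to a
different platform would mean writing one new subclass rather than touching
the application code.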
> >
> > Seb
> >
> > 2009/8/14 Francis Irving <[email protected]>:
> > > Mapumental is a website which shows contour maps of public
> > > transport travel times, house prices and other data. It's in
> > > closed beta.
> > >
> > > http://mapumental.channel4.com/
> > >
> > > It uses lots of CPU running the transport route finding for each
> > > postcode, and rendering the tiles as they are served.
> > >
> > > Before we can openly release it, we need to make it scale easily
> > > (say, on Amazon Web Services).
> > >
> > > Currently it is using
> > > * A PostgreSQL database to store the points behind the static
> > > datasets such as scenicness and house prices.
> > > * Binary files on NFS to store the generated datasets of travel
> > > times. PostgreSQL was too slow, and used too much memory, to load
> > > in the large number of rows that would be required (300,000 for
> > > each user-entered postcode).
> > > * A rendered tile cache, containing PNG files on the NFS
> > > filesystem.
> > > * PostgreSQL for queueing the jobs for the transport route finder.
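
As an aside on why flat binary files cope with 300,000 values per postcode
where database rows struggle, here is a small Python sketch. The layout is
purely an assumption for illustration (one unsigned 16-bit travel time in
minutes per point, in a fixed point order, native byte order); the actual
Mapumental file format is not described in this thread, and write_times and
read_times are hypothetical names.

```python
from array import array


def write_times(path, minutes):
    """Write travel times as a flat array of unsigned 16-bit integers."""
    data = array("H", minutes)
    with open(path, "wb") as f:
        data.tofile(f)


def read_times(path, count):
    """Read back exactly count travel times written by write_times."""
    data = array("H")
    with open(path, "rb") as f:
        data.fromfile(f, count)  # raises EOFError if the file is too short
    return data
```

With two bytes per value, 300,000 points come to roughly 600 KB per
postcode, and reading them is a single sequential read with no per-row
overhead.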
> > >
> > > We now want to:
> > > * make the site scale easily (on Amazon Web Services),
> > > * make it easy to add more data sets.
> > > We had problems with NFS, so I need something to replace the binary
> > > files in NFS and the tile cache. It might also be prudent to use
> > > something easier to scale than a PostgreSQL database, although I
> > > suspect the load on it would be low so perhaps it isn't a problem.
> > >
> > > So the new version of Mapumental that I'm currently planning has
> > > to store:
> > >    a) cache of tiles rendered (some generated fairly rarely
> > >    and frequently accessed e.g. house prices, some not accessed
> > >    often compared to generation times, e.g. public transport route)
> > >    b) coordinates and values of arbitrary point datasets (e.g.
> > >    school quality, asthma air quality, wind speed, route by
> > >    car to a particular postcode etc. etc.)
> > >
> > > I'm looking for good, open source alternatives to NFS and
> > > PostgreSQL to do this. Distributed data stores and queueing
> > > systems.
> > >
> > > What should I look at? What can I trust?
> > >
> > > I've already surveyed the field, and have my own ideas about what
> > > to do, but would be interested if anyone here has some experience
> > > or views on any of the obvious technologies.
> > >
> > > I'd like it to be stable and mature, and realistically it would
> > > already be in a Debian package.
> > >
> > > Francis
> > >
> > > _______________________________________________
> > > Mailing list [email protected]
> > > Archive, settings, or unsubscribe:
> > > https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public
> > >
> >
> >
> >
> > --
> > skype: seb.bacon
> > mobile: 07790 939224
> >
> >
> 



