Re: [mySociety:public] Distributed data storage and queueing for Mapumental

Jamie Wed, 19 Aug 2009 07:36:43 -0700

This may be a bit new for your needs, but to the untrained eye it looks
interesting:


http://fluidinfo.com/fluiddb

2009/8/17 Seb Bacon <[email protected]>

> 2009/8/17 Francis Irving <[email protected]>:
> > However, I don't think we can use it for Mapumental. We use
> > GDAL (http://gdal.org/) as a C library for rendering the tiles,
> > and our own C++ code for public transport route finding (see
> > https://secure.mysociety.org/cvstrac/rlog?f=mysociety/iso/bin/fastplan-coopt.cpp).
> >
> > Neither can be run on Google App Engine.
>
> I suppose it wouldn't make sense to expose them as a web service in a
> different infrastructure..?
>
> Seb
>
>
> > On Mon, Aug 17, 2009 at 09:44:43AM +0100, Seb Bacon wrote:
> >> Hi Francis,
> >>
> >> I was talking with someone at work about Mnesia, which sounds like
> >> it's worth considering. It is distributed among N nodes, so it's good
> >> for problems that require good cache locality, i.e. do a lot with the
> >> data (because all data is on every node and replicates everywhere
> >> quickly). For some types of data set that breaks down quite soon, of
> >> course (you pretty much want the dataset to fit in RAM, e.g. up to
> >> 64 GB). Mnesia takes care of replicating changes everywhere, and of
> >> failed nodes, netsplits, syncing back after them,
> >> etc.
> >>
> >> I don't know much about MongoDB or CouchDB. Maybe you have to manage
> >> syncing yourself on the application layer, but they probably scale
> >> much further (depending on what you do in your application). But you
> >> could also have smaller clusters of Mnesia nodes, with application
> >> code replicating between them and placing frequently requested
> >> buckets on more of the clusters, or something like that, plus another
> >> global Mnesia to hold routing information (which bucket is where).
> >>
> >> So a combination might also make sense, Mnesia for the routing
> >> information on broker nodes and CouchDB or Memcached or MongoDB on the
> >> storage nodes with the large blobs of tile and other precomputed data.
> >> So your application servers would pick a broker node at random, ask it
> >> where some blob is and pass the blob through from the storage node to
> >> the client. The brokers could also increment per-object access counters
> >> and run some async jobs to have frequently accessed objects copied to
> >> more storage nodes etc.
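
The broker/storage split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `RoutingTable`, `fetch`, `hot_threshold` and the node names are hypothetical, a single in-memory object stands in for the routing table that Mnesia would replicate across brokers, and a plain function call stands in for the async job.

```python
import random
from collections import Counter


class RoutingTable:
    """Hypothetical stand-in for the replicated routing data on each broker."""

    def __init__(self, storage_nodes, hot_threshold=3):
        self.storage_nodes = list(storage_nodes)
        self.locations = {}        # blob key -> names of storage nodes holding it
        self.hits = Counter()      # per-object access counters
        self.hot_threshold = hot_threshold

    def register(self, key, node):
        self.locations.setdefault(key, []).append(node)

    def locate(self, key):
        """Return one node that holds `key` and count the access."""
        self.hits[key] += 1
        return random.choice(self.locations[key])

    def replicate_hot(self, storage):
        """Copy each frequently accessed blob to one more storage node."""
        copied = []
        for key, count in list(self.hits.items()):
            if count < self.hot_threshold:
                continue
            spare = [n for n in self.storage_nodes
                     if n not in self.locations[key]]
            if spare:
                source = self.locations[key][0]
                storage[spare[0]][key] = storage[source][key]
                self.locations[key].append(spare[0])
                copied.append((key, spare[0]))
            self.hits[key] = 0     # start counting again after a pass
        return copied


def fetch(brokers, storage, key):
    """Application server: pick a broker at random, then read from storage."""
    broker = random.choice(brokers)
    node = broker.locate(key)
    return storage[node][key]
```

Here `storage` is just a dict of dicts (node name to blobs); in a real deployment each entry would be a network call to a storage node, and the counters would need to be shared or merged between brokers.
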
> >>
> >> Instead of NFS for distributing tiles, you could consider a web
> >> service running off an HTTP server such as nginx.
> >>
> >> Another possibility for the entire infrastructure is Google App
> >> Engine, which utilises BigTable for fast, distributed data indexing
> >> and querying, and serves apps from a Python or Java runtime.  There is
> >> a queue API, a memcached API, a simple image manipulation API, and a
> >> very good pricing model, which works out considerably cheaper than AWS
> >> for all models I've considered; for example, CPU time is theoretically
> >> billed at the same rate in AWS and GAE, but in GAE you just pay for
> >> real CPU time, compared with AWS where you pay for instance uptime.
> >> Of course, the price you pay for the cheapness and free scaling in GAE
> >> is lack of control, and lack of customer service, and no choice of
> >> where the data is stored (but I don't think mapumental has data
> >> privacy concerns...?). The flip side to the lack of control is that
> >> the complexity is constrained.  Personally I'm impressed by GAE and
> >> will be continuing to use it on new projects where I can, but I've not
> >> used it on a massively resource-intensive job yet. The only part of a
> >> GAE app that isn't easily portable to a new architecture is the
> >> datastore access, which can be abstracted away easily enough, so you
> >> could always choose to migrate from GAE to AWS at a later date.
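
One way to read "abstracted away easily enough" is to keep all datastore calls behind a small interface, so that only one class changes if the backend does. The sketch below is an assumption about how that might look, not code from any of the platforms mentioned: `BlobStore`, `InMemoryStore` and `cached_tile` are hypothetical names, and the in-memory backend is only there to make the example runnable.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Hypothetical minimal interface the application codes against."""

    @abstractmethod
    def get(self, key):
        """Return the stored value, or None if the key is missing."""

    @abstractmethod
    def put(self, key, value):
        """Store value under key, replacing any earlier value."""


class InMemoryStore(BlobStore):
    """Stand-in backend; a hosted datastore would implement the same methods."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


def cached_tile(store, key, render):
    """Return the tile for `key`, rendering and storing it on a miss."""
    tile = store.get(key)
    if tile is None:
        tile = render()
        store.put(key, tile)
    return tile
```

Application code such as `cached_tile` only sees `get` and `put`, so moving to a different platform would mean writing one more `BlobStore` subclass rather than touching every caller.
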
> >>
> >> Seb
> >>
> >> 2009/8/14 Francis Irving <[email protected]>:
> >> > Mapumental is a website which shows contour maps of public transport
> >> > travel times, house prices and other data. It's in closed beta.
> >> >
> >> > http://mapumental.channel4.com/
> >> >
> >> > It uses lots of CPU running the transport route finding for each
> >> > postcode, and rendering the tiles as they are served.
> >> >
> >> > Before we can openly release it, we need to make it scale easily
> >> > (say, on Amazon Web Services).
> >> >
> >> > Currently it is using
> >> > * A PostgreSQL database to store the points behind the static datasets
> >> > such as scenicness and house prices.
> >> > * Binary files on NFS to store the generated datasets of travel times.
> >> > PostgreSQL was too slow, and used too much memory, to load in the
> >> > large number of rows that would be required (300,000 for each
> >> > user-entered postcode).
> >> > * A rendered tile cache, containing PNG files on the NFS filesystem.
> >> > * PostgreSQL for queueing the jobs for the transport route finder.
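
To illustrate why a flat binary file can be much lighter than 300,000 database rows, here is a minimal sketch of packing one travel time per destination point into a fixed-width blob. The layout is an assumption made for the example (unsigned 16-bit minutes, little-endian, points in a fixed agreed order); the thread does not say what format Mapumental's files actually use, and the function names are hypothetical.

```python
import struct


def pack_travel_times(minutes):
    """Pack a list of travel times (whole minutes, 0..65535) as uint16 values."""
    return struct.pack(f"<{len(minutes)}H", *minutes)


def unpack_travel_times(blob):
    """Inverse of pack_travel_times: bytes back to a list of minutes."""
    count = len(blob) // 2          # two bytes per uint16 value
    return list(struct.unpack(f"<{count}H", blob))
```

Under these assumptions a full dataset for one postcode is 300,000 values of 2 bytes each, i.e. 600,000 bytes, and it can be read back in one call with no per-row overhead.
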
> >> >
> >> > We now want to:
> >> > * make the site scale easily (on Amazon Web Services),
> >> > * make it easy to add more data sets.
> >> > We had problems with NFS, so I need something to replace the binary
> >> > files in NFS and the tile cache. It might also be prudent to use
> >> > something easier to scale than a PostgreSQL database, although I
> >> > suspect the load on it would be low so perhaps it isn't a problem.
> >> >
> >> > So the new version of Mapumental that I'm currently planning has to
> >> > store:
> >> >    a) cache of tiles rendered (some generated fairly rarely
> >> >    and accessed frequently, e.g. house prices; some not accessed
> >> >    often compared to generation times, e.g. public transport route)
> >> >    b) coordinates and values of arbitrary point datasets (e.g.
> >> >    school quality, asthma air quality, wind speed, route by
> >> >    car to a particular postcode etc. etc.)
> >> >
> >> > I'm looking for good, open source alternatives to NFS and PostgreSQL
> >> > to do this. Distributed data stores and queueing systems.
> >> >
> >> > What should I look at? What can I trust?
> >> >
> >> > I've already surveyed the field, and have my own ideas about what to
> >> > do, but would be interested if anyone here has some experience or
> >> > views on any of the obvious technologies.
> >> >
> >> > I'd like it to be stable and mature, and realistically it would
> >> > already be in a Debian package.
> >> >
> >> > Francis
> >> >
> >> > _______________________________________________
> >> > Mailing list [email protected]
> >> > Archive, settings, or unsubscribe:
> >> > https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public
> >> >
> >>
> >>
> >>
> >> --
> >> skype: seb.bacon
> >> mobile: 07790 939224
> >>
> >>
> >
>

