Re: [mySociety:public] Distributed data storage and queueing for Mapumental

Seb Bacon Mon, 17 Aug 2009 05:04:00 -0700

2009/8/17 Francis Irving <[email protected]>:
> However, I don't think we can use it for Mapumental. We use
> GDAL (http://gdal.org/) as a C library for rendering the tiles,
> and our own C++ code for public transport route finding (see
> https://secure.mysociety.org/cvstrac/rlog?f=mysociety/iso/bin/fastplan-coopt.cpp)
>
> Neither can be run on Google App Engine.


I suppose it wouldn't make sense to expose them as a web service in a
different infrastructure..?

Seb


> On Mon, Aug 17, 2009 at 09:44:43AM +0100, Seb Bacon wrote:
>> Hi Francis,
>>
>> I was talking with someone at work about Mnesia, which sounds like
>> it's worth considering. It is distributed among N nodes, so it's good
>> for problems that require good cache locality, i.e. do a lot with the
>> data (because all data is on every node and replicates everywhere
>> quickly). For some types of data set that breaks down quite soon, of
>> course (you pretty much want the dataset to fit in RAM, e.g. up to
>> 64 GB). Mnesia takes care of replicating changes to every node, and of
>> handling failed nodes, netsplits and syncing back after them, etc.
>>
>> I don't know much about MongoDB or CouchDB. Maybe you have to manage
>> syncing yourself on the application layer, but they probably scale
>> much further (depending on what you do in your application). But you
>> could also have smaller clusters of Mnesia nodes, with application
>> code replicating between them and keeping extra copies of frequently
>> requested buckets across the clusters, or something similar. A
>> separate global Mnesia would hold the routing information (which
>> bucket is where).
>>
>> So a combination might also make sense, Mnesia for the routing
>> information on broker nodes and CouchDB or Memcached or MongoDB on the
>> storage nodes with the large blobs of tile and other precomputed data.
>> Your application servers would pick a broker node at random, ask it
>> where some blob is and pass the blob through from the storage node to
>> the client. The brokers could also increment per-object access
>> counters and run some async jobs to have frequently accessed objects
>> copied to more storage nodes, etc.
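
A minimal sketch of the broker idea above, in Python, just to make the
moving parts concrete. Everything here is an assumption for illustration:
the names (RoutingTable, Broker, store, fetch, replicate_hot) are
hypothetical, a plain in-memory dict stands in for the replicated routing
data that Mnesia would hold, and dicts stand in for the storage nodes.

```python
import random
from collections import Counter


class RoutingTable:
    """Stand-in for the replicated routing data (Mnesia in the text above)."""

    def __init__(self):
        self.locations = {}    # blob key -> set of storage node names
        self.hits = Counter()  # blob key -> number of successful lookups


class Broker:
    """Hypothetical broker node; every broker reads the same routing table."""

    def __init__(self, table):
        self.table = table

    def locate(self, key):
        nodes = self.table.locations.get(key)
        if not nodes:
            return None
        self.table.hits[key] += 1  # per-object access counter
        return random.choice(sorted(nodes))


def store(table, storage, key, blob, node):
    """Put a blob on one storage node and record where it lives."""
    storage[node][key] = blob
    table.locations.setdefault(key, set()).add(node)


def fetch(brokers, storage, key):
    """What an application server does: random broker, lookup, pass through."""
    node = random.choice(brokers).locate(key)
    return None if node is None else storage[node][key]


def replicate_hot(table, storage, threshold):
    """Stand-in for the async job: copy hot blobs to one more storage node."""
    for key, count in list(table.hits.items()):
        if count < threshold:
            continue
        have = table.locations[key]
        spare = [n for n in sorted(storage) if n not in have]
        if spare:
            store(table, storage, key, storage[min(have)][key], spare[0])
```

Here storage is a dict mapping a node name to that node's {key: blob}
dict. In a real deployment the lookups and copies would be network calls
and the counters would need to be replicated too; the sketch only shows
the control flow.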
>>
>> Instead of NFS for distributing tiles, you could consider a web
>> service running off an httpd server like nginx.
>>
>> Another possibility for the entire infrastructure is Google App
>> Engine, which utilises BigTable for fast, distributed data indexing
>> and querying, and serves apps from a Python or Java runtime. There is
>> a queue API, a memcached API, a simple image manipulation API, and a
>> very good pricing model, which works out considerably cheaper than AWS
>> for all models I've considered; for example, CPU time is theoretically
>> billed at the same rate in AWS and GAE, but in GAE you just pay for
>> real CPU time, compared with AWS where you pay for instance uptime.
>> Of course, the price you pay for the cheapness and free scaling in GAE
>> is lack of control, and lack of customer service, and no choice of
>> where the data is stored (but I don't think mapumental has data
>> privacy concerns...?). The flip side to the lack of control is that
>> the complexity is constrained.  Personally I'm impressed by GAE and
>> will be continuing to use it on new projects where I can, but I've not
>> used it on a massively resource-intensive job yet. The only part of a
>> GAE app that isn't easily portable to a new architecture is the
>> datastore access, which can be abstracted away easily enough, so you
>> could always choose to migrate from GAE to AWS at a later date.
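
To make the billing point above concrete, here is a rough worked example
in Python. The hourly rate, the number of hours and the utilisation
figure are made-up placeholders chosen for illustration, not real AWS or
GAE prices; the only thing taken from the text is that both bill at the
same rate, one for instance uptime and the other for CPU time actually
used.

```python
def uptime_cost(instances, hours_up, rate_per_hour):
    """Pay for every hour an instance is running, busy or idle."""
    return instances * hours_up * rate_per_hour


def cpu_time_cost(cpu_hours_used, rate_per_hour):
    """Pay only for the CPU hours actually consumed."""
    return cpu_hours_used * rate_per_hour


RATE = 0.10          # hypothetical price per hour, same for both models
HOURS = 720          # one instance kept up for a 30-day month
UTILISATION = 0.25   # hypothetical: CPU busy a quarter of the time

uptime_bill = uptime_cost(1, HOURS, RATE)
cpu_bill = cpu_time_cost(HOURS * UTILISATION, RATE)
```

With the rates equal, the ratio of the two bills is just the
utilisation, so the gap narrows as the machines get busier and vanishes
at full load.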
>>
>> Seb
>>
>> 2009/8/14 Francis Irving <[email protected]>:
>> > Mapumental is a website which shows contour maps of public transport
>> > travel times, house prices and other data. It's in closed beta.
>> >
>> > http://mapumental.channel4.com/
>> >
>> > It uses lots of CPU running the transport route finding for each
>> > postcode, and rendering the tiles as they are served.
>> >
>> > Before we can openly release it, we need to make it scale easily
>> > (say, on Amazon Web Services).
>> >
>> > Currently it is using
>> > * A PostgreSQL database to store the points behind the static datasets
>> > such as scenicness and house prices.
>> > * Binary files on NFS to store the generated datasets of travel times.
>> > PostgreSQL was too slow, and used too much memory, to load in the
>> > large number of rows that would be required (300,000 for each
>> > user-entered postcode).
>> > * A rendered tile cache, containing PNG files on the NFS filesystem.
>> > * PostgreSQL for queueing the jobs for the transport route finder.
>> >
>> > We now want to:
>> > * make the site scale easily (on Amazon Web Services),
>> > * make it easy to add more data sets.
>> > We had problems with NFS, so I need something to replace the binary
>> > files in NFS and the tile cache. It might also be prudent to use
>> > something easier to scale than a PostgreSQL database, although I
>> > suspect the load on it would be low so perhaps it isn't a problem.
>> >
>> > So the new version of Mapumental that I'm currently planning has to
>> > store:
>> >    a) cache of tiles rendered (some generated fairly rarely
>> >    and accessed frequently, e.g. house prices; some not accessed
>> >    often compared to generation times, e.g. public transport route)
>> >    b) coordinates and values of arbitrary point datasets (e.g.
>> >    school quality, asthma air quality, wind speed, route by
>> >    car to a particular postcode etc. etc.)
>> >
>> > I'm looking for good open source alternatives to NFS and PostgreSQL
>> > to do this: distributed data stores and queueing systems.
>> >
>> > What should I look at? What can I trust?
>> >
>> > I've already surveyed the field, and have my own ideas about what to
>> > do, but would be interested if anyone here has some experience or
>> > views on any of the obvious technologies.
>> >
>> > I'd like it to be stable and mature, and realistically it would
>> > already be in a Debian package.
>> >
>> > Francis
>> >
>> > _______________________________________________
>> > Mailing list [email protected]
>> > Archive, settings, or unsubscribe:
>> > https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public
>> >
>>
>>
>>
>> --
>> skype: seb.bacon
>> mobile: 07790 939224
>>
>>
>



-- 
skype: seb.bacon
mobile: 07790 939224

