Thanks Brian, this is pretty much what I was looking for.

Your calculations are correct, but they assume we will need all tiles
generated at all zoom levels. Given the sparsity of the data, it
actually results in only a few hundred GBs. I'll then run a second MR
job with the map pushing to S3, to make use of parallel loading.

Cheers,
Tim

On Tue, Apr 14, 2009 at 2:37 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
> Hey Tim,
>
> Why don't you put the PNGs in a SequenceFile in the output of your
> reduce task? You could then have a post-processing step that unpacks
> the PNGs and places them onto S3. (If my numbers are correct, you're
> looking at around 3TB of data; is this right? With that much, you
> might want another separate map task to unpack all the files in
> parallel ... it really depends on the throughput you get to Amazon.)
>
> Brian
>
> On Apr 14, 2009, at 4:35 AM, tim robertson wrote:
>
>> Hi all,
>>
>> I am currently processing a lot of raw CSV data and producing a
>> summary text file which I load into MySQL. On top of this I have a
>> PHP application to generate tiles for Google mapping (sample tile:
>> http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
>> Here is a (dev server) example of the final map client:
>> http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 - the
>> dynamic grids as you zoom are all pre-calculated.
>>
>> I am considering pre-generating all my tiles (PNG) and storing them
>> in S3 with CloudFront, for better throughput, since maps generate
>> huge request volumes. There will be billions of PNGs produced, each
>> 1-3KB.
>>
>> Could someone please recommend the best place to generate the PNGs
>> in an MR system, and when to push them to S3?
>> If I did the PNG generation and upload to S3 in the reduce, the same
>> task running on multiple machines would compete with each other,
>> right? Should I generate the PNGs to a local directory and then, on
>> task success, push the lot up? I am assuming billions of 1-3KB files
>> on HDFS is not a good idea.
>>
>> I will use EC2 for the MR for the time being, but this will be moved
>> to a local cluster, still pushing to S3...
>>
>> Cheers,
>>
>> Tim
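
To make Brian's SequenceFile suggestion concrete, here is a minimal,
untested sketch against the old (0.18-era) org.apache.hadoop.mapred
API. TileReducer and renderTile() are hypothetical names; the driver
would also set SequenceFileOutputFormat as the output format and
BytesWritable as the output value class.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Emits one SequenceFile record per tile:
// key = tile id (e.g. "zoom_x_y_taxonid"), value = the raw PNG bytes.
public class TileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, BytesWritable> {

  public void reduce(Text tileId, Iterator<Text> cells,
                     OutputCollector<Text, BytesWritable> output,
                     Reporter reporter) throws IOException {
    byte[] png = renderTile(cells); // hypothetical renderer
    output.collect(tileId, new BytesWritable(png));
  }

  // Placeholder: rasterize the pre-calculated grid cells into a 1-3KB PNG.
  private byte[] renderTile(Iterator<Text> cells) {
    return new byte[0];
  }
}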
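
And a sketch of the second, map-only job that reads that SequenceFile
back and uploads each PNG from the map tasks in parallel. This
assumes the JetS3t S3 client; the s3.access.key, s3.secret.key and
s3.bucket JobConf properties are made-up names for illustration, and
the driver would set SequenceFileInputFormat and setNumReduceTasks(0).

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.jets3t.service.S3Service;
import org.jets3t.service.impl.rest.httpclient.RestS3Service;
import org.jets3t.service.model.S3Bucket;
import org.jets3t.service.model.S3Object;
import org.jets3t.service.security.AWSCredentials;

// Map-only job: reads (tileId, png) records from the SequenceFile and
// puts each PNG on S3. Each map task gets a disjoint slice of the
// input, so different tasks push different keys.
public class S3UploadMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, NullWritable, NullWritable> {

  private S3Service s3;
  private S3Bucket bucket;

  public void configure(JobConf job) {
    try {
      s3 = new RestS3Service(new AWSCredentials(
          job.get("s3.access.key"), job.get("s3.secret.key")));
      bucket = new S3Bucket(job.get("s3.bucket"));
    } catch (Exception e) {
      throw new RuntimeException("Could not connect to S3", e);
    }
  }

  public void map(Text tileId, BytesWritable png,
                  OutputCollector<NullWritable, NullWritable> output,
                  Reporter reporter) throws IOException {
    try {
      // BytesWritable's backing array is padded, so copy the valid length.
      byte[] data = new byte[png.getLength()];
      System.arraycopy(png.getBytes(), 0, data, 0, png.getLength());

      S3Object tile = new S3Object(tileId.toString() + ".png", data);
      tile.setContentType("image/png");
      s3.putObject(bucket, tile);
      reporter.progress(); // keep slow uploads from looking hung
    } catch (Exception e) {
      throw new IOException("S3 upload failed for " + tileId + ": "
          + e.getMessage());
    }
  }
}

Because the map has a side effect, you would also want speculative
execution off (conf.setSpeculativeExecution(false)) so duplicate
attempts of the same task don't race on the same keys, plus some
retry logic around putObject.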