Hi Ben,

BS> This may be a little bit useless but...I use multi-pass and tiling
BS> techniques to get the planet under my control for my work...I'm
BS> using C but a 32-bit process, so these techniques should work with
BS> Java.

Yep, that's basically what the splitter does too. I'm still considering whether 
an R-tree might also help to speed up area searches, though looking at where 
the bottlenecks are has highlighted the initial XML parsing as the 
worst offender.
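
To give a feel for what the area search involves (a toy sketch with made-up 
names, not the splitter's actual code): each node has to be matched against 
the bounding boxes of the candidate areas, and done naively that's a linear 
scan per node - exactly the step an R-tree over the boxes would replace with 
a log-time lookup.

import java.util.ArrayList;
import java.util.List;

// Toy illustration only (made-up names): the linear scan below is the step
// an R-tree over the area bounding boxes would replace with a log-time search.
class AreaLookup {
    static class Area {
        final int minLat, minLon, maxLat, maxLon;
        Area(int minLat, int minLon, int maxLat, int maxLon) {
            this.minLat = minLat; this.minLon = minLon;
            this.maxLat = maxLat; this.maxLon = maxLon;
        }
        boolean contains(int lat, int lon) {
            return lat >= minLat && lat < maxLat && lon >= minLon && lon < maxLon;
        }
    }

    private final List<Area> areas = new ArrayList<>();

    void add(Area a) { areas.add(a); }

    // Called once per node: O(number of areas) as written, which is where a
    // spatial index starts to pay off with hundreds of areas and hundreds of
    // millions of nodes.
    List<Area> areasContaining(int lat, int lon) {
        List<Area> hits = new ArrayList<>();
        for (Area a : areas) {
            if (a.contains(lat, lon)) hits.add(a);
        }
        return hits;
    }
}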

BS> - When I split the planet, I use a very compact spatial index to
BS> make sure I don't lose pieces (E.g. I maintain bounding boxes) ..
BS> this ensures that I take nodes outside a clipping box if the way
BS> passes through the clipping box.  With bit-packing, the indices
BS> (barely) fit in memory.

Makes sense. The splitter subdivides the planet based on a maximum threshold 
of nodes per area, but rather than trying to keep an accurate bounding box 
for ways/rels, it just takes all nodes that fall within a certain overlap 
of the area in question. mkgmap then clips the ways/rels to the exact 
boundary later. This seems to work very well in practice - I'm not sure 
I've seen anything get lost at all as a result.
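
As a rough sketch of that idea (made-up names and placeholder numbers, not 
the splitter's actual code): keep halving an area along its longer axis until 
the node count drops under the threshold, and when writing nodes out, accept 
anything inside the area grown by a fixed overlap margin.

import java.util.ArrayList;
import java.util.List;

// Rough sketch only: names, threshold and overlap are placeholders, not the
// splitter's real values or classes.
class AreaSplitter {
    static final long MAX_NODES = 1_600_000; // placeholder node threshold
    static final int OVERLAP = 2000;         // placeholder margin in map units

    static class Bounds {
        final int minLat, minLon, maxLat, maxLon;
        Bounds(int minLat, int minLon, int maxLat, int maxLon) {
            this.minLat = minLat; this.minLon = minLon;
            this.maxLat = maxLat; this.maxLon = maxLon;
        }
    }

    interface NodeCounter {
        long count(Bounds b); // how many nodes fall inside b
    }

    // Keep halving an area along its longer axis until each piece holds no
    // more than MAX_NODES.
    static List<Bounds> subdivide(Bounds b, NodeCounter counter) {
        List<Bounds> out = new ArrayList<>();
        if (counter.count(b) <= MAX_NODES) {
            out.add(b);
            return out;
        }
        if (b.maxLat - b.minLat >= b.maxLon - b.minLon) {
            int mid = b.minLat + (b.maxLat - b.minLat) / 2;
            out.addAll(subdivide(new Bounds(b.minLat, b.minLon, mid, b.maxLon), counter));
            out.addAll(subdivide(new Bounds(mid, b.minLon, b.maxLat, b.maxLon), counter));
        } else {
            int mid = b.minLon + (b.maxLon - b.minLon) / 2;
            out.addAll(subdivide(new Bounds(b.minLat, b.minLon, b.maxLat, mid), counter));
            out.addAll(subdivide(new Bounds(b.minLat, mid, b.maxLat, b.maxLon), counter));
        }
        return out;
    }

    // A node is written to an area if it falls inside the bounds grown by
    // OVERLAP, so ways/rels crossing the edge keep the nodes they need;
    // mkgmap clips to the exact boundary later.
    static boolean takes(Bounds b, int lat, int lon) {
        return lat >= b.minLat - OVERLAP && lat <= b.maxLat + OVERLAP
            && lon >= b.minLon - OVERLAP && lon <= b.maxLon + OVERLAP;
    }
}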

BS> - If your data can be aggregated later, tiling would get you out of
BS> jail...tile the planet yourself into a smaller number of chunks that
BS> can
BS> then happily be "tag indexed" entirely in RAM using your existing
BS> code.
BS> Then merge together the final tag counting results, which can
BS> hopefully be merged easily and cheaply.

The problem Lars might have is that some tiles would still contain a large 
number of unique tags. It's the number of unique tags that's the limiting 
factor on memory, not the aggregated counts themselves.

BS> - One last idea: split the tag space and then merge that again.  Do
BS> one planet pass to count up all known tags into a hash table.  Then
BS> split this "planet schema" into several pieces by the maximum number
BS> of tags you want, and run a separate count pass on the planet
BS> reduced by the tags you care about.  Heck - you could even "filter"
BS> the planet down to those tags with something like grep so the planet
BS> XML parse goes by fast in each pass.  Either way it's the same
BS> idea...cut the dataset in a known way and glue together your
BS> counting later.

This is basically what I'd suggested too. Use the first pass over the planet 
file to split the tags into several groups (eg using a hashcode % desired 
group count). Then make further passes over the planet file, once for each 
group of tags, aggregating just that group on each pass. Experience has shown 
me that multiple passes over the XML are *slow* though, even with a good pull 
parser. You're better off preprocessing the data into a more suitable format 
on the first pass while you split the tags; the successive passes will then 
run much faster.
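
Something along these lines, as a minimal sketch with made-up names (on later 
passes the preprocessed data would feed this instead of the raw XML):

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of splitting the tag space by hash group; class and method
// names are made up for illustration.
class TagGroupCounter {
    private final int groupCount;
    private final int group;                 // which group this pass handles
    private final Map<String, Long> counts = new HashMap<>();

    TagGroupCounter(int groupCount, int group) {
        this.groupCount = groupCount;
        this.group = group;
    }

    // Called for every key=value pair seen while streaming the planet data.
    void onTag(String key, String value) {
        String tag = key + '=' + value;
        // Math.floorMod avoids negative buckets from negative hash codes.
        if (Math.floorMod(tag.hashCode(), groupCount) == group) {
            counts.merge(tag, 1L, Long::sum);
        }
    }

    Map<String, Long> counts() {
        return counts;
    }
}

Run one pass per group (0 .. groupCount-1); since the groups are disjoint, 
merging the results at the end is just a union of the per-group maps.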

To give you some numbers as a starting point: processing the full planet 
with the splitter on a Core i7 takes just over a 4GB heap and 3 hours. The 
first pass (calculating the area subdivisions) needs the full 4GB (though I 
haven't fully optimised this yet). During this pass it writes out about 27GB 
of binary data. Two further passes are made over this binary data to write 
out the 500 or so resultant (split) osm files (max 255 at a time). This 
latter stage requires less memory than the first, and if you're willing to 
make additional passes then the memory required here could easily be < 1GB 
without too large an impact on performance.
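
For what it's worth, the binary intermediate format doesn't need to be 
anything clever. This is only an illustrative sketch, not the splitter's 
actual cache format, but fixed-width records written with DataOutputStream 
already read back far faster than re-parsing the planet XML:

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustrative sketch of a flat binary node cache; not the splitter's real format.
class NodeCacheWriter implements AutoCloseable {
    private final DataOutputStream out;

    NodeCacheWriter(String path) throws IOException {
        out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(path)));
    }

    // Fixed-width records: id plus lat/lon in integer map units.
    void writeNode(long id, int lat, int lon) throws IOException {
        out.writeLong(id);
        out.writeInt(lat);
        out.writeInt(lon);
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}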

This image shows (badly!) how the planet gets subdivided: 
http://redyeti.net/osm/3000km.png
Note the high density across the US and Europe.

Chris



