Hi Ben,

BS> This may be a little bit useless but...I use multi-pass and tiling
BS> techniques to get the planet under my control for my work...I'm
BS> using C but a 32-bit process, so these techniques should work with
BS> Java.
Yep, that's basically what the splitter does too. I'm still considering
whether an R-tree might help speed up area searches, though looking at
where the bottlenecks are has highlighted the initial XML parsing as the
worst offender.

BS> - When I split the planet, I use a very compact spatial index to
BS> make sure I don't lose pieces (e.g. I maintain bounding boxes)..
BS> this ensures that I take nodes outside a clipping box if the way
BS> passes through the clipping box. With bit-packing, the indices
BS> (barely) fit in memory.

Makes sense. The splitter subdivides the planet based on a maximum
threshold of nodes per area, but rather than trying to keep an accurate
bounding box for ways/rels, it just takes all nodes that fall within a
certain amount of overlap of the area in question. mkgmap then clips the
ways/rels to the exact boundary later. This seems to work very well in
practice - I'm not sure I've seen anything get lost at all as a result.

BS> - If your data can be aggregated later, tiling would get you out of
BS> jail...tile the planet yourself into a smaller number of chunks that
BS> can then happily be "tag indexed" entirely in RAM using your
BS> existing code. Then merge together the final tag counting results,
BS> which can hopefully be merged easily and cheaply.

The problem Lars might have is that some tiles would still contain a
large number of unique tags. It's the number of unique tags that is the
limiting factor on memory, not the aggregations.

BS> - One last idea: split the tag space and then merge that again. Do
BS> one planet pass to count up all known tags into a hash table. Then
BS> split this "planet schema" into several pieces by the maximum number
BS> of tags you want, and run a separate count pass on the planet
BS> reduced by the tags you care about. Heck - you could even "filter"
BS> the planet down to those tags with something like grep so the planet
BS> XML parse goes by fast in each pass.
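For anyone following along, the tag-space split could be sketched
roughly like this. Note this is just an illustration of the idea, not
the splitter's actual code - the TagCounter class and the notion of
feeding it tags as "key=value" strings are assumptions for the sake of
the example:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of splitting the tag space across passes: on pass g we only
// count tags whose hash falls into group g, so each pass only needs a
// hash table sized for roughly 1/numGroups of the unique tags.
public class TagCounter {
    private final int numGroups;
    private final int currentGroup;
    private final Map<String, Long> counts = new HashMap<>();

    public TagCounter(int numGroups, int currentGroup) {
        this.numGroups = numGroups;
        this.currentGroup = currentGroup;
    }

    // Called once per tag seen on this pass over the planet file.
    public void accept(String tag) {
        // Math.floorMod avoids negative groups from negative hashCodes.
        if (Math.floorMod(tag.hashCode(), numGroups) == currentGroup) {
            counts.merge(tag, 1L, Long::sum);
        }
    }

    public Map<String, Long> getCounts() {
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for one pass over the planet's tags.
        String[] planet = {"highway=residential", "name=Main Street",
                           "highway=residential", "building=yes"};
        int numGroups = 4;
        Map<String, Long> merged = new HashMap<>();
        // One pass per tag group; since the groups are disjoint the
        // per-pass results can simply be merged at the end.
        for (int g = 0; g < numGroups; g++) {
            TagCounter counter = new TagCounter(numGroups, g);
            for (String tag : planet) {
                counter.accept(tag);
            }
            merged.putAll(counter.getCounts());
        }
        System.out.println(merged.get("highway=residential")); // 2
        System.out.println(merged.size()); // 3
    }
}
```

Each tag lands in exactly one group, so the merge is a straight union
with no conflict handling needed - that's what makes the final glue-
together step cheap.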
BS> Either way it's the same idea...cut the dataset in a known way and
BS> glue together your counting later.

This is basically what I'd suggested too. Use the first pass over the
planet file to split the tags into several groups (e.g. using a hashcode
% desired group count). Then make further passes over the planet file,
once for each group of tags, aggregating just that group on each pass.

Experience has shown me that multiple passes over the XML are *slow*
though, even with a good pull parser. You're better off preprocessing
the data into a more suitable format on the first pass while you split
the tags; the subsequent passes will then run much faster.

To give you some numbers as a starting point: processing the full planet
with splitter on a Core i7 takes just over a 4GB heap and 3 hours. The
first pass (calculating the area subdivisions) needs the full 4GB
(though I haven't fully optimised this yet). During this pass it writes
out about 27GB of binary data. Two further passes are made over this
binary data to write out the 500 or so resultant (split) osm files (max
255 at a time). This latter stage requires less memory than the first,
and if you're willing to make additional passes then the memory required
here could easily be < 1GB without too large an impact on performance.

This image shows (badly!) how the planet gets subdivided:

http://redyeti.net/osm/3000km.png

Note the high density across the US and Europe.

Chris

_______________________________________________
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev