Paul Houle said:

>    That said,  my new strategy for dealing with "large dump files" is
> to cut the file into segments (like 'split') and recompress the
> fragments.  If your processing chain allows it,  this can be a powerful
> way to get a concurrency speedup.  If more dump files were published in
> this format,  we could get the benefits of "parallel compression"
> without the cost.
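
(For concreteness, here's a minimal Python sketch of that split-and-recompress
idea. The input file name, segment size, and use of multiprocessing are my own
assumptions, not Paul's actual tooling; splits fall at arbitrary byte
boundaries, just like 'split -b' would.)

import bz2
from multiprocessing import Pool

SEGMENT_BYTES = 256 * 1024 * 1024   # arbitrary segment size, like 'split -b'

def split_dump(path):
    """Cut the uncompressed dump into fixed-size segments on disk."""
    names = []
    with open(path, 'rb') as src:
        i = 0
        while True:
            data = src.read(SEGMENT_BYTES)
            if not data:
                break
            name = '%s.part%04d' % (path, i)
            with open(name, 'wb') as out:
                out.write(data)
            names.append(name)
            i += 1
    return names

def recompress(name):
    """bzip2 one segment; each call runs in its own worker process."""
    with open(name, 'rb') as src, bz2.BZ2File(name + '.bz2', 'wb') as dst:
        dst.write(src.read())
    return name + '.bz2'

if __name__ == '__main__':
    parts = split_dump('planet.osm')      # hypothetical input file name
    with Pool() as pool:                  # one worker per CPU core by default
        print(pool.map(recompress, parts))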

This reminds me of an excellent solution to a similar problem
<http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html>
that may be applicable to dealing efficiently with the planet.osm file.
It comes from dealing with the similarly sized English-language Wikipedia
bzip2 dump.

Basically you split the file into chunks, as you've already done, but in
addition you build an index recording the first complete entry in each
chunk.  A lookup then touches only the index and a single chunk, instead
of decompressing the whole file, which gives you near-constant-time
searching of a huge compressed file (a rough sketch follows below).
Piping the output of bzcat to osmarender, by contrast, means scanning and
decompressing the entire file every time.  For Wikipedia at least the
entries are self-contained and in alphabetical order, so this works.
It's a great idea and allows a really fast offline Wikipedia reader using
all open source tools.  Conceivably someone could adapt it to work with
the planet.osm file more quickly.
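
To make the idea concrete, here's a rough Python sketch of the
index-building step.  The chunk file names and the regex used to spot the
first complete entry are assumptions on my part -- for the Wikipedia dump
the key would be an article title, for OSM data something like a node ID.

import bz2
import glob
import re

FIRST_ENTRY = re.compile(r'<node id="(\d+)"')   # assumed entry marker

def first_key(chunk_path):
    """Return the key of the first complete entry found in one chunk."""
    with bz2.open(chunk_path, 'rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            m = FIRST_ENTRY.search(line)
            if m:
                return int(m.group(1))
    return None

def build_index(pattern='chunk-*.bz2'):
    """List (first_key, chunk_file) pairs for every chunk, in file order."""
    index = []
    for path in sorted(glob.glob(pattern)):
        key = first_key(path)
        if key is not None:
            index.append((key, path))
    return index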

Now there are probably several big reasons the concept wouldn't work with
the planet.osm file; I don't know a thing about its internal organization,
so I can't say...

But perhaps there's some amount of data locality that can be exploited to
make this work.  If there's at least one type of information we can use to
seek through the file and find, say, a country or a boundary of some sort,
then it could be possible.
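
Continuing the sketch above, and assuming the entries really are sorted by
one such key across the chunks (true for the alphabetical Wikipedia dump,
unverified for planet.osm), the lookup side would only ever decompress a
single chunk:

import bz2
from bisect import bisect_right

def find_chunk(index, key):
    """Pick the one chunk whose first key is <= the key we want."""
    firsts = [k for k, _ in index]
    pos = bisect_right(firsts, key) - 1
    return index[pos][1] if pos >= 0 else None

def fetch_entry(index, key):
    """Decompress a single chunk and scan it for the wanted entry."""
    path = find_chunk(index, key)
    if path is None:
        return None
    needle = '<node id="%d"' % key            # same assumed marker as above
    with bz2.open(path, 'rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            if needle in line:
                return line.strip()
    return None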

Your post reminded me of the Wikipedia dump solution, so I thought I'd
mention it.

Regards,
-DC
