Hi! I, and many (okay, at least a few) others, have shown interest in the complete history data of OSM. I understand that a lot of this data is available throughout the web in the form of old snapshots and diffs, but it comes in outdated formats and is by no means complete or easy to use. I also had a look at the System Admin page on the wiki, but I don't really know whom to contact, hence this post to the mailing list.
My question is: what would have to be done to get a complete dump of the data? I have read previous requests for this data, and it seems that there is no general objection to such a dump, but that no one has written the proper tool for the job so far. As I have some free time on my hands (and about a hundred ideas/requests for the data for osmdoc), I'd be willing to at least _try_ to get something done.

There are a few questions that probably need answering first, and I hope we can start a discussion about them:
- Am I correct in assuming that there are no general objections from the OSM server folks against such a dump? (Which would render the rest of this e-mail useless ;-)
- Is anyone else currently working on this?
- Which format should the data be dumped in?
- Distribution of the data and storage space requirements
- Interval of dumps

*Format*

1) The easiest option would be to just use the PostgreSQL COPY command (http://www.postgresql.org/docs/8.3/interactive/sql-copy.html). This would produce a file suitable to be read into any other PostgreSQL database.

Pros:
- Easy to do
- Probably one of the fastest options
- Low overhead in the file format

Cons:
- As far as I know there is no way to compress the data stream, so everything would have to be written uncompressed first
- The binary format is not really portable or easy to use; it forces PostgreSQL as the target and makes it impossible to filter the data (text formats are available, though)
- Even with the text formats the data would be scattered (i.e. tags wouldn't be stored with their elements, node references wouldn't be stored with their ways, ...)
- No OSM tools support these formats

2) A dump of all changesets in OsmChange format (e.g. http://www.openstreetmap.org/api/0.6/changeset/3010332/download). As I understand it, changesets have been created for every change. I don't quite understand why the first changesets (and nodes/ways) date from sometime in 2005 and not 2004, but I bet someone here can enlighten me.
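As an aside, one nice property of this format is that it can be consumed with completely standard XML tooling. A minimal sketch (the inline sample data is made up by me for illustration; the element and attribute names follow the OsmChange 0.6 schema, with create/modify/delete actions wrapping node/way/relation elements):

```python
# Rough sketch of reading an OsmChange document with only the Python
# standard library. Real files would come from e.g.
# /api/0.6/changeset/<id>/download; the SAMPLE below is invented.
import xml.etree.ElementTree as ET

SAMPLE = """<osmChange version="0.6" generator="example">
  <create>
    <node id="1" version="1" changeset="42" lat="52.5" lon="13.4">
      <tag k="amenity" v="pub"/>
    </node>
  </create>
  <modify>
    <node id="1" version="2" changeset="43" lat="52.6" lon="13.4"/>
  </modify>
</osmChange>"""

def iter_changes(xml_text):
    """Yield (action, element_type, id, version) for every change."""
    root = ET.fromstring(xml_text)
    for action in root:          # <create>, <modify> or <delete> blocks
        for elem in action:      # <node>, <way> or <relation> entries
            yield (action.tag, elem.tag,
                   int(elem.get("id")), int(elem.get("version")))

changes = list(iter_changes(SAMPLE))
print(changes)  # -> [('create', 'node', 1, 1), ('modify', 'node', 1, 2)]
```

For real dump-sized files one would of course use a streaming parser (ET.iterparse or a SAX handler) instead of loading everything into memory, but the structure stays the same.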
Pros:
- Well-known data format; many tools can work with OsmChange
- Good if the user wants to rebuild/relive the change events, as the changesets should come roughly in the correct order/timeline
- Possibility to split the process into multiple parts (e.g. history files with 50,000 changesets each)
- Easy to update -> just add the new changesets (with the long-running transactions that are 'haunting' the diffs posing the same problem here)

Cons:
- XML file size overhead (doesn't matter that much once compressed)
- Probably a lot slower than the COPY method
- Custom code would have to be written for this export, but it shouldn't be too hard to iterate over every changeset; the necessary indexes already seem to exist
- Potentially bad if one is interested mainly in the elements themselves, as the history data for a single element could be scattered throughout the whole file

3) A dump of all OSM elements in OSM format (e.g. http://www.openstreetmap.org/api/0.6/node/60078445/history).

Pros:
- Good if the user is interested in the elements and their history rather than in the "flow" of changes
- Easily split into smaller files (nodes, ways, relations, changesets, further subdivided by id ranges or something else)
- Easy to process, although tools might not work out of the box
- Best format for rebuilding a "custom" database of OSM, as it is grouped by element rather than "arbitrarily" by changeset/date

Cons:
- XML file size overhead; custom code needed (or does Osmosis already have the ability to do this?); slower than COPY
- This format doesn't have that much tool support as far as I know (multiple versions of an element in a single file)
- Not very easy to update; the whole process would have to be redone (or changesets would have to be examined)

A few personal remarks:
- I personally favor option 3), but that is mainly because of my requirements for osmdoc.
- I don't see missing tool support as a big problem, as I suspect that the majority of the users of this data will have/want their own tools to analyze or store it (just guessing).

*Distribution and space requirements*

I really can't say much about this, as I have no idea of the size of the database or the space available on the server(s), but I hope one of the admins can tell me more. The planet has been distributed using BitTorrent in the past, so this might be a possible solution for the history dump as well, but it really is too early to tell.

*Interval of the dumps*

Theoretically only one dump would be needed, as there are now the replication diffs, which should provide every change to the database. But as they are - at the moment - only available in 'minute' format, one might dump the history regularly (whatever that means; again, this depends on the space requirements and on whether there is demand for this at all).

I have probably forgotten some important aspects/problems/points, and I hope to receive some feedback on this. I know that any "dump" program would have to be written so as not to interfere with normal operations (there is only one DB server, if I'm correct), but the current planet dump program probably gives a good indication of the load such a dump produces. Again, I have no data about this. Any pointers from the system administrators about the specifics and whom best to contact would be very welcome. Remarks about the data or its potential format (or possible uses for the data) are welcome too, of course!

Cheers,
Lars

_______________________________________________
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev