Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-12-16 Thread Jukka Rahkonen
Hi,

Have you been considering how to handle the history of old anonymous edits? 
This new history data should not reveal those user names but keep them 
anonymous.

-Jukka Rahkonen-




Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-12-16 Thread Lars Francke
 Have you been considering how to handle the history of old anonymous edits?
 This new history data should not reveal those user names but keep them 
 anonymous.

User IDs and usernames for those anonymous edits are left out of the
dump, so those elements may not have a uid or user attribute.

These are the relevant lines from the source code:
http://bitbucket.org/lfrancke/historydump/src/tip/src/main/java/org/openstreetmap/util/Dumper.java#cl-176
http://bitbucket.org/lfrancke/historydump/src/tip/src/main/java/org/openstreetmap/util/Dumper.java#cl-230
http://bitbucket.org/lfrancke/historydump/src/tip/src/main/java/org/openstreetmap/util/Dumper.java#cl-441
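In simplified form the check is basically this (only a sketch, not a
verbatim copy of those lines, and the class/method names are made up):

import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Sketch: uid/user are only written when the edit was not anonymous,
// so anonymous history versions simply lack those attributes.
class UserAttributeWriter {
    static void writeUserAttributes(XMLStreamWriter writer, Integer uid,
                                    String userName) throws XMLStreamException {
        if (uid != null && userName != null) {  // null means anonymous edit
            writer.writeAttribute("uid", uid.toString());
            writer.writeAttribute("user", userName);
        }
        // anonymous edits: no uid/user attributes at all
    }
}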



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-12 Thread Tom Hughes
On 12/11/09 16:28, Lars Francke wrote:

 I am partly done with my Java version. There are a few
 questions/problems/remarks:

Is Java really up to this job from a performance point of view?

 - Is there a dump of the database available from just prior to the
 switch from API 0.4 to 0.5? I could try to use that to merge the
 history of the segments to the ways (as briefly discussed by Frederik)

There is a dump, but it's a mysql dump so not easily readable. There may 
be a planet around somewhere as well but I don't think it will be 
synchronised to the actual shutdown time or have any history.

 - Any information on the size (in rows) of the tables would be nice
 (for testing purposes)

It should be fairly obvious for the main tables as they just contain a 
row for each object.

 - What is the default_statistics_target for the columns/tables in
 question? Are there any other options set that would affect the query
 planner? I've seen the query planner make wildly inappropriate
 decisions so I'll try to check if the statements I use will work. I
 used the same technique as planet.c and only adapted the queries to
 versions and history tables.

I'm not quite sure what you think knowing the value of that setting is 
going to help with. You only need to worry about optimising your queries 
if it turns out the planner gets them wrong but it's rarely a problem 
with Postgres especially with the kind of simple queries a dumper uses.

 - Do I have to take precautions in regards to database/machine/disk
 load? I could do something like the Auto-Vacuum daemon[2] or
 monitoring the load.

Auto vacuum is on by default these days I believe. It's not something an 
ordinary user has any control over anyway.

Tom

-- 
Tom Hughes (t...@compton.nu)
http://www.compton.nu/



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-12 Thread Lars Francke
 I am partly done with my Java version. There are a few
 questions/problems/remarks:

 Is Java really up to this job from a performance point of view?

I haven't done any performance comparisons between planet.c and my
program but I believe there won't be much of a difference. I could rip
out the history part and produce a current-planet-only program to
compare the speeds. I've taken care to select an XML writer that is
known for its performance. There is Aalto[1] but I've never used it
before so I'm hesitant to rely on it. I could put the database reading
and the XML writing in two different threads but again I'm not sure if
that'd help or hurt. I know this is quite a religious topic but I
rarely see big differences in speed between C and Java, especially in
this case, as most of the time will probably be spent in I/O.
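If I did try the two-threaded approach it would look roughly like this
(only a sketch with made-up names, not code from the dumper):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: decouple database reading from XML writing with a bounded
// queue so neither side can run too far ahead of the other.
public class TwoThreadPipeline {
    public static void main(String[] args) throws InterruptedException {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);
        final String poison = new String("END");  // end-of-input marker, compared by reference

        Thread reader = new Thread(new Runnable() {
            public void run() {
                try {
                    // real program: iterate over the JDBC ResultSet here
                    for (int i = 0; i < 100; i++) {
                        queue.put("row " + i);
                    }
                    queue.put(poison);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        Thread writer = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        String row = queue.take();
                        if (row == poison) break;  // reference comparison is intentional
                        // real program: write the XML element here
                        System.out.println(row);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}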

 - Is there a dump of the database available from just prior to the
 switch from API 0.4 to 0.5? I could try to use that to merge the
 history of the segments to the ways (as briefly discussed by Frederik)

 There is a dump, but it's a mysql dump so not easily readable. There may be
 a planet around somewhere as well but I don't think it will be synchronised
 to the actual shutdown time or have any history.

If you'd be willing to share the mysql dump (I of course wouldn't need
user- or any other sensitive data) I'd try my best. It can't hurt.
Planet won't be as useful because the history is missing.

 - Any information on the size (in rows) of the tables would be nice
 (for testing purposes)

 It should be fairly obvious for the main tables as they just contain a row
 for each object.

For the main (current_*) tables, yes. But not for the history tables.
I have no estimate of how many versions there are. I could count the
current versions of all elements but if you have a number that'd be
great.

 - What is the default_statistics_target for the columns/tables in
 question? Are there any other options set that would affect the query
 planner? I've seen the query planner make wildly inappropriate
 decisions so I'll try to check if the statements I use will work. I
 used the same technique as planet.c and only adapted the queries to
 versions and history tables.

 I'm not quite sure what you think knowing the value of that setting is going
 to help with. You only need to worry about optimising your queries if it
 turns out the planner gets them wrong but it's rarely a problem with
 Postgres especially with the kind of simple queries a dumper uses.

I've had problems with exactly this in the past: simple queries ending
up in thousands of seqscans. I just want to configure my test database
as closely as possible to the live one. I don't think there'll be
problems either but it doesn't hurt to check.

The statements will look something like this:
1)
SELECT n.id, n.version, n.timestamp, n.changeset_id, c.user_id,
n.visible, n.latitude, n.longitude
FROM nodes n
JOIN changesets c ON n.changeset_id=c.id
ORDER BY n.id, n.version

2) SELECT id, version, k, v FROM node_tags ORDER BY id, version, k

Perhaps you could just check them?
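For what it's worth, the first statement runs through JDBC roughly like
this on my side (only a sketch, connection details made up; the
important part is autocommit off plus a fetch size so the driver
streams the rows instead of buffering the whole result set):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class NodeHistoryStream {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/osm", "osm", "secret");
        conn.setAutoCommit(false);          // required for cursor-based fetching
        PreparedStatement stmt = conn.prepareStatement(
                "SELECT n.id, n.version, n.timestamp, n.changeset_id, c.user_id, "
              + "n.visible, n.latitude, n.longitude "
              + "FROM nodes n JOIN changesets c ON n.changeset_id = c.id "
              + "ORDER BY n.id, n.version");
        stmt.setFetchSize(10000);           // fetch in batches of 10,000 rows
        ResultSet rs = stmt.executeQuery();
        while (rs.next()) {
            // hand each row to the XML writer here; just print the key for the sketch
            System.out.println(rs.getLong(1) + "/" + rs.getInt(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}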

 - Do I have to take precautions in regards to database/machine/disk
 load? I could do something like the Auto-Vacuum daemon[2] or
 monitoring the load.

 Auto vacuum is on by default these days I believe. It's not something an
 ordinary user has any control over anyway.

That's not what I meant but I was unclear :)
I meant that I could use the same method as the Auto-Vacuum daemon
which pauses regularly (cost-based) to alleviate load. Just as an
example.
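Something along these lines, with made-up numbers, just to illustrate
the idea:

// Sketch of cost-based pausing, loosely modelled on the autovacuum
// cost delay: accumulate a cost per row read and sleep once a limit
// is reached. All numbers are invented for illustration.
public class CostThrottle {
    private static final int COST_PER_ROW = 1;
    private static final int COST_LIMIT = 20000;   // pause after ~20k rows
    private static final long DELAY_MS = 20;       // roughly like autovacuum_vacuum_cost_delay
    private int accumulated = 0;

    public void chargeRow() throws InterruptedException {
        accumulated += COST_PER_ROW;
        if (accumulated >= COST_LIMIT) {
            Thread.sleep(DELAY_MS);                // give the database/disks a breather
            accumulated = 0;
        }
    }
}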

Again: Until now the program just reads from the DB and dumps to the
output stream. No special concerns as to the statements or the
performance/load. I just want to pick the low hanging fruit as early
as possible and those were the questions I thought of. Some of them
(especially those about the query planner) came from problems I've
experienced with osmdoc.

In the end it is up to you (or whoever decides that) if you want to
use my program, write one from scratch or adapt planet.c to dump the
history. I don't really care either way as long as the end result is
that we eventually have historical OSM data ;-)

Lars



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-12 Thread Matt Amos
On Thu, Nov 12, 2009 at 4:28 PM, Lars Francke lars.fran...@gmail.com wrote:
 - As of now the XML is not indented. I use Woodstox[1] for XML output
 and that doesn't have an option to pretty print the output. It is
 not a problem for me but if it is requested I can use StaxMate or
 something else to properly indent the XML

i'm pretty sure no-one minds. as you say, anyone who wants it indented
can easily do it with xmllint and friends.

 - Changesets: num_changes from the database isn't dumped in planet.c.
 It is queried from the database but not used anywhere. The data _can_
 be calculated but it isn't that easy if not using the standard db
 schema and not easily done by reading the XML stream. I could just
 dump it too. I haven't had a look at the API if this field is set
 correctly at all?!

it should be set correctly. you're welcome to dump it out on the
changesets if you think it's useful.

 - I'm using the same technique as planet.c in regards to the output of
 the data (just streaming it to standard output), I just assume that
 this is okay? Are there any other things I'll have to change in
 comparison to the way planet.c works?

yeah. the output will be piped directly into pbzip2, most probably.

cheers,

matt



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-12 Thread Tom Hughes
On 12/11/09 17:37, Lars Francke wrote:

 There is a dump, but it's a mysql dump so not easily readable. There may be
 a planet around somewhere as well but I don't think it will be synchronised
 to the actual shutdown time or have any history.

 If you'd be willing to share the mysql dump (I of course wouldn't need
 user- or any other sensitive data) I'd try my best. It can't hurt.
 Planet won't be as useful because the history is missing.

The problem is we'll have to load the dump into mysql to remove the 
sensitive data...

 For the main (current_*) tables, yes. But not for the history tables.
 I have no estimate of how many versions there are. I could count the
 current versions of all elements but if you have a number that'd be
 great.

Approximate row counts:

nodes - 860 million
ways - 72 million
relations - 1.4 million

 The statements will look something like this:
 1)
 SELECT n.id, n.version, n.timestamp, n.changeset_id, c.user_id,
 n.visible, n.latitude, n.longitude
 FROM nodes n
 JOIN changesets c ON n.changeset_id=c.id
 ORDER BY n.id, n.version

 2) SELECT id, version, k, v FROM node_tags ORDER BY id, version, k

 Perhaps you could just check them?

They should be fine - the sort means they will take a while to start 
returning data but they're not doing anything silly.

Tom

-- 
Tom Hughes (t...@compton.nu)
http://www.compton.nu/



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-12 Thread Lars Francke
A quick status update and a link to the code.

- I decided to dump num_changes too

- One thing that startled me: planet.c converts _all_ relation member
roles to lower case before dumping them. I'd consider this a bug but
I'm sure there is a reason for it. Considering that neither the API
nor Osmosis nor the mysql version of planet.c does this, it is bound
to create inconsistencies. I had a look at the documentation in the
wiki and couldn't find a reference to roles having to be lower case,
so I decided to dump them as they are in the database.

- I'm done with the basic functionality. I've run _very limited_ tests
but I plan to generate test data tomorrow to see how it fares (thanks
Tom for the row counts)

The source can be found at: http://bitbucket.org/lfrancke/historydump/
It is written in Java and uses Maven2. A simple "mvn package" should
build a jar file but I can upload one if necessary. It has three
dependencies: Woodstox, the PostgreSQL JDBC driver and Apache Commons
CLI. Any feedback is welcome.
Run it like this:
java -jar historydump-1.0-SNAPSHOT-jar-with-dependencies.jar --help
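For a full dump the plan (as Matt said) would be to pipe the output
straight into pbzip2, along the lines of (file name and database
options just an example):
java -jar historydump-1.0-SNAPSHOT-jar-with-dependencies.jar | pbzip2 > full-history.osm.bz2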

I'll write again once I've tested this more thoroughly and then the
decision will be in your hands :)

Good Night,
Lars



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-12 Thread Lars Francke
 There is a dump, but it's a mysql dump so not easily readable. There may
 be
 a planet around somewhere as well but I don't think it will be
 synchronised
 to the actual shutdown time or have any history.

 If you'd be willing to share the mysql dump (I of course wouldn't need
 user- or any other sensitive data) I'd try my best. It can't hurt.
 Planet won't be as useful because the history is missing.

 The problem is we'll have to load the dump into mysql to remove the
 sensitive data...

That's what occurred to me too after I sent the mail. This is a
nice-to-have but by no means a necessity. Please let me know if I can
be of any assistance as I'm very interested in the data. Otherwise
I'll just bug you every few months ;-)

If Matt is correct when he says that we'd need a full dump anyway on a
license change, wouldn't that include this old mysql data? I may be
far off here. License issues are way over my head.

Cheers,
Lars



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-11 Thread Matt Amos
On Wed, Nov 11, 2009 at 6:41 AM, Lars Francke lars.fran...@gmail.com wrote:
 There are a few questions that probably need answering first and I
 hope we can start a discussion about this:
 - Am I correct in assuming that there are no general objections from
 the OSM server folks against such a dump? (Which would render the rest
 of this E-Mail useless ;-)

the response has always been "if someone writes it, and it's good,
we'll run it" :-)

 - Is anyone else currently working on this?

for some values of working, yes. it's on my list of things to do for
the license change plan - clearly we'll need a full data dump before
we can re-license.

 - Which format should the data be dumped in

(3) is the easiest to get done and most easily supported, in my opinion.

 - Distribution of the data and storage space requirements

i have a feeling that the data, while big, won't be so big that the
usual method of planet.osm.org + heanet mirror won't work.

 - Interval of dumps

based on back-of-the envelope calculations, a full dump in planet
format would take something like 7-10 days to do in parallel with
normal server activity. so it couldn't be run every week and would
probably be cumbersome to do every month. in my opinion, we should be
looking at every 3-6 months.

 3) A dump of all OSM elements in OSM format
 (http://www.openstreetmap.org/api/0.6/node/60078445/history)

this is my favourite method as well. the easiest approach would be to
modify planet.c to dump the full history, instead of just the
current_* tables.

note that brett has been working on option (2) by using osmosis to
dump historical diffs going back to the very inception of the
database. you can see the experimental results in
http://planet.openstreetmap.org/history/

for my money, if we do both (2) and (3), then we cater for all
consumers, and in a standard format. the output of the COPY command,
while good for backups, isn't really suited to dumping the information
that we have in the planet (given there will be edits by users who are
still not public, etc...)

if you want to get started hacking on planet.c then i'm happy to help.
otherwise i'm hoping to get around to it by the end of the month, but
there are never any guarantees ;-)

cheers,

matt



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-11 Thread Frederik Ramm
Hi,

Lars Francke wrote:
 I understand that a lot of this data is
 available throughout the web in old snapshots and diffs but this
 comes in outdated formats and is by no means complete or easy to use.

Keep in mind that while a full database dump will give you some things
that are not in the old planet files, the reverse is true as well -
there is information in the old planet files (pre-0.5) that is not in
the database and thus will not be part of a history dump. Specifically
this applies to pre-0.5 way history. This is not really a big deal
except for those who would hope to make OSM history animations going
back farther than API 0.5.

Bye
Frederik




Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-11 Thread Lars Francke
 I understand that a lot of this data is
 available throughout the web in old snapshots and diffs but this
 comes in outdated formats and is by no means complete or easy to use.

 Keep in mind that while a full database dump will give you some things that
 are not in the old planet files, the reverse is true as well - there is
 information in the old planet files (pre-0.5) that is not in the database
 and thus will not be part of a history dump. Specifically this applies to
 pre-0.5 way history. This is not really a big deal except for those who
 would hope to make OSM history animations going back farther than API 0.5.

I had not thought of that. When I first used OSM, segments were long
gone so I tend to forget those. There are a few planet dumps from
those times but none before 060403 and no diffs (which would be
required to fully reconstruct the history). While I'm of course
interested in the most complete history possible I don't know if this
data would be easy to integrate. As I understand it all the old
segments that weren't part of a way were converted to a way and all
0.4-ways were just migrated to 0.5-ways (segments to node reference).
I suppose segments that were part of at least one way were not
converted to a way?

So we'd need to (a rough sketch of what I mean follows below):
- Find the segments for previously unwayed segment-ways and
incorporate their history into the way. As the new way starts at
version 1 this would have to be a hack (version 0, counting backwards
or something like that).
- For ways we'd need to find the history data of the segments they
were made of and merge that into the history of the way, again
requiring versions before 1.
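Very roughly, the version-0-and-below hack I have in mind would look
like this (a pure sketch, all names made up):

import java.util.ArrayList;
import java.util.List;

// Sketch: prepend the segment-derived history of a way as synthetic
// versions numbered 0, -1, -2, ... in front of its real history.
class SyntheticHistory {
    static class WayVersion {
        int version;
        // timestamp, node references, tags, ...
    }

    static List<WayVersion> prepend(List<WayVersion> realHistory,
                                    List<WayVersion> segmentDerived) {
        List<WayVersion> result = new ArrayList<WayVersion>();
        int v = 0;
        // segmentDerived is assumed to be ordered oldest-first
        for (int i = segmentDerived.size() - 1; i >= 0; i--) {
            WayVersion synthetic = segmentDerived.get(i);
            synthetic.version = v--;     // newest synthetic version is 0, older ones -1, -2, ...
            result.add(0, synthetic);    // keep the result ordered oldest-first
        }
        result.addAll(realHistory);      // the real versions start at 1 as usual
        return result;
    }
}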

I don't know if it is worth the trouble but on the other hand it would
be nice to have a complete history, especially as this would have to
be done only once. But as I said: my knowledge of pre-0.5 times is
limited at best and I'd be happy if you or someone else could tell me
whether what I wrote makes sense. I'd certainly be willing to have _a
look_ at this, too.

Thanks for pointing this out!

Lars



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-11 Thread Lars Francke
Andy,

 Ultimately the formation of a mini project is probably needed. Input from
 those like yourself willing to work on it and the will and time from others
 who would need to support the work, including sysadmins.

I had hoped that this could be (re-)solved rather unbureaucratically
and without the need to involve the wiki :)
The wiki tends to drag discussions out.

 I'm sure its all possible, but like so many things in OSM it also has to be
 practical and realistic to have any real chance of gathering momentum.

Your answer and the others suggest to me that it is practical and
realistic - thanks! At the moment I'm fully motivated to do what I
can/need to get the data I want. Hopefully that's enough momentum.

Cheers,
Lars



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-11 Thread Frederik Ramm
Hi,

Lars Francke wrote:
 I had not thought of that. When I first used OSM, segments were long
 gone so I tend to forget those. There are a few planet dumps from
 those time but none before 060403 and no diffs (which would be
 required to fully reconstruct the history). 

No, diffs only show the changes between two points in time, not what 
happened in between; so they cannot be used to fully reconstruct history.

 While I'm of course
 interested in the most complete history possible I don't know if this
 data would be easy to integrate. 

No, it would probably be hard.

 As I understand it all the old
 segments that weren't part of a way were converted to a way and all
 0.4-ways were just migrated to 0.5-ways (segments to node reference).
 I suppose segments that were part of at least one way were not
 converted to a way?

I think so too, but I am unsure what happened to tagged segments. Also
we used to have lots of unordered ways where a number of segments were
part of a way but not in a sorted order. Sometimes they weren't even
contiguous and thus had to be split into multiple ways when 0.5 was
introduced. Also, we used this deliberately to model areas with holes
(two chains of segments, one clockwise, one counter-clockwise, being
part of the same way - that was your area with a hole!). These would
have to be retro-fitted into multipolygons for every point in history.
It is not difficult to do it once but to do it for two points in
history and hope to assign the resulting changes to the same virtual
relation id is... a challenge.

 - For ways we'd need to find the history data of the segments they
 were made of and merge that into the history of the way, again
 requiring versions before 1

Yes, we briefly thought about something like that when we did the 
0.4-0.5 migration (a synthesized history if you will) but dropped the 
idea due to its complexity.

Bye
Frederik



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-11 Thread Lars Francke
 - Am I correct in assuming that there are no general objections from
 the OSM server folks against such a dump? (Which would render the rest
 of this E-Mail useless ;-)

 the response has always been if someone writes it, and it's good,
 we'll run it :-)

That's all I wanted to hear :)

 (3) is the easiest to get done and most easily supported, in my opinion.

Once more: That's all I wanted to hear!

 - Distribution of the data and storage space requirements

 i have a feeling that the data, while big, won't be so big that the
 usual method of planet.osm.org + heanet mirror won't work.

I'll have to rely on your word there but that sounds good too. One
thing less to worry about.

 - Interval of dumps

 based on back-of-the envelope calculations, a full dump in planet
 format would take something like 7-10 days to do in parallel with
 normal server activity. so it couldn't be run every week and would
 probably be cumbersome to do every month. in my opinion, we should be
 looking at every 3-6 months.

Sounds reasonable enough. I don't know how much demand there is for
this data anyway.

 note that brett has been working on option (2) by using osmosis to
 dump very historical diffs going back to the inception of the
 database. you can see the experimental results in
 http://planet.openstreetmap.org/history/

 for my money, if we do both (2) and (3), then we cater for all
 consumers, and in a standard format. the output of the COPY command,
 while good for backups, isn't really suited to dumping the information
 that we have in the planet (given there will be edits by users who are
 still not public, etc...)

Indeed, (2) _and_ (3) would be the best solution! I had noticed the
history diffs but didn't know their status. Thanks Brett for
clarifying it!

 if you want to get started hacking on planet.c then i'm happy to help.
 otherwise i'm hoping to get around to it by the end of the month, but
 there are never any guarantees ;-)

The last time I programmed in C was quite a while back (same goes
for C++, which is used by the postgres part of the dump program if I'm
not mistaken)... so I'll have a look at it but I'm more comfortable
with Java (or Python, Erlang, Ruby, ...). So I'll see what I can do
and inform you about my progress. The worst that can happen is that we
have two working solutions for the same problem. Not too bad :) I had
brief discussions with Brett about Osmosis and incorporating certain
changes into it so I've spent quite some time in its source code.
Having said that: I probably won't program this as a new task for
Osmosis but as a standalone program as this probably won't be used
widely and doesn't justify the extra work required to incorporate this
into Osmosis.

Thanks for your response. I'm hopeful now that this can be done!

Lars



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-11 Thread Matt Amos
On Wed, Nov 11, 2009 at 1:29 PM, Lars Francke lars.fran...@gmail.com wrote:
 I had
 brief discussions with Brett about Osmosis and incorporating certain
 changes into it so I've spent quite some time in its source code.
 Having said that: I probably won't program this as a new task for
 Osmosis but as a standalone program as this probably won't be used
 widely and doesn't justify the extra work required to incorporate this
 into Osmosis.

just remember that new code = new bugs ;-)

cheers,

matt



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-11 Thread Lars Francke
 I had not thought of that. When I first used OSM, segments were long
 gone so I tend to forget those. There are a few planet dumps from
 those times but none before 060403 and no diffs (which would be
 required to fully reconstruct the history).

 No, diffs only show the changes between two points in time, not what
 happened in between; so they cannot be used to fully reconstruct history.

That's why I like the new replicate diffs.

Has anyone done - and kept - a complete database dump before migrating
from 0.4 to 0.5 or is the history (at least partially) lost?

 As I understand it all the old
 segments that weren't part of a way were converted to a way and all
 0.4-ways were just migrated to 0.5-ways (segments to node reference).
 I suppose segments that were part of at least one way were not
 converted to a way?

 I think so too, but I am unsure what happened to tagged segments. Also we
 used to have lots of unordered ways where a number of segments were part
 of a way but not in a sorted order. Sometimes they weren't even contiguous
 and thus had to be split into multiple ways when 0.5 was introduced. Also, we
 used this deliberately to model areas with holes (two chains of segments,
 one clockwise, one counter-clockwise, being part of the same way - that was
 your area with a hole!). These would have to be retro-fitted into
 multipolygons for every point in history. It is not difficult to do it once
 but to do it for two points in history and hope to assign the resulting
 changes to the same virtual relation id is... a challenge.

Thanks for the history tour. I didn't know all that. Sounds a bit like
the Wild West of OSM :)
But I can't quite follow the multipolygon problem. I thought that
_every_ old segment had been migrated to 0.5 in one way [sic!] or
another and that I would only prepend the history I can find to these
existing ways. Am I thinking too simply here?

 - For ways we'd need to find the history data of the segments they
 were made of and merge that into the history of the way, again
 requiring versions before 1

 Yes, we briefly thought about something like that when we did the 0.4-0.5
 migration (a synthesized history if you will) but dropped the idea due to
 its complexity.

I can certainly appreciate that decision. I just want to understand
what would be needed and decide if it is worthwhile to do something
about it or not.

Lars



Re: [OSM-dev] Complete history of OSM data - questions and discussion

2009-11-11 Thread Andy Robinson (blackadder-lists)
Lars Francke [mailto:lars.fran...@gmail.com]
Sent: 11 November 2009 1:16 PM
To: Andy Robinson (blackadder-lists)
Cc: OpenStreetMap Dev
Subject: Re: [OSM-dev] Complete history of OSM data - questions and
discussion

Andy,

 Ultimately the formation of a mini project is probably needed. Input from
 those like yourself willing to work on it and the will and time from others
 who would need to support the work, including sysadmins.

I had hoped that this could be (re-)solved rather unbureaucratically
and without the need to involve the wiki :)
The wiki tends to drag discussions out.

The community likes to be kept informed of developments and there may be
others out there who wish to help with stuff or have a real interest in the
form and function of what is done. Only a small part of the OSM community
reads the dev list. Putting stuff on the wiki doesn't necessarily mean you
are asking for discussion; rather, it's a means of communicating what you
are doing and providing a conduit for the community to give feedback.

Cheers

Andy




[OSM-dev] Complete history of OSM data - questions and discussion

2009-11-10 Thread Lars Francke
Hi!

I and many (okay, at least a few) others have shown interest in the
complete history data of OSM. I understand that a lot of this data is
available throughout the web in old snapshots and diffs but this
comes in outdated formats and is by no means complete or easy to use. I
also had a look at the System Admin page on the wiki but I don't
really know whom to contact, thus this post on the mailing list.

My question would be what would have to be done for a complete dump of
the data. I read previous requests for this data and it seems as if
there is no general objection to such a dump but that no one has
written the proper tool for the job so far. As I have some free time
on my hands (and about a hundred ideas/requests for the data for
osmdoc) I'd be willing to at least _try_ to get something done.

There are a few questions that probably need answering first and I
hope we can start a discussion about this:
- Am I correct in assuming that there are no general objections from
the OSM server folks against such a dump? (Which would render the rest
of this E-Mail useless ;-)
- Is anyone else currently working on this?
- Which format should the data be dumped in
- Distribution of the data and storage space requirements
- Interval of dumps

* Format *
1) The easiest would be to just use the PostgreSQL COPY command
(http://www.postgresql.org/docs/8.3/interactive/sql-copy.html). This
would produce a file suitable to be read into any other PostgreSQL
database (a small sketch of how this could look through the JDBC
driver follows after the cons list).

Pros:
- Easy to do
- Probably one of the fastest options
- Low overhead in the file formats

Cons:
- As far as I know there is no way to compress the data stream so
everything would have to be written uncompressed first
- The binary format is not really portable or easy to use: it forces
PostgreSQL as the target and doesn't allow filtering the data (text
formats are available, though)
- Even using text formats the data would be scattered (i.e. tags
wouldn't be stored with the elements, node references wouldn't be
stored with the ways, ...)
- No OSM tools for these formats
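For illustration, the JDBC driver already exposes COPY, so a dump along
these lines would be possible (untested sketch, connection details and
file name made up):

import java.io.FileOutputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CopyDump {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/osm", "osm", "secret");
        // the PostgreSQL JDBC driver exposes COPY through CopyManager
        CopyManager copy = ((PGConnection) conn).getCopyAPI();
        FileOutputStream out = new FileOutputStream("nodes.copy");
        copy.copyOut("COPY nodes TO STDOUT", out);   // text-format dump of one table
        out.close();
        conn.close();
    }
}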

2) A dump of all changesets in OsmChange format (e.g.
http://www.openstreetmap.org/api/0.6/changeset/3010332/download ). As
I understand it changesets have been created for every change. I don't
quite understand why the first changeset (and nodes/ways) comes from
sometime in 2005 and not 2004 but I bet someone here can enlighten me.

Pros:
- Well known data format, many tools can work with OsmChange
- Good if the user wants to rebuild/relive the change events as the
Changesets should come roughly in the correct order/timeline
- Possibility to split the process into multiple parts (e.g. history
files with 50,000 changesets each)
- Easy to update - just add the new changesets (though the long-running
transactions that are 'haunting' the diffs pose the same problem here)

Cons:
- XML file size overhead (doesn't matter that much compressed)
- Probably a lot slower than the COPY method
- Custom code would have to be written to do this export but it
shouldn't be too hard to iterate over every changeset. The necessary
indexes already seem to exist
- Potentially bad if one is interested mainly in the elements
themselves, as the history data for a single element could be
scattered throughout the whole file

3) A dump of all OSM elements in OSM format
(http://www.openstreetmap.org/api/0.6/node/60078445/history)
Pros:
- Good if the user is interested in the elements and their history and
not the flow of changes
- Easily split into smaller files (nodes, ways, relations, changesets,
further subdivided by id ranges or something else)
- Easy to process although tools might not work out of the box
- Best format to rebuild a custom database of OSM as it is grouped
by element and not rather arbitrarily by changeset/date

Cons:
- XML file size overhead, custom code needed (or does Osmosis already
have the ability to do this?), slower than COPY
- This format doesn't have that much tool support as far as I know
(multiple versions of an element in a single file)
- Not very easy to update, the whole process would have to be redone
(or changesets would have to be examined)


A few personal remarks:
- I personally favor option 3) but that is mainly because of my
requirements for osmdoc.
- I don't see missing tool support as a big problem as I suspect that
the majority of the users of this data will have/want their own tools
to analyze or store the data (just guessing).


*Distribution and space requirements*
I really can't say much about this as I have no idea of the size of
the database or the space available on the server(s). But I hope one
of the admins can tell me more about this. The planet has been
distributed using BitTorrent in the past so this might be a possible
solution for the history dump but it really is too early to tell.

*Interval of the dumps*
Theoretically only one dump would be needed as there are now the
replicate diffs which should provide every change to the database. But
as they are - at the moment - only available in 'minute'